SlideShare a Scribd company logo
Speedup your Analytics:
Automatic Parameter Tuning for
Databases and Big Data Systems
VLDB 2019 Tutorial
Jiaheng Lu, University of Helsinki
Yuxing Chen, University of Helsinki
Herodotos Herodotou, Cyprus University of Technology
Shivnath Babu, Duke University / Unravel Data Systems
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 2
Outline
Motivation and Background
History and Classification
Parameter Tuning on Databases
Parameter Tuning on Big Data Systems
Applications of Automatic Parameter Tuning
Open Challenges and Discussion
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 3
Modern applications are being built on a
collection of distributed systems
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 4
But:
Running distributed applications
reliably & efficiently is hard
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 5
My app failed
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 6
My data pipeline is missing SLA
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 7
My cloud cost is out of control
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 8
There are many challenges
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 9
Self-driving Systems
Automate physical data layout
Index and view
recommendation
Knob tuning, buffer
pool sizes, etc
Plan
optimization
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 10
Different Optimization Levels of Self-driving Systems
The focus of
this tutorial
Automate physical data layout
Index and view
recommendation
Knob tuning, buffer
pool sizes, etc
Plan
optimization
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 11
Effectiveness of Knob (Parameter) Tuning
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 12
Selected Performance-aware Parameters in PostgreSQL
Parameter Name Description Default value
bgwriter_lru_maxpages Max number of buffers written by the background writer 100
checkpoint_segments Max number of log file segments between WAL checkpoints 3
checkpoint_timeout Max time between automatic WAL checkpoints 5 min
deadlock_timeout Waiting time on locks for checking for deadlocks 1 sec
effective_cache_size Size of the disk cache accessible to one query 4 GB
effective_io_concurrency Number of disk I/O operations to be executed concurrently 1 or 0
shared_buffers Memory size for shared memory buffers 128MB
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 13
Selected Performance-aware Parameters in Hadoop
Parameter Name Description Default value
dfs.block.size The default block size for files stored HDFS 128MB
mapreduce.map.tasks Number of map tasks 2
mapreduce.reduce.tasks Number of reduce tasks 1
mapreduce.job.reduce
.slowstart.completedmaps
Min percent of map tasks completed before scheduling
reduce tasks
0.05
mapreduce.map.combine
.minspills
Min number of map output spill files present for using the
combine function
3
mapreduce.reduce.merge.in
mem.threshold
Max number of shuffled map output pairs before initiating
merging during the shuffle
1000
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 14
Selected Performance-aware Parameters in Spark
Parameter Name Description Default value
spark.driver.cores Number of cores used by the Spark driver process 1
spark.driver.memory Memory size for driver process 1 GB
spark.sql.shuffle.partitions Number of tasks 200
spark.executor.cores The number of cores for each executor 1
spark.files.maxPartitionBytes Max number of bytes to group into one partition
during file reading
128MB
spark.memory.fraction Fraction for execution and storage memory. It may
cause frequent spills or cached data eviction if given a
low fraction
0.6
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 15
Challenge 1: A Huge Number of Systems
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 16
Challenge 2: Many Parameters in a Single System
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 17
Challenge 3: Diverse Workloads and System Complexity
An example of Spark framework
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 18
Franco Pepe Chef:
“There is no pizza recipe. Every time the dough was
made there were no scales, recipes, machinery.”
There is no knob tuning recipe. Every time, we
need to configure the parameters based on the
bottleneck of different jobs and environment.
-- VLDB 2019 tutorial
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 19
Running Examples of Parameter Tuning (Hadoop)*
Workloads
➢ Terasort: Sort a terabyte of data
➢ N-gram: Compute the inverted list of N-gram data
➢ PageRank: Compute pagerank of graphs
Hadoop platform with MapReduce
* Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang:
MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs.
PVLDB 7(13): 1319-1330 (2014)
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 20
Running Examples of Parameter Tuning (Hadoop)
Problem: Given a MapReduce (or Spark) job with input data and running
cluster, we want to find the setting of parameters that optimize the execution
time of the job (i.e., minimize the job execution time)
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 21
Tuned Key Parameters in Hadoop
Parameter Name Description
MapInputSplit Split number for map jobs
MapOutputBuffer Buffer size of map output
MapOutputCompression Whether the map output data is compressed
ReduceCopy Time to start the copy in Reduce phase
ReduceInputBuffer Input buffer size of Reduce
ReduceOutput Reduce output block size
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 22
Impact of Parameters on Selected Jobs
TeraSort Job N-gram Job PageRank Job
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 23
Comparison between Hadoop-X and MRTuner with
Different Parameters
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 24
50 Years of Knob Tuning
Rule-based &
DBA guideline
(1960s)
Cost-model
(1970s)
Experiments-driven
(1980s)
Machine learning (2007)
Adaptive model
(2006)
Simulation
(2000s)
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 25
Classification of Existing Approaches
Approach Main Idea
Rule-based Based on the experience of human experts
Cost Modeling Using statistical cost functions
Simulation-based Modular or complete system simulation
Experiment-driven Execute an experiment with different parameter settings
Machine Learning Employ machine learning methods
Adaptive Tune configuration parameters adaptively
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 26
Rule-based Approach
➢ Assist users based on the experience of human experts
Parameter Name Default Description Recommendation
dfs.replication in
HDFS
3 Lower it to reduce
replication cost.
Higher replication can make
data local to more workers,
but more space overhead
IF (Running time is
the critical goal and
enough space)
Set 5
Otherwise
Set 3
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 27
Cost Modeling Approach
➢ Build performance prediction models by using statistical cost functions
Cost Constants
Cost Formulas
Cost Estimation
Operations
Parameter Values
Statistics
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 28
Simulation-based Approach
➢ Build performance models based on modular or complete simulation
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 29
Experiment-driven Approach
➢ Execute the experiments repeatedly with different parameter settings,
guided by a search algorithm
Input knobs
Recommended knobs
Goal
Experiments
Exploit knobs
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 30
Machine Learning Approach
➢ Establish performance models by employing machine learning methods
➢ Consider the complex system as a whole and assume no knowledge of
system
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 31
Adaptive Approach
➢ Tune parameters adaptively while an application is running
➢ Adjust the parameter settings as the environment changes
Online
CLOT (2006) strategy
New query
New environment
Self-tuning Module
Recommend
index selection
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 32
Outline
Motivation and Background
History and Classification
Parameter Tuning on Databases
Parameter Tuning on Big Data Systems
Applications of Automatic Parameter Tuning
Open Challenges and Discussion
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 33
What and How to Tune?
➢ What to configure?
❖ Which parameters (knobs)?
❖ Which are most important?
➢ How to tune (to best throughput)?
❖ Increase buffer size?
❖ More parallelism on writing?
I am a database
I am running queries
Figure. Tuning guitar knobs to right notes (frequencies)
Run faster?
Higher throughput?
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 34
What to Tune – Some Important Knobs for throughput
Parameter Name Brief Description and Use Deafult
bgwriter_delay Background writer’s delay between activity rounds 200ms
bgwriter_lru_maxpages Max number of buffers written by the background
writer
100
checkpoint_segments Max number of log file segments between WAL
checkpoints
3
checkpoint_timeout Max time between automatic WAL checkpoints 5min
deadlock_timeout Waiting time on locks for checking for deadlocks 1s
default_statistics_target Default statistics target for table columns 100
effective_cache_size Effective size of the disk cache accessible to one
query
4GB
shared_buffers Memory size for shared memory buffers 128MB
Memory
Cache
Timeout
Settings
Threads
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 35
What are the Important Parameters and How to Choose
➢ Affect the performance most (manually)
❖ Based on expert experiences
❖ Default documentation
If you want higher throughput,
better tuning memory-related
parameters
Performance-sensitive
parameters are important!
Parameters have strong
correlation to performance are
important!
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 36
What are the Important Parameters and How to Choose
➢ Affect the performance most
➢ Strongest correlation between parameters and objective function (model)
❖ Linear regression model for independent parameters:
❑ Regularized version of least squares – Lasso (OtterTune 2017)
✔ Interpretable, stable, and computationally efficient with higher dimensions
❖ Deep learning model (CBDTune 2019)
❑ The important input parameters will gain higher weights in training
Weights Knobs
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 37
How to Tune – Key Tuning Goals
➢ Avoidance: to identify and avoid error-prone configuration settings
➢ Ranking: to rank parameters according to the performance impact
➢ Profiling: to classify and store useful log information from previous runs
➢ Prediction: to predict the database or workload performance under
hypothetical resource or parameter changes
➢ Tuning: to recommend parameter values to achieve objective goals
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 38
How to Tune – Tuning Methods
Methods Approach Methodology Target Level
Rule-based SPEX (2013) Constraint inference Avoidance
Xu (2015) Configuration navigation Ranking
Cost-model STMM (2006) Cost model Tuning
Simulation-
based
Dushyanth (2005) Trace-based simulation Prediction
ADDM (2005) DAG model & simulation Profiling, tuning
Experiment
driven
SARD (2008) P&B statistical design Ranking
iTuned (2009) LHS & Guassian Process Profiling, tuning
Machine
Learning
Rodd (2016) Neural Networks Tuning
OtterTune (2017) Guassian Process Ranking, tuning
CDBTune (2019) Deep RL Tuning
Adaptive COLT (2006) Cost Vs. Gain analysis Profiling, tuning
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 39
Relational Database Tuning Methods
Rule-based
Cost Modeling
Simulation-based
Experiment-driven
Machine Learning
Covered #knobs
Training cost
Time
Rule-based
Simulation-based
Experiment-driven
Machine Learning
Adaptive
Cost Modeling
Figure. Required expert knowledge on systemFigure. Developing trend: putting more
training cost to uncover more knobs
2005 2017
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 40
Tuning Method: Rule-based
➢ Tuning based on rules derived from DBAs’ expertise, experience,
and knowledge, or Rule of Thumb default recommendation
Expert
Guarantee cache
memory to accelerate
queries …
Rule of Thumb
Better not change the
deadlock timeout if …
Documents
Default settings work
most of the cases …Rules
Trusted
Parties
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems
➢ A cost model establishes a performance model by cost functions
based on the deep understanding of system components
Cost Model
41
Tuning Method: Cost Modeling
Cost Constants
Cost Formulas
Cost Estimate
Operations
Parameter Values
Statistics
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 42
Tuning Method: Cost Modeling (STMM)
➢ STMM: Adaptive Self-Tuning Memory in DB2 (2006)
❖ Reallocates memory for several critical components(e.g., compiled
statement cache, sort, and buffer pools)
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 43
Tuning Method: Simulation-based
➢ A simulation-based approach simulates workloads in one
environment and learns experience or builds models to predict
the performance in another.
Running job here is (1) expensive
or (2) slowdown concurrent jobs or
(3)…
Simulate it in small
environment with tiny
portion of data …
Often product environment Often test environment
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 44
Tuning Method: Experiment-driven
➢ An experiment-driven approach relies on repeated executions of
the same workload under different configuration settings towards
tuning parameter values
Input knobs
Recommended knobs
Goal
Experiments
Exploit knobs
Classic paper: Tuning Database Configuration Parameters with iTuned. 2009
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 45
Tuning Method: Machine Learning
➢ Machine Learning (ML) approaches aim to tune parameters
automatically by taking advantages of ML methods.
ML Training
Input knobs
Recommended knobs
Goal
ML Model
Training logs
Actual run logs
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 46
Tuning Method: Machine Learning (OtterTune 2017)
➢ Factor Analysis: transform high dimension parameters to few factors
➢ Kmeans: Cluster distinct metrics
➢ Lasso: Rank parameters
➢ Gaussian Process: Predict and tune performance
Figure. OtterTune system architecture
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 47
Tuning Method: Machine Learning (CDBTune 2019)
➢ Reinforcement learning
▪ State: knobs and metrics
▪ Reward: performance change
▪ Action: recommended knobs
▪ Policy: Deep Neural network
➢ Key idea
▪ Feedback: try-and-error method
▪ Recommend -> good/bad
▪ Deep deterministic policy gradient
▪ Actor critic algorithm
Figure. CDBTune Deep deterministic policy gradient
Reward: Throughput and latency
performance change Δ from time t − 1
and the initial time to time t
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems
Online
48
Tuning Method: Adaptive
➢ An adaptive approach changes parameter configurations online as
the environment or query workload changes
Figure. CLOT (2006) strategy
New query
New environment
Self-tuning Module
Recommend
index selection
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 49
The Differences of Tuning Database & Big Data Systems in
research papers
Relational Database Big Data System
Parameters More parameters on memory More parameters on vcores
Resource Often fixed resources Now more varying resources
Scalability Often single machine Often many machines in a
distributed environment
Metrics Throughput, latency Time, resource cost
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 50
References (1/1)
➢ Narayanan, Dushyanth, Eno Thereska, and Anastassia Ailamaki. "Continuous resource monitoring for self-predicting
DBMS." 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and
Telecommunication Systems. IEEE, 2005.
➢ Dias, K., Ramacher, M., Shaft, U., Venkataramani, V., & Wood, G.. Automatic Performance Diagnosis and Tuning in
Oracle. In CIDR (pp. 84-94), 2005
➢ Schnaitter, K., Abiteboul, S., Milo, T., & Polyzotis, N. Colt: continuous on-line tuning. In Proceedings of the 2006 ACM
SIGMOD international conference on Management of data (pp. 793-795). ACM, 2006. (CLOT)
➢ Storm, Adam J., et al. "Adaptive self-tuning memory in DB2." Proceedings of the 32nd international conference on
Very large data bases. VLDB Endowment, 2006. (STMM)
➢ Debnath, Biplob K., David J. Lilja, and Mohamed F. Mokbel. "SARD: A statistical approach for ranking database tuning
parameters." 2008 IEEE 24th International Conference on Data Engineering Workshop. IEEE, 2008
➢ Babu, S., Borisov, N., Duan, S., Herodotou, H., & Thummala, V. Automated Experiment-Driven Management of
(Database) Systems. In HotOS, 2009.
➢ Duan, Songyun, Vamsidhar Thummala, and Shivnath Babu. "Tuning database configuration parameters with
iTuned." Proceedings of the VLDB Endowment 2.1: 1246-1257, 2009. (iTuned)
➢ Rodd, Sunil F., and Umakanth P. Kulkarni. "Adaptive tuning algorithm for performance tuning of database
management system." arXiv preprint arXiv:1005.0972, 2010.
➢ Van Aken, D., Pavlo, A., Gordon, G. J., & Zhang, B.. Automatic database management system tuning through large-
scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data(pp. 1009-
1024). ACM, 2017. (OtterTune)
➢ Zhang, J., Liu, Y., Zhou, K., Li, G., Xiao, Z., Cheng, B., & Ran, M. An end-to-end automatic cloud database tuning system
using deep reinforcement learning. In Proceedings of the 2019 International Conference on Management of Data
(pp. 415-432). ACM, 2019. (CBDTune)
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 51
Outline
Motivation and Background
History and Classification
Parameter Tuning on Databases
Parameter Tuning on Big Data Systems
Applications of Automatic Parameter Tuning
Open Challenges and Discussion
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 52
Ecosystems for Big Data Analytics
MapReduce-based Systems Spark-based Systems
Resource Managers
Distributed File Systems
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 53
Executing Analytics Workloads
Goal:
Execute MapReduce workload
Decisions:
Task
Task
Task
Input
Output
Goal:
Execute an analytics workload in < 2 hours
➢ Task parallelism
➢ Use compression
➢ …
➢ Container settings
➢ Executor cores
➢ …
➢ Number of nodes
➢ Machine specs
➢ …
C C
NMDN
C C
NMDN
ResourcesPlatformJob
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 54
Effect of Job-level Configuration Parameters
➢ 190+ parameters in Hadoop, 15-20 impact performance
➢ 200+ parameters in Spark, 20-30 impact performance
Scenario: 2 MapReduce jobs, 50GB, 16-node EC2 cluster
Word Co-occurrence Terasort
Two-dimensional projections of a multi-dimensional surface
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 55
Tuning Challenges
➢ High-dimensional space of configuration parameters
➢ Non-linear effect of hardware/applications/parameters on performance
➢ Heavy use of programming languages (e.g., Java/Python)
➢ Lack of schema & statistics for input data residing in files
➢ Terabyte-scale data cycles
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 56
Applying Cost-based Optimization
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 57
Applying Cost-based Optimization
Profile
Collect
concise &
general
summaries
of execution
Predict
Estimate
impact of
hypothetical
changes on
execution
Optimize
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 58
Profiling MapReduce Job Execution
⮚ Use dynamic instrumentation
❖ Support unmodified
MapReduce programs
⮚ Use sampling techniques
❖ Minimize overhead of
profiling
Dataflow
Statistics
Dataflow
Counters
Cost
Statistics
Cost
Counters
MapReduce
Job Execution
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 59
Predicting Job Profiles in Starfish
Dataflow
Statistics
Dataflow
Counters
Cost
Statistics
Cost
Counters
Dataflow
Statistics
Dataflow
Counters
Cost
Statistics
Cost
Counters
Cardinality
Models
Relative Black-box
Models
Analytical
Models
Analytical
Models
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 60
Job Optimization & Resource Provisioning
Dataflow
Statistics
Dataflow
Counters
Cost
Statistics
Cost
Counters
Cluster
Resources r2
Input
Data d2
Job Optimizer
Enumerate
Independent
Subspaces
Recursive
Random Search
Input
Data d2
Elastisizer
Enumerate
Cloud Resource
Options
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 61
MapReduce Cost Modeling Approaches
Approach Modeling Optimization Target Level
Starfish (2011-13) Analytical & relative
black box models
Recursive Random
Search
Job, Platform,
Cloud
ARIA (2011) Analytical models Lagrange Multipliers Job, Platform
HPM (2011) Scaling models & LR Brute-force Search Platform
Predator (2012) Analytical models Grid Hill Climbing Job
MRTuner (2014) PTC analytical models Grid-based search Job, Platform
CRESP (2014) Analytical models & LR Brute-force Search Platform, Cloud
MR-COF (2015) Analytical models &
MRPerf simulation
Genetic Algorithm Job
IHPM (2016) Scaling models & LWLR Lagrange Multipliers Platform, Cloud
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 62
Spark Cost Modeling Approaches
➢ Focus:
➢ Unique features:
❖ Ernest (2016): Focus on machine learning Spark applications
❖ Assurance (2017): Mixes white-box models with simulation
❖ DynamiConf (2017): Optimizes degree of parallelism for tasks
Predict performance on
large clusters, full data
Profile using small
clusters, data samples
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 63
Cost Modeling Approach: Pros & Cons
Pros Very efficient for predicting performance
Good accuracy in many (not complex) scenarios
Cons Hard to capture complexity of system internals &
pluggable components (e.g., schedulers)
Models often based on simplified assumptions
Not effective on heterogeneous clusters
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 64
Simulation-based Approach
➢ Key Objective: Accurately predict MapReduce job performance at fine
granularity
❖ Sorry, no fully-fledged Spark simulator available at this point!
➢ Use cases:
❖ Find optimal configuration settings
❖ Find cluster settings based on user requirements
❖ Identify performance bottlenecks
❖ Test new pluggable components (e.g., schedulers)
➢ Common technique: discrete event simulation
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 65
HSim: Hadoop Simulator
Detailed fine-grained execution trace
Slave nodeMaster node
JobTracker TaskTracker
MapperSim ReducerSimTask
Heartbeat
Job Reader Cluster Reader
Job Specs
- Data properties
- Conf parameters
Cluster Specs
- Network topology
- Node resources
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 66
Comparison of Hadoop Simulators
Simulator Network
Traffic
Hardware
Properties
MapReduce
Execution
MapReduce
Scheduling
Conf
Parameters
MRPerf (2009) Yes (ns-2) Yes Task sub-phases No Only few
MRSim (2010) Yes (GridSim) Yes Task sub-phases No Several
Mumak (2009) No No Only task level No Only few
SimMR (2011) No No Task sub-phases
FIFO,
Deadline
Only few
SimMapRed
(2011)
Yes (GridSim) Yes Task sub-phases Several Only few
HSim (2013) Yes (GridSim) Yes Task sub-phases FIFO, FAIR Several
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 67
Simulation-based Approach: Pros & Cons
Pros High accuracy in simulating dynamic system
behaviors
Efficient for predicting fine-grained performance
Cons Hard to comprehensively simulate complex
internal dynamics
Unable to capture dynamic cluster utilization
Not very efficient for finding optimal settings
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 68
General Experiment-driven Architecture
[Sample]
Input Data
MR/Spark
Application
Hadoop/Spark Cluster
Parameter
Search Engine
Best
Parameter
Settings
Application
Executor
Conf
Performance
Analyzer
Logs
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 69
Experiment-driven Approaches
[Sample]
Input Data
MR/Spark
Application Best
Parameter
Settings
Performance
Analyzer
Application
Executor
Conf
Parameter
Search Engine
Panacea
(2012)
• Grid search on
independent
subspaces
Gunther
(2013)
• Genetic search
algorithm
Petridis
(2016)
• Search tree
based on
heuristics
AutoTune
(2018)
• Multiple bound
and search
algorithm
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 70
Gounaris (2018) Exp-driven Approach
Test all candidate configurations
and keep best one
Use benchmarking applications
(sort-by-key, shuffling, k-means)
Test parameter values separately and
in pairs (117 runs for 15 parameters)
Create candidate configurations (9
complex parameter configurations)
Offline
Online
Spark
Application
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 71
Experiment-driven Approach: Pros & Cons
Pros Finds good settings based on real test runs on real
systems
Works across different system versions and
hardware
Cons Very time consuming as it requires multiple actual
runs
Not cost effective for ad-hoc analytics applications
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 72
Machine Learning (ML) Approaches
➢ Three categories of ML approaches:
1) Build a historical store and use similarity measures
2) Perform clustering and ML modeling per cluster
3) Train and utilize a ML model per application
Offline Phase
Historical
data
Output
Feature
selection
Modeling
Online Phase
New
job
Best
conf
Prediction
Optimization
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 73
ML Approaches with Historical Stores
Offline
Phase
MR Jobs
Execute &
collect data
Store: <Features, Execution time>
Online
Phase
Kavulya (2010) PStorm (2014)
New MR Job
Similar Jobs & Execution Times
Execution Time Prediction
k-Nearest-Neighbors
Locally-weighted
Linear Regression
MR Jobs
Execute &
collect data
Store: <Features, Job profile>
New MR Job
Job Features
New Job Profile
Sample execution
XGBoost probing
Optimal Configuration
Starfish Optimizer
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 74
ML Approaches with Clustering
Offline
Phase
Online
Phase
AROMA (2012) PPABS (2013)
Job Utilization Data
k-mediod Clustering
Job Clusters
SVM Model per Cluster
Support Vector Machine
New MR Job
Job Utilization Data
SVM Model
Sample execution
Find cluster
Optimal Resources & Configuration
Pattern Search algorithm
New MR Job
Job Utilization Data
Sample execution
Find cluster
Optimal Configuration
Job Utilization Data
k-means++ Clustering
Job Clusters
Optimal Conf per Cluster
Simulated Annealing
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 75
ML Approaches with App Modeling
Offline
Phase
Online
Phase
Sample executions
MR Job / Spark App
Training Data
Machine Learning Model
Training model *
Guolu
(2016)
* Decision Tree
(C5.0)
+ Recursive
Random Search
Hernandez
(2017)
* Boosted
Regression Trees
+ Heuristic
algorithm
Chen
(2015)
* Tree-based
Regression
+ Random Hill
Climbing
RFHOC
(2016)
* Random-Forest
Approach
+ Genetic
Algorithm
MR Job / Spark App
Execution Time Prediction
Employ ML model
Optimal Configuration
Search algorithm +
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 76
Machine Learning Approach: Pros & Cons
Pros Ability to capture complex system dynamics
Independence from system internals and hardware
Learning based on real observations of system
performance
Cons Requires large training sets, which are expensive to
collect
Training from history logs leads to data under-
fitting
Typically low accuracy for unseen analytics
applications
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 77
Adaptive Approach
➢ Key idea: Track execution of a job and change its configuration in an
online fashion in order to improve performance
Map
Wave 1
Map
Wave 3
Map
Wave 2
Reduce
Wave 1
Reduce
Wave 2
Shuffle
1. Collect statistics from previous wave(s)
2. Set better configurations for next wave *
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 78
Adaptive Approach
➢ Key idea: Track execution of a job and change its configuration in an
online fashion in order to improve performance
Map
Wave 1
Map
Wave 3
Map
Wave 2
Reduce
Wave 1
Reduce
Wave 2
Shuffle
1. Collect statistics from previous wave(s)
2. Set better configurations for next wave *
MROnline
(2014)
* Gray-box hill
climbing
Ant
(2014)
* Genetic
Algorithm
JellyFish
(2015)
* Model-based
hill climbing
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 79
The KERMIT (2016) Approach
Master Node
Slave Node
Node
Manager
Slave Node
Node
Manager
Slave Node
Node
Manager
YARN
Resource
Manager
Container
Spark AppContainer
Resource
requests
Container
MR App
Container
Resource
requests
KERMIT
1. Observe container
performance
2. Adjust memory & CPU
allocations to maximize
performance
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 80
Adaptive Approach: Pros & Cons
Pros Finds good settings based on actual task runs
Able to adjust to dynamic runtime status
Works well for ad-hoc analytics applications
Cons Only applies to long-running analytics applications
Inappropriate configuration can cause issues (e.g.,
stragglers)
Neglects efficient resource utilization in the whole
system
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 81
Outline
Motivation and Background
History and Classification
Parameter Tuning on Databases
Parameter Tuning on Big Data Systems
Applications of Automatic Parameter Tuning
Open Challenges and Discussion
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 82
Auto Parameter Tuning in Database Systems
➢ Oracle Self-driving Database
❖ Automatically set various memory parameters and use of
compression using machine learning
➢ IBM DB2 Self-tuning Memory Manager
❖ Dynamically distributes available memory resources among
buffer pools, locking memory, package cache, and sort memory
➢ Azure SQL Database Automatic Tuning
❖ Memory buffer settings, index management, plan choice
correction
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 83
Auto Parameter Tuning in Big Data Systems
➢ Databricks Optimized Autoscaling
❖ Automatically scale number of executors in Spark up and
down
➢ Spotfire Data Science Autotuning
❖ Automatically set Spark parameters for number and
memory size of executors
➢ Sparklens: Qubole’s Spark Tuning Tool
❖ Automatically set memory of Spark executors
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 84
Auto Parameter Tuning with Unravel
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems
spark.driver.cores 2
spark.executor.cores
…
10
spark.sql.shuffle.partitions 300
spark.sql.autoBroadcastJoinThreshold 20MB
…
SKEW('orders', 'o_custId') true
spark.catalog.cacheTable(“orders") true
…
PERFORMANCE
85
Today, tuning is often by trial-and-error
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 86
A New World
INPUTS
1. App = Spark Query
2. Goal = Speedup
“I need to make this app faster”
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems
TIME
APPDURATION
In blink of an eye, user
gets recommendations
to make the app 30%
faster
As user finishes
checking email, she
has a verified run
that is 60% faster
User comes back from
lunch. A verified run
that is 90% faster
90%
faster!
87
A New World
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems
Response Surface MethodologyReinforcement Learning
88
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 89
Autotuning Workflow
Monitoring
Data
Historic Data
&
Probe Data
Recommendation
Algorithm
Cluster Services On-premises and Cloud
App,Goal
Orchestrator
Xnext
Probe Algorithm
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 90
Outline
Motivation and Background
History and Classification
Parameter Tuning on Databases
Parameter Tuning on Big Data Systems
Applications of Automatic Parameter Tuning
Open Challenges and Discussion
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 91
Putting it all Together
Approach Pros Cons
Cost Modeling Very efficient for predicting performance
Good accuracy in many (not complex) scenarios
Very efficient for predicting performance
Good accuracy in many (not complex) scenarios
Hard to capture complexity of system internals & pluggable
components (e.g., schedulers)
Models often based on simplified assumptions
Not effective on heterogeneous clusters
Simulation-based High accuracy in simulating dynamic system behaviors
Efficient for predicting fine-grained performance
Hard to comprehensively simulate complex internal dynamics
Unable to capture dynamic cluster utilization
Not very efficient for finding optimal settings
Experiment-driven Finds good settings based on real test runs on real
systems
Works across different system versions and hardware
Very time consuming as it requires multiple actual runs
Not cost effective for ad-hoc analytics applications
Machine Learning Ability to capture complex system dynamics
Independence from system internals and hardware
Learning based on real observations of system
performance
Requires large training sets, which are expensive to collect
Training from history logs leads to data under-fitting
Typically low accuracy for unseen analytics applications
Adaptive Finds good settings based on actual task runs
Able to adjust to dynamic runtime status
Works well for ad-hoc analytics applications
Only applies to long-running analytics applications
Inappropriate configuration can cause issues (e.g., stragglers)
Neglects efficient resource utilization in the whole system
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 92
Open Challenges
Clusters are becoming heterogeneous in nature,
both for compute and storage
The proliferation of Cloud leads to multi-
tenancy, overheads, perf interaction issues
Real-time analytics pushes boundaries on latency
requirements and combination of systems
Ensuring good
and robust
system
performance at
scale poses new
challenges
Thank you!
93
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 94
References (1/6)
➢ Shivnath Babu. Towards Automatic Optimization of MapReduce Programs. In Proceedings of the 1st ACM
Symposium on Cloud Computing (SoCC), pages 137–142. ACM, 2010.
➢ Shivnath Babu, Herodotos Herodotou, et al. Massively Parallel Databases and MapReduce Systems.
Foundations and Trends® in Databases, 5(1):1–104, 2013.
➢ Liang Bao, Xin Liu, and Weizhao Chen. Learning-based Automatic Parameter Tuning for Big Data Analytics
Frameworks. arXiv preprint arXiv:1808.06008, 2018.
➢ Zhendong Bei, Zhibin Yu, Huiling Zhang, Wen Xiong, Chengzhong Xu, Lieven Eeckhout, and Shengzhong Feng.
RFHOC: A Random-Forest Approach to Auto-Tuning Hadoop’s Configuration. IEEE Transactions on Parallel and
Distributed Systems (TPDS), 27(5):1470–1483, 2016.
➢ Chi-Ou Chen, Ye-Qi Zhuo, Chao-Chun Yeh, Che-Min Lin, and Shih-Wei Liao. Machine Learning-Based
Configuration Parameter Tuning on Hadoop System. In Proceedings of the 2015 IEEE International Congress
on Big Data (BigData Congress), pages 386–392. IEEE, 2015.
➢ Keke Chen, James Powers, Shumin Guo, and Fengguang Tian. CRESP: Towards Optimal Resource Provisioning
for MapReduce Computing in Public Clouds. IEEE Transactions on Parallel and Distributed Systems (TPDS),
25(6):1403–1412, 2014.
➢ Dazhao Cheng, Jia Rao, Yanfei Guo, and Xiaobo Zhou. Improving MapReduce Performance in Heterogeneous
Environments with Adaptive Task Tuning. In Proceedings of the 15th International Middleware Conference,
pages 97–108. ACM, 2014.
➢ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters.
Communications of the ACM, 51(1):107–113, 2008.
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 95
References (2/6)
➢ Xiaoan Ding, Yi Liu, and Depei Qian. Jellyfish: Online Performance Tuning with Adaptive Configuration and
Elastic Container in Hadoop YARN. In Proceedings of the 2015 IEEE 21st International Conference on Parallel
and Distributed Systems (ICPADS), pages 831–836. IEEE, 2015.
➢ Mostafa Ead, Herodotos Herodotou, Ashraf Aboulnaga, and Shivnath Babu. PStorM: Profile Storage and
Matching for Feedback-Based Tuning of MapReduce Jobs. In Proceedings of the 17th International
Conference on Extending Database Technology (EDBT), pages 1–12. OpenProceedings, March 2014.
➢ Mikhail Genkin, Frank Dehne, Maria Pospelova, Yabing Chen, and Pablo Navarro. Automatic, On-Line Tuning
of YARN Container Memory and CPU Parameters. In Proceedings of the 2016 IEEE 18th International
Conference on High Performance Computing and Communications (HPCC), pages 317–324. IEEE, 2016.
➢ Anastasios Gounaris, Georgia Kougka, Ruben Tous, Carlos Tripiana Montes, and Jordi Torres. Dynamic
Configuration of Partitioning in Spark Applications. IEEE Transactions on Parallel and Distributed Systems
(TPDS), 28(7):1891–1904, 2017.
➢ Anastasios Gounaris and Jordi Torres. A Methodology for Spark Parameter Tuning. Big Data Research, 11:22–
32, March 2017.
➢ Suhel Hammoud, Maozhen Li, Yang Liu, Nasullah Khalid Alham, and Zelong Liu. MRSim: A Discrete Event
Based MapReduce Simulator. In Proceedings of the Seventh International Conference on Fuzzy Systems and
Knowledge Discovery (FSKD), volume 6, pages 2993–2997. IEEE, 2010.
➢ Álvaro Brandón Hernández, María S Perez, Smrati Gupta, and Victor Muntés-Mulero. Using Machine Learning
to Optimize Parallelism in Big Data Applications. Future Generation Computer Systems, pages 1–12, 2017.
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 96
References (3/6)
➢ Herodotos Herodotou and Shivnath Babu. Profiling, What-if Analysis, and Cost-based Optimization of
MapReduce Programs. Proceedings of the VLDB Endowment, 4(11):1111–1122, 2011.
➢ Herodotos Herodotou and Shivnath Babu. A What-if Engine for Cost-based MapReduce Optimization. IEEE
Data Engineering Bulletin, 36(1):5–14, 2013.
➢ Herodotos Herodotou, Fei Dong, and Shivnath Babu. No One (Cluster) Size Fits All: Automatic Cluster Sizing
for Data-intensive Analytics. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC). ACM,
2011.
➢ Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath
Babu. Starfish: A Self-tuning System for Big Data Analytics. In Proceedings of the Fifth Biennial Conference on
Innovative Data Systems Research (CIDR), volume 11, pages 261–272, 2011.
➢ Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. An Analysis of Traces from a Production
MapReduce Cluster. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud
and Grid Computing, pages 94–103. IEEE Computer Society, 2010.
➢ Mukhtaj Khan, Yong Jin, Maozhen Li, Yang Xiang, and Changjun Jiang. Hadoop Performance Modeling for Job
Estimation and Resource Provisioning. IEEE Transactions on Parallel and Distributed Systems (TPDS),
27(2):441–454, 2016.
➢ Palden Lama and Xiaobo Zhou. AROMA: Automated Resource Allocation and Configuration of MapReduce
Environment in the Cloud. In Proceedings of the 9th international conference on Autonomic computing
(ICAC), pages 63–72. ACM, 2012.
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 97
References (4/6)
➢ Min Li, Liangzhao Zeng, Shicong Meng, Jian Tan, Li Zhang, Ali R Butt, and Nicholas Fuller. MROnline:
MapReduce Online Performance Tuning. In Proceedings of the 23rd International Symposium on High-
performance Parallel and Distributed Computing (HPDC), pages 165–176. ACM, 2014.
➢ Guangdeng Liao, Kushal Datta, and Theodore L Willke. Gunther: Search-based Auto-tuning of MapReduce. In
Proceedings of the European Conference on Parallel Processing (Euro-Par), pages 406–419. Springer, 2013.
➢ Chao Liu, Deze Zeng, Hong Yao, Chengyu Hu, Xuesong Yan, and Yuanyuan Fan. MR-COF: A Genetic
MapReduce Configuration Optimization Framework. In Proceedings of the International Conference on
Algorithms and Architectures for Parallel Processing (ICA3PP), pages 344–357. Springer, 2015.
➢ Jun Liu, Nishkam Ravi, Srimat Chakradhar, and Mahmut Kandemir. Panacea: Towards Holistic Optimization of
MapReduce Applications. In Proceedings of the Tenth International Symposium on Code Generation and
Optimization (CGO), pages 33–43. ACM, 2012.
➢ Yang Liu, Maozhen Li, Nasullah Khalid Alham, and Suhel Hammoud. HSim: A MapReduce Simulator in Enabling
Cloud Computing. Future Generation Computer Systems, 29(1):300–308, 2013.
➢ Mumak: Map-Reduce Simulator, 2010. https://issues.apache.org/jira/browse/MAPREDUCE-728.
➢ Panagiotis Petridis, Anastasios Gounaris, and Jordi Torres. Spark Parameter Tuning via Trial-and-Error. In
Proceedings of the INNS Conference on Big Data, pages 226–237. Springer, 2016.
➢ Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, and Chen Wang. MRTuner: a Toolkit to Enable Holistic
Optimization for MapReduce Jobs. Proceedings of the VLDB Endowment, 7(13):1319–1330, 2014.
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 98
References (5/6)
➢ Rekha Singhal and Praveen Singh. Performance Assurance Model for Applications on SPARK Platform. In
Proceedings of the Technology Conference on Performance Evaluation and Benchmarking (TPCTC), pages
131–146. Springer, 2017.
➢ Fei Teng, Lei Yu, and Frederic Magoulès. SimMapReduce: A Simulator for Modeling MapReduce Framework.
In Proceedings of the 5th FTRA International Conference on Multimedia and Ubiquitous Engineering (MUE),
pages 277–282. IEEE, 2011.
➢ Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans,
Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache Hadoop YARN: Yet Another Resource
Negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (SoCC), page 5. ACM, 2013.
➢ Shivaram Venkataraman, Zongheng Yang, Michael J Franklin, Benjamin Recht, and Ion Stoica. Ernest: Efficient
Performance Prediction for Large-Scale Advanced Analytics. In Proceedings of the 13th USENIX Symposium on
Networked Systems Design and Implementation (NSDI), pages 363–378. USENIX Association, 2016.
➢ Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. ARIA: Automatic Resource Inference and
Allocation for Mapreduce Environments. In Proceedings of the 8th ACM International Conference on
Autonomic Computing (ICAC), ICAC ’11, pages 235–244, New York, NY, USA, 2011. ACM.
➢ Abhishek Verma, Ludmila Cherkasova, and Roy H Campbell. Play it Again, SimMR! In Proceedings of the 2011
IEEE International Conference on Cluster Computing (CLUSTER), pages 253–261. IEEE, 2011.
VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 99
References (6/6)
➢ Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. Resource Provisioning Framework for MapReduce
Jobs with Performance Goals. In Proceedings of the ACM/IFIP/USENIX 12th International Middleware
Conference, pages 165–186. Springer, 2011.
➢ Guanying Wang, Ali R Butt, Prashant Pandey, and Karan Gupta. A Simulation Approach to Evaluating Design
Decisions in MapReduce Setups. In Proceedings of the 2009 IEEE International Symposium on Modeling,
Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), pages 1–11. IEEE, 2009.
➢ Guolu Wang, Jungang Xu, and Ben He. A Novel Method for Tuning Configuration Parameters of Spark based on
Machine Learning. In Proceedings of the 2016 IEEE 18th International Conference on High Performance
Computing and Communications (HPCC), pages 586–593. IEEE, 2016.
➢ Kewen Wang, Xuelian Lin, and Wenzhong Tang. Predator - An Experience Guided Configuration Optimizer for
Hadoop MapReduce. In Proceedings of the IEEE 4th International Conference on Cloud Computing Technology
and Science (CloudCom), pages 419–426. IEEE, 2012.
➢ Dili Wu and Aniruddha Gokhale. A Self-tuning System based on Application Profiling and Performance Analysis
for Optimizing Hadoop MapReduce Cluster Configuration. In Proceedings of the 20th International Conference
on High Performance Computing (HiPC), pages 89–98. IEEE, 2013.

More Related Content

What's hot

Back propagation method
Back propagation methodBack propagation method
Back propagation method
Prof. Neeta Awasthy
 
Hadoop and Financial Services
Hadoop and Financial ServicesHadoop and Financial Services
Hadoop and Financial Services
Cloudera, Inc.
 
What Is Machine Learning? | Machine Learning Basics | Edureka
What Is Machine Learning? | Machine Learning Basics | EdurekaWhat Is Machine Learning? | Machine Learning Basics | Edureka
What Is Machine Learning? | Machine Learning Basics | Edureka
Edureka!
 
Лекция №11 "Основы нейронных сетей"
Лекция №11 "Основы нейронных сетей" Лекция №11 "Основы нейронных сетей"
Лекция №11 "Основы нейронных сетей"
Technosphere1
 
Decision trees for machine learning
Decision trees for machine learningDecision trees for machine learning
Decision trees for machine learning
Amr BARAKAT
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
Mehdi Shibahara
 
CNN Attention Networks
CNN Attention NetworksCNN Attention Networks
CNN Attention Networks
Taeoh Kim
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
Databricks
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
DataRobot
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
DataWorks Summit/Hadoop Summit
 
KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...
KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...
KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...
Edureka!
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
Shravan (Sean) Pabba
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
Hadoop and friends : introduction
Hadoop and friends : introductionHadoop and friends : introduction
Hadoop and friends : introductionfredcons
 
Deep learning
Deep learningDeep learning
Deep learning
Ratnakar Pandey
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
Spark Summit
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Spark Summit
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Simplilearn
 

What's hot (20)

Back propagation method
Back propagation methodBack propagation method
Back propagation method
 
Hadoop and Financial Services
Hadoop and Financial ServicesHadoop and Financial Services
Hadoop and Financial Services
 
What Is Machine Learning? | Machine Learning Basics | Edureka
What Is Machine Learning? | Machine Learning Basics | EdurekaWhat Is Machine Learning? | Machine Learning Basics | Edureka
What Is Machine Learning? | Machine Learning Basics | Edureka
 
Лекция №11 "Основы нейронных сетей"
Лекция №11 "Основы нейронных сетей" Лекция №11 "Основы нейронных сетей"
Лекция №11 "Основы нейронных сетей"
 
Decision trees for machine learning
Decision trees for machine learningDecision trees for machine learning
Decision trees for machine learning
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
CNN Attention Networks
CNN Attention NetworksCNN Attention Networks
CNN Attention Networks
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance Debunking the Myths of HDFS Erasure Coding Performance
Debunking the Myths of HDFS Erasure Coding Performance
 
KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...
KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...
KNN Algorithm using Python | How KNN Algorithm works | Python Data Science Tr...
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop and friends : introduction
Hadoop and friends : introductionHadoop and friends : introduction
Hadoop and friends : introduction
 
Deep learning
Deep learningDeep learning
Deep learning
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by...
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 

Similar to Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems

Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
Sri Ambati
 
Unlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data LakeUnlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data Lake
MongoDB
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Edwin Poot
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
MariaDB plc
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
Data Driven Innovation
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009HawaDia
 
Concept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsConcept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with Telematics
Seeling Cheung
 
Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...
IRJET Journal
 
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumSimplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
VMware Tanzu
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
Robert Grossman
 
Icse2013 shang
Icse2013 shangIcse2013 shang
Icse2013 shangSAIL_QU
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Denodo
 
Making Big Data Analytics with Hadoop fast & easy (webinar slides)
Making Big Data Analytics with Hadoop fast & easy (webinar slides)Making Big Data Analytics with Hadoop fast & easy (webinar slides)
Making Big Data Analytics with Hadoop fast & easy (webinar slides)
Yellowfin
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
Databricks
 
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015Junli Gu
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
Vijayananda Mohire
 
DataOps - Production ML
DataOps - Production MLDataOps - Production ML
DataOps - Production ML
Al Zindiq
 
Improving Reporting Performance
Improving Reporting PerformanceImproving Reporting Performance
Improving Reporting Performance
Dhiren Gala
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
Databricks
 

Similar to Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems (20)

Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Unlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data LakeUnlocking Operational Intelligence from the Data Lake
Unlocking Operational Intelligence from the Data Lake
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
 
Applying linear regression and predictive analytics
Applying linear regression and predictive analyticsApplying linear regression and predictive analytics
Applying linear regression and predictive analytics
 
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo BrignoliL'architettura di classe enterprise di nuova generazione - Massimo Brignoli
L'architettura di classe enterprise di nuova generazione - Massimo Brignoli
 
Tips tricks to speed nw bi 2009
Tips tricks to speed  nw bi  2009Tips tricks to speed  nw bi  2009
Tips tricks to speed nw bi 2009
 
Concept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsConcept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with Telematics
 
Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...Democratization of NOSQL Document-Database over Relational Database Comparati...
Democratization of NOSQL Document-Database over Relational Database Comparati...
 
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumSimplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Icse2013 shang
Icse2013 shangIcse2013 shang
Icse2013 shang
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
 
Making Big Data Analytics with Hadoop fast & easy (webinar slides)
Making Big Data Analytics with Hadoop fast & easy (webinar slides)Making Big Data Analytics with Hadoop fast & easy (webinar slides)
Making Big Data Analytics with Hadoop fast & easy (webinar slides)
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
Big_Data_Heterogeneous_Programming IEEE_Big_Data 2015
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
DataOps - Production ML
DataOps - Production MLDataOps - Production ML
DataOps - Production ML
 
Improving Reporting Performance
Improving Reporting PerformanceImproving Reporting Performance
Improving Reporting Performance
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
 

Recently uploaded

Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 

Recently uploaded (20)

Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 

Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems

  • 1. Speedup your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems VLDB 2019 Tutorial Jiaheng Lu, University of Helsinki Yuxing Chen, University of Helsinki Herodotos Herodotou, Cyprus University of Technology Shivnath Babu, Duke University / Unravel Data Systems
  • 2. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 2 Outline Motivation and Background History and Classification Parameter Tuning on Databases Parameter Tuning on Big Data Systems Applications of Automatic Parameter Tuning Open Challenges and Discussion
  • 3. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 3 Modern applications are being built on a collection of distributed systems
  • 4. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 4 But: Running distributed applications reliably & efficiently is hard
  • 5. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 5 My app failed
  • 6. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 6 My data pipeline is missing SLA
  • 7. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 7 My cloud cost is out of control
  • 8. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 8 There are many challenges
  • 9. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 9 Self-driving Systems Automate physical data layout Index and view recommendation Knob tuning, buffer pool sizes, etc Plan optimization
  • 10. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 10 Different Optimization Levels of Self-driving Systems The focus of this tutorial Automate physical data layout Index and view recommendation Knob tuning, buffer pool sizes, etc Plan optimization
  • 11. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 11 Effectiveness of Knob (Parameter) Tuning
  • 12. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 12 Selected Performance-aware Parameters in PostgreSQL Parameter Name Description Default value bgwriter_lru_maxpages Max number of buffers written by the background writer 100 checkpoint_segments Max number of log file segments between WAL checkpoints 3 checkpoint_timeout Max time between automatic WAL checkpoints 5 min deadlock_timeout Waiting time on locks for checking for deadlocks 1 sec effective_cache_size Size of the disk cache accessible to one query 4 GB effective_io_concurrency Number of disk I/O operations to be executed concurrently 1 or 0 shared_buffers Memory size for shared memory buffers 128MB
  • 13. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 13 Selected Performance-aware Parameters in Hadoop Parameter Name Description Default value dfs.block.size The default block size for files stored HDFS 128MB mapreduce.map.tasks Number of map tasks 2 mapreduce.reduce.tasks Number of reduce tasks 1 mapreduce.job.reduce .slowstart.completedmaps Min percent of map tasks completed before scheduling reduce tasks 0.05 mapreduce.map.combine .minspills Min number of map output spill files present for using the combine function 3 mapreduce.reduce.merge.in mem.threshold Max number of shuffled map output pairs before initiating merging during the shuffle 1000
  • 14. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 14 Selected Performance-aware Parameters in Spark Parameter Name Description Default value spark.driver.cores Number of cores used by the Spark driver process 1 spark.driver.memory Memory size for driver process 1 GB spark.sql.shuffle.partitions Number of tasks 200 spark.executor.cores The number of cores for each executor 1 spark.files.maxPartitionBytes Max number of bytes to group into one partition during file reading 128MB spark.memory.fraction Fraction for execution and storage memory. It may cause frequent spills or cached data eviction if given a low fraction 0.6
  • 15. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 15 Challenge 1: A Huge Number of Systems
  • 16. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 16 Challenge 2: Many Parameters in a Single System
  • 17. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 17 Challenge 3: Diverse Workloads and System Complexity An example of Spark framework
  • 18. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 18 Franco Pepe Chef: “There is no pizza recipe. Every time the dough was made there were no scales, recipes, machinery.” There is no knob tuning recipe. Every time, we need to configure the parameters based on the bottleneck of different jobs and environment. -- VLDB 2019 tutorial
  • 19. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 19 Running Examples of Parameter Tuning (Hadoop)* Workloads ➢ Terasort: Sort a terabyte of data ➢ N-gram: Compute the inverted list of N-gram data ➢ PageRank: Compute pagerank of graphs Hadoop platform with MapReduce * Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
  • 20. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 20 Running Examples of Parameter Tuning (Hadoop) Problem: Given a MapReduce (or Spark) job with input data and running cluster, we want to find the setting of parameters that optimize the execution time of the job (i.e., minimize the job execution time)
  • 21. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 21 Tuned Key Parameters in Hadoop Parameter Name Description MapInputSplit Split number for map jobs MapOutputBuffer Buffer size of map output MapOutputCompression Whether the map output data is compressed ReduceCopy Time to start the copy in Reduce phase ReduceInputBuffer Input buffer size of Reduce ReduceOutput Reduce output block size
  • 22. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 22 Impact of Parameters on Selected Jobs TeraSort Job N-gram Job PageRank Job
  • 23. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 23 Comparison between Hadoop-X and MRTuner with Different Parameters
  • 24. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 24 50 Years of Knob Tuning Rule-based & DBA guideline (1960s) Cost-model (1970s) Experiments-driven (1980s) Machine learning (2007) Adaptive model (2006) Simulation (2000s)
  • 25. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 25 Classification of Existing Approaches Approach Main Idea Rule-based Based on the experience of human experts Cost Modeling Using statistical cost functions Simulation-based Modular or complete system simulation Experiment-driven Execute an experiment with different parameter settings Machine Learning Employ machine learning methods Adaptive Tune configuration parameters adaptively
  • 26. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 26 Rule-based Approach ➢ Assist users based on the experience of human experts Parameter Name Default Description Recommendation dfs.replication in HDFS 3 Lower it to reduce replication cost. Higher replication can make data local to more workers, but more space overhead IF (Running time is the critical goal and enough space) Set 5 Otherwise Set 3
  • 27. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 27 Cost Modeling Approach ➢ Build performance prediction models by using statistical cost functions Cost Constants Cost Formulas Cost Estimation Operations Parameter Values Statistics
  • 28. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 28 Simulation-based Approach ➢ Build performance models based on modular or complete simulation
  • 29. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 29 Experiment-driven Approach ➢ Execute the experiments repeatedly with different parameter settings, guided by a search algorithm Input knobs Recommended knobs Goal Experiments Exploit knobs
  • 30. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 30 Machine Learning Approach ➢ Establish performance models by employing machine learning methods ➢ Consider the complex system as a whole and assume no knowledge of system
  • 31. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 31 Adaptive Approach ➢ Tune parameters adaptively while an application is running ➢ Adjust the parameter settings as the environment changes Online CLOT (2006) strategy New query New environment Self-tuning Module Recommend index selection
  • 32. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 32 Outline Motivation and Background History and Classification Parameter Tuning on Databases Parameter Tuning on Big Data Systems Applications of Automatic Parameter Tuning Open Challenges and Discussion
  • 33. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 33 What and How to Tune? ➢ What to configure? ❖ Which parameters (knobs)? ❖ Which are most important? ➢ How to tune (to best throughput)? ❖ Increase buffer size? ❖ More parallelism on writing? I am a database I am running queries Figure. Tuning guitar knobs to right notes (frequencies) Run faster? Higher throughput?
  • 34. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 34 What to Tune – Some Important Knobs for throughput Parameter Name Brief Description and Use Deafult bgwriter_delay Background writer’s delay between activity rounds 200ms bgwriter_lru_maxpages Max number of buffers written by the background writer 100 checkpoint_segments Max number of log file segments between WAL checkpoints 3 checkpoint_timeout Max time between automatic WAL checkpoints 5min deadlock_timeout Waiting time on locks for checking for deadlocks 1s default_statistics_target Default statistics target for table columns 100 effective_cache_size Effective size of the disk cache accessible to one query 4GB shared_buffers Memory size for shared memory buffers 128MB Memory Cache Timeout Settings Threads
  • 35. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 35 What are the Important Parameters and How to Choose ➢ Affect the performance most (manually) ❖ Based on expert experiences ❖ Default documentation If you want higher throughput, better tuning memory-related parameters Performance-sensitive parameters are important! Parameters have strong correlation to performance are important!
  • 36. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 36 What are the Important Parameters and How to Choose ➢ Affect the performance most ➢ Strongest correlation between parameters and objective function (model) ❖ Linear regression model for independent parameters: ❑ Regularized version of least squares – Lasso (OtterTune 2017) ✔ Interpretable, stable, and computationally efficient with higher dimensions ❖ Deep learning model (CBDTune 2019) ❑ The important input parameters will gain higher weights in training Weights Knobs
  • 37. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 37 How to Tune – Key Tuning Goals ➢ Avoidance: to identify and avoid error-prone configuration settings ➢ Ranking: to rank parameters according to the performance impact ➢ Profiling: to classify and store useful log information from previous runs ➢ Prediction: to predict the database or workload performance under hypothetical resource or parameter changes ➢ Tuning: to recommend parameter values to achieve objective goals
  • 38. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 38 How to Tune – Tuning Methods Methods Approach Methodology Target Level Rule-based SPEX (2013) Constraint inference Avoidance Xu (2015) Configuration navigation Ranking Cost-model STMM (2006) Cost model Tuning Simulation- based Dushyanth (2005) Trace-based simulation Prediction ADDM (2005) DAG model & simulation Profiling, tuning Experiment driven SARD (2008) P&B statistical design Ranking iTuned (2009) LHS & Guassian Process Profiling, tuning Machine Learning Rodd (2016) Neural Networks Tuning OtterTune (2017) Guassian Process Ranking, tuning CDBTune (2019) Deep RL Tuning Adaptive COLT (2006) Cost Vs. Gain analysis Profiling, tuning
  • 39. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 39 Relational Database Tuning Methods Rule-based Cost Modeling Simulation-based Experiment-driven Machine Learning Covered #knobs Training cost Time Rule-based Simulation-based Experiment-driven Machine Learning Adaptive Cost Modeling Figure. Required expert knowledge on systemFigure. Developing trend: putting more training cost to uncover more knobs 2005 2017
  • 40. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 40 Tuning Method: Rule-based ➢ Tuning based on rules derived from DBAs’ expertise, experience, and knowledge, or Rule of Thumb default recommendation Expert Guarantee cache memory to accelerate queries … Rule of Thumb Better not change the deadlock timeout if … Documents Default settings work most of the cases …Rules Trusted Parties
  • 41. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems ➢ A cost model establishes a performance model by cost functions based on the deep understanding of system components Cost Model 41 Tuning Method: Cost Modeling Cost Constants Cost Formulas Cost Estimate Operations Parameter Values Statistics
  • 42. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 42 Tuning Method: Cost Modeling (STMM) ➢ STMM: Adaptive Self-Tuning Memory in DB2 (2006) ❖ Reallocates memory for several critical components(e.g., compiled statement cache, sort, and buffer pools)
  • 43. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 43 Tuning Method: Simulation-based ➢ A simulation-based approach simulates workloads in one environment and learns experience or builds models to predict the performance in another. Running job here is (1) expensive or (2) slowdown concurrent jobs or (3)… Simulate it in small environment with tiny portion of data … Often product environment Often test environment
  • 44. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 44 Tuning Method: Experiment-driven ➢ An experiment-driven approach relies on repeated executions of the same workload under different configuration settings towards tuning parameter values Input knobs Recommended knobs Goal Experiments Exploit knobs Classic paper: Tuning Database Configuration Parameters with iTuned. 2009
  • 45. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 45 Tuning Method: Machine Learning ➢ Machine Learning (ML) approaches aim to tune parameters automatically by taking advantages of ML methods. ML Training Input knobs Recommended knobs Goal ML Model Training logs Actual run logs
  • 46. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 46 Tuning Method: Machine Learning (OtterTune 2017) ➢ Factor Analysis: transform high dimension parameters to few factors ➢ Kmeans: Cluster distinct metrics ➢ Lasso: Rank parameters ➢ Gaussian Process: Predict and tune performance Figure. OtterTune system architecture
  • 47. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 47 Tuning Method: Machine Learning (CDBTune 2019) ➢ Reinforcement learning ▪ State: knobs and metrics ▪ Reward: performance change ▪ Action: recommended knobs ▪ Policy: Deep Neural network ➢ Key idea ▪ Feedback: try-and-error method ▪ Recommend -> good/bad ▪ Deep deterministic policy gradient ▪ Actor critic algorithm Figure. CDBTune Deep deterministic policy gradient Reward: Throughput and latency performance change Δ from time t − 1 and the initial time to time t
  • 48. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems Online 48 Tuning Method: Adaptive ➢ An adaptive approach changes parameter configurations online as the environment or query workload changes Figure. CLOT (2006) strategy New query New environment Self-tuning Module Recommend index selection
  • 49. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 49 The Differences of Tuning Database & Big Data Systems in research papers Relational Database Big Data System Parameters More parameters on memory More parameters on vcores Resource Often fixed resources Now more varying resources Scalability Often single machine Often many machines in a distributed environment Metrics Throughput, latency Time, resource cost
  • 50. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 50 References (1/1) ➢ Narayanan, Dushyanth, Eno Thereska, and Anastassia Ailamaki. "Continuous resource monitoring for self-predicting DBMS." 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems. IEEE, 2005. ➢ Dias, K., Ramacher, M., Shaft, U., Venkataramani, V., & Wood, G.. Automatic Performance Diagnosis and Tuning in Oracle. In CIDR (pp. 84-94), 2005 ➢ Schnaitter, K., Abiteboul, S., Milo, T., & Polyzotis, N. Colt: continuous on-line tuning. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data (pp. 793-795). ACM, 2006. (CLOT) ➢ Storm, Adam J., et al. "Adaptive self-tuning memory in DB2." Proceedings of the 32nd international conference on Very large data bases. VLDB Endowment, 2006. (STMM) ➢ Debnath, Biplob K., David J. Lilja, and Mohamed F. Mokbel. "SARD: A statistical approach for ranking database tuning parameters." 2008 IEEE 24th International Conference on Data Engineering Workshop. IEEE, 2008 ➢ Babu, S., Borisov, N., Duan, S., Herodotou, H., & Thummala, V. Automated Experiment-Driven Management of (Database) Systems. In HotOS, 2009. ➢ Duan, Songyun, Vamsidhar Thummala, and Shivnath Babu. "Tuning database configuration parameters with iTuned." Proceedings of the VLDB Endowment 2.1: 1246-1257, 2009. (iTuned) ➢ Rodd, Sunil F., and Umakanth P. Kulkarni. "Adaptive tuning algorithm for performance tuning of database management system." arXiv preprint arXiv:1005.0972, 2010. ➢ Van Aken, D., Pavlo, A., Gordon, G. J., & Zhang, B.. Automatic database management system tuning through large- scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data(pp. 1009- 1024). ACM, 2017. (OtterTune) ➢ Zhang, J., Liu, Y., Zhou, K., Li, G., Xiao, Z., Cheng, B., & Ran, M. An end-to-end automatic cloud database tuning system using deep reinforcement learning. In Proceedings of the 2019 International Conference on Management of Data (pp. 415-432). ACM, 2019. (CBDTune)
  • 51. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 51 Outline Motivation and Background History and Classification Parameter Tuning on Databases Parameter Tuning on Big Data Systems Applications of Automatic Parameter Tuning Open Challenges and Discussion
  • 52. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 52 Ecosystems for Big Data Analytics MapReduce-based Systems Spark-based Systems Resource Managers Distributed File Systems
  • 53. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 53 Executing Analytics Workloads Goal: Execute MapReduce workload Decisions: Task Task Task Input Output Goal: Execute an analytics workload in < 2 hours ➢ Task parallelism ➢ Use compression ➢ … ➢ Container settings ➢ Executor cores ➢ … ➢ Number of nodes ➢ Machine specs ➢ … C C NMDN C C NMDN ResourcesPlatformJob
  • 54. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 54 Effect of Job-level Configuration Parameters ➢ 190+ parameters in Hadoop, 15-20 impact performance ➢ 200+ parameters in Spark, 20-30 impact performance Scenario: 2 MapReduce jobs, 50GB, 16-node EC2 cluster Word Co-occurrence Terasort Two-dimensional projections of a multi-dimensional surface
  • 55. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 55 Tuning Challenges ➢ High-dimensional space of configuration parameters ➢ Non-linear effect of hardware/applications/parameters on performance ➢ Heavy use of programming languages (e.g., Java/Python) ➢ Lack of schema & statistics for input data residing in files ➢ Terabyte-scale data cycles
  • 56. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 56 Applying Cost-based Optimization
  • 57. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 57 Applying Cost-based Optimization Profile Collect concise & general summaries of execution Predict Estimate impact of hypothetical changes on execution Optimize
  • 58. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 58 Profiling MapReduce Job Execution ⮚ Use dynamic instrumentation ❖ Support unmodified MapReduce programs ⮚ Use sampling techniques ❖ Minimize overhead of profiling Dataflow Statistics Dataflow Counters Cost Statistics Cost Counters MapReduce Job Execution
  • 59. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 59 Predicting Job Profiles in Starfish Dataflow Statistics Dataflow Counters Cost Statistics Cost Counters Dataflow Statistics Dataflow Counters Cost Statistics Cost Counters Cardinality Models Relative Black-box Models Analytical Models Analytical Models
  • 60. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 60 Job Optimization & Resource Provisioning Dataflow Statistics Dataflow Counters Cost Statistics Cost Counters Cluster Resources r2 Input Data d2 Job Optimizer Enumerate Independent Subspaces Recursive Random Search Input Data d2 Elastisizer Enumerate Cloud Resource Options
  • 61. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 61 MapReduce Cost Modeling Approaches Approach Modeling Optimization Target Level Starfish (2011-13) Analytical & relative black box models Recursive Random Search Job, Platform, Cloud ARIA (2011) Analytical models Lagrange Multipliers Job, Platform HPM (2011) Scaling models & LR Brute-force Search Platform Predator (2012) Analytical models Grid Hill Climbing Job MRTuner (2014) PTC analytical models Grid-based search Job, Platform CRESP (2014) Analytical models & LR Brute-force Search Platform, Cloud MR-COF (2015) Analytical models & MRPerf simulation Genetic Algorithm Job IHPM (2016) Scaling models & LWLR Lagrange Multipliers Platform, Cloud
  • 62. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 62 Spark Cost Modeling Approaches ➢ Focus: ➢ Unique features: ❖ Ernest (2016): Focus on machine learning Spark applications ❖ Assurance (2017): Mixes white-box models with simulation ❖ DynamiConf (2017): Optimizes degree of parallelism for tasks Predict performance on large clusters, full data Profile using small clusters, data samples
  • 63. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 63 Cost Modeling Approach: Pros & Cons Pros Very efficient for predicting performance Good accuracy in many (not complex) scenarios Cons Hard to capture complexity of system internals & pluggable components (e.g., schedulers) Models often based on simplified assumptions Not effective on heterogeneous clusters
  • 64. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 64 Simulation-based Approach ➢ Key Objective: Accurately predict MapReduce job performance at fine granularity ❖ Sorry, no fully-fledged Spark simulator available at this point! ➢ Use cases: ❖ Find optimal configuration settings ❖ Find cluster settings based on user requirements ❖ Identify performance bottlenecks ❖ Test new pluggable components (e.g., schedulers) ➢ Common technique: discrete event simulation
  • 65. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 65 HSim: Hadoop Simulator Detailed fine-grained execution trace Slave nodeMaster node JobTracker TaskTracker MapperSim ReducerSimTask Heartbeat Job Reader Cluster Reader Job Specs - Data properties - Conf parameters Cluster Specs - Network topology - Node resources
  • 66. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 66 Comparison of Hadoop Simulators Simulator Network Traffic Hardware Properties MapReduce Execution MapReduce Scheduling Conf Parameters MRPerf (2009) Yes (ns-2) Yes Task sub-phases No Only few MRSim (2010) Yes (GridSim) Yes Task sub-phases No Several Mumak (2009) No No Only task level No Only few SimMR (2011) No No Task sub-phases FIFO, Deadline Only few SimMapRed (2011) Yes (GridSim) Yes Task sub-phases Several Only few HSim (2013) Yes (GridSim) Yes Task sub-phases FIFO, FAIR Several
  • 67. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 67 Simulation-based Approach: Pros & Cons Pros High accuracy in simulating dynamic system behaviors Efficient for predicting fine-grained performance Cons Hard to comprehensively simulate complex internal dynamics Unable to capture dynamic cluster utilization Not very efficient for finding optimal settings
  • 68. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 68 General Experiment-driven Architecture [Sample] Input Data MR/Spark Application Hadoop/Spark Cluster Parameter Search Engine Best Parameter Settings Application Executor Conf Performance Analyzer Logs
  • 69. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 69 Experiment-driven Approaches [Sample] Input Data MR/Spark Application Best Parameter Settings Performance Analyzer Application Executor Conf Parameter Search Engine Panacea (2012) • Grid search on independent subspaces Gunther (2013) • Genetic search algorithm Petridis (2016) • Search tree based on heuristics AutoTune (2018) • Multiple bound and search algorithm
  • 70. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 70 Gounaris (2018) Exp-driven Approach Test all candidate configurations and keep best one Use benchmarking applications (sort-by-key, shuffling, k-means) Test parameter values separately and in pairs (117 runs for 15 parameters) Create candidate configurations (9 complex parameter configurations) Offline Online Spark Application
  • 71. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 71 Experiment-driven Approach: Pros & Cons Pros Finds good settings based on real test runs on real systems Works across different system versions and hardware Cons Very time consuming as it requires multiple actual runs Not cost effective for ad-hoc analytics applications
  • 72. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 72 Machine Learning (ML) Approaches ➢ Three categories of ML approaches: 1) Build a historical store and use similarity measures 2) Perform clustering and ML modeling per cluster 3) Train and utilize a ML model per application Offline Phase Historical data Output Feature selection Modeling Online Phase New job Best conf Prediction Optimization
  • 73. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 73 ML Approaches with Historical Stores Offline Phase MR Jobs Execute & collect data Store: <Features, Execution time> Online Phase Kavulya (2010) PStorm (2014) New MR Job Similar Jobs & Execution Times Execution Time Prediction k-Nearest-Neighbors Locally-weighted Linear Regression MR Jobs Execute & collect data Store: <Features, Job profile> New MR Job Job Features New Job Profile Sample execution XGBoost probing Optimal Configuration Starfish Optimizer
  • 74. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 74 ML Approaches with Clustering Offline Phase Online Phase AROMA (2012) PPABS (2013) Job Utilization Data k-mediod Clustering Job Clusters SVM Model per Cluster Support Vector Machine New MR Job Job Utilization Data SVM Model Sample execution Find cluster Optimal Resources & Configuration Pattern Search algorithm New MR Job Job Utilization Data Sample execution Find cluster Optimal Configuration Job Utilization Data k-means++ Clustering Job Clusters Optimal Conf per Cluster Simulated Annealing
  • 75. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 75 ML Approaches with App Modeling Offline Phase Online Phase Sample executions MR Job / Spark App Training Data Machine Learning Model Training model * Guolu (2016) * Decision Tree (C5.0) + Recursive Random Search Hernandez (2017) * Boosted Regression Trees + Heuristic algorithm Chen (2015) * Tree-based Regression + Random Hill Climbing RFHOC (2016) * Random-Forest Approach + Genetic Algorithm MR Job / Spark App Execution Time Prediction Employ ML model Optimal Configuration Search algorithm +
  • 76. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 76 Machine Learning Approach: Pros & Cons Pros Ability to capture complex system dynamics Independence from system internals and hardware Learning based on real observations of system performance Cons Requires large training sets, which are expensive to collect Training from history logs leads to data under- fitting Typically low accuracy for unseen analytics applications
  • 77. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 77 Adaptive Approach ➢ Key idea: Track execution of a job and change its configuration in an online fashion in order to improve performance Map Wave 1 Map Wave 3 Map Wave 2 Reduce Wave 1 Reduce Wave 2 Shuffle 1. Collect statistics from previous wave(s) 2. Set better configurations for next wave *
  • 78. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 78 Adaptive Approach ➢ Key idea: Track execution of a job and change its configuration in an online fashion in order to improve performance Map Wave 1 Map Wave 3 Map Wave 2 Reduce Wave 1 Reduce Wave 2 Shuffle 1. Collect statistics from previous wave(s) 2. Set better configurations for next wave * MROnline (2014) * Gray-box hill climbing Ant (2014) * Genetic Algorithm JellyFish (2015) * Model-based hill climbing
  • 79. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 79 The KERMIT (2016) Approach Master Node Slave Node Node Manager Slave Node Node Manager Slave Node Node Manager YARN Resource Manager Container Spark AppContainer Resource requests Container MR App Container Resource requests KERMIT 1. Observe container performance 2. Adjust memory & CPU allocations to maximize performance
  • 80. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 80 Adaptive Approach: Pros & Cons Pros Finds good settings based on actual task runs Able to adjust to dynamic runtime status Works well for ad-hoc analytics applications Cons Only applies to long-running analytics applications Inappropriate configuration can cause issues (e.g., stragglers) Neglects efficient resource utilization in the whole system
  • 81. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 81 Outline Motivation and Background History and Classification Parameter Tuning on Databases Parameter Tuning on Big Data Systems Applications of Automatic Parameter Tuning Open Challenges and Discussion
  • 82. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 82 Auto Parameter Tuning in Database Systems ➢ Oracle Self-driving Database ❖ Automatically set various memory parameters and use of compression using machine learning ➢ IBM DB2 Self-tuning Memory Manager ❖ Dynamically distributes available memory resources among buffer pools, locking memory, package cache, and sort memory ➢ Azure SQL Database Automatic Tuning ❖ Memory buffer settings, index management, plan choice correction
  • 83. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 83 Auto Parameter Tuning in Big Data Systems ➢ Databricks Optimized Autoscaling ❖ Automatically scale number of executors in Spark up and down ➢ Spotfire Data Science Autotuning ❖ Automatically set Spark parameters for number and memory size of executors ➢ Sparklens: Qubole’s Spark Tuning Tool ❖ Automatically set memory of Spark executors
  • 84. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 84 Auto Parameter Tuning with Unravel
  • 85. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems spark.driver.cores 2 spark.executor.cores … 10 spark.sql.shuffle.partitions 300 spark.sql.autoBroadcastJoinThreshold 20MB … SKEW('orders', 'o_custId') true spark.catalog.cacheTable(“orders") true … PERFORMANCE 85 Today, tuning is often by trial-and-error
  • 86. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 86 A New World INPUTS 1. App = Spark Query 2. Goal = Speedup “I need to make this app faster”
  • 87. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems TIME APPDURATION In blink of an eye, user gets recommendations to make the app 30% faster As user finishes checking email, she has a verified run that is 60% faster User comes back from lunch. A verified run that is 90% faster 90% faster! 87 A New World
  • 88. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems Response Surface MethodologyReinforcement Learning 88
  • 89. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 89 Autotuning Workflow Monitoring Data Historic Data & Probe Data Recommendation Algorithm Cluster Services On-premises and Cloud App,Goal Orchestrator Xnext Probe Algorithm
  • 90. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 90 Outline Motivation and Background History and Classification Parameter Tuning on Databases Parameter Tuning on Big Data Systems Applications of Automatic Parameter Tuning Open Challenges and Discussion
  • 91. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 91 Putting it all Together Approach Pros Cons Cost Modeling Very efficient for predicting performance Good accuracy in many (not complex) scenarios Very efficient for predicting performance Good accuracy in many (not complex) scenarios Hard to capture complexity of system internals & pluggable components (e.g., schedulers) Models often based on simplified assumptions Not effective on heterogeneous clusters Simulation-based High accuracy in simulating dynamic system behaviors Efficient for predicting fine-grained performance Hard to comprehensively simulate complex internal dynamics Unable to capture dynamic cluster utilization Not very efficient for finding optimal settings Experiment-driven Finds good settings based on real test runs on real systems Works across different system versions and hardware Very time consuming as it requires multiple actual runs Not cost effective for ad-hoc analytics applications Machine Learning Ability to capture complex system dynamics Independence from system internals and hardware Learning based on real observations of system performance Requires large training sets, which are expensive to collect Training from history logs leads to data under-fitting Typically low accuracy for unseen analytics applications Adaptive Finds good settings based on actual task runs Able to adjust to dynamic runtime status Works well for ad-hoc analytics applications Only applies to long-running analytics applications Inappropriate configuration can cause issues (e.g., stragglers) Neglects efficient resource utilization in the whole system
  • 92. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 92 Open Challenges Clusters are becoming heterogeneous in nature, both for compute and storage The proliferation of Cloud leads to multi- tenancy, overheads, perf interaction issues Real-time analytics pushes boundaries on latency requirements and combination of systems Ensuring good and robust system performance at scale poses new challenges
  • 94. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 94 References (1/6) ➢ Shivnath Babu. Towards Automatic Optimization of MapReduce Programs. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC), pages 137–142. ACM, 2010. ➢ Shivnath Babu, Herodotos Herodotou, et al. Massively Parallel Databases and MapReduce Systems. Foundations and Trends® in Databases, 5(1):1–104, 2013. ➢ Liang Bao, Xin Liu, and Weizhao Chen. Learning-based Automatic Parameter Tuning for Big Data Analytics Frameworks. arXiv preprint arXiv:1808.06008, 2018. ➢ Zhendong Bei, Zhibin Yu, Huiling Zhang, Wen Xiong, Chengzhong Xu, Lieven Eeckhout, and Shengzhong Feng. RFHOC: A Random-Forest Approach to Auto-Tuning Hadoop’s Configuration. IEEE Transactions on Parallel and Distributed Systems (TPDS), 27(5):1470–1483, 2016. ➢ Chi-Ou Chen, Ye-Qi Zhuo, Chao-Chun Yeh, Che-Min Lin, and Shih-Wei Liao. Machine Learning-Based Configuration Parameter Tuning on Hadoop System. In Proceedings of the 2015 IEEE International Congress on Big Data (BigData Congress), pages 386–392. IEEE, 2015. ➢ Keke Chen, James Powers, Shumin Guo, and Fengguang Tian. CRESP: Towards Optimal Resource Provisioning for MapReduce Computing in Public Clouds. IEEE Transactions on Parallel and Distributed Systems (TPDS), 25(6):1403–1412, 2014. ➢ Dazhao Cheng, Jia Rao, Yanfei Guo, and Xiaobo Zhou. Improving MapReduce Performance in Heterogeneous Environments with Adaptive Task Tuning. In Proceedings of the 15th International Middleware Conference, pages 97–108. ACM, 2014. ➢ Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1):107–113, 2008.
  • 95. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 95 References (2/6) ➢ Xiaoan Ding, Yi Liu, and Depei Qian. Jellyfish: Online Performance Tuning with Adaptive Configuration and Elastic Container in Hadoop YARN. In Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pages 831–836. IEEE, 2015. ➢ Mostafa Ead, Herodotos Herodotou, Ashraf Aboulnaga, and Shivnath Babu. PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs. In Proceedings of the 17th International Conference on Extending Database Technology (EDBT), pages 1–12. OpenProceedings, March 2014. ➢ Mikhail Genkin, Frank Dehne, Maria Pospelova, Yabing Chen, and Pablo Navarro. Automatic, On-Line Tuning of YARN Container Memory and CPU Parameters. In Proceedings of the 2016 IEEE 18th International Conference on High Performance Computing and Communications (HPCC), pages 317–324. IEEE, 2016. ➢ Anastasios Gounaris, Georgia Kougka, Ruben Tous, Carlos Tripiana Montes, and Jordi Torres. Dynamic Configuration of Partitioning in Spark Applications. IEEE Transactions on Parallel and Distributed Systems (TPDS), 28(7):1891–1904, 2017. ➢ Anastasios Gounaris and Jordi Torres. A Methodology for Spark Parameter Tuning. Big Data Research, 11:22– 32, March 2017. ➢ Suhel Hammoud, Maozhen Li, Yang Liu, Nasullah Khalid Alham, and Zelong Liu. MRSim: A Discrete Event Based MapReduce Simulator. In Proceedings of the Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), volume 6, pages 2993–2997. IEEE, 2010. ➢ Álvaro Brandón Hernández, María S Perez, Smrati Gupta, and Victor Muntés-Mulero. Using Machine Learning to Optimize Parallelism in Big Data Applications. Future Generation Computer Systems, pages 1–12, 2017.
  • 96. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 96 References (3/6) ➢ Herodotos Herodotou and Shivnath Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. Proceedings of the VLDB Endowment, 4(11):1111–1122, 2011. ➢ Herodotos Herodotou and Shivnath Babu. A What-if Engine for Cost-based MapReduce Optimization. IEEE Data Engineering Bulletin, 36(1):5–14, 2013. ➢ Herodotos Herodotou, Fei Dong, and Shivnath Babu. No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC). ACM, 2011. ➢ Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, and Shivnath Babu. Starfish: A Self-tuning System for Big Data Analytics. In Proceedings of the Fifth Biennial Conference on Innovative Data Systems Research (CIDR), volume 11, pages 261–272, 2011. ➢ Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. An Analysis of Traces from a Production MapReduce Cluster. In Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pages 94–103. IEEE Computer Society, 2010. ➢ Mukhtaj Khan, Yong Jin, Maozhen Li, Yang Xiang, and Changjun Jiang. Hadoop Performance Modeling for Job Estimation and Resource Provisioning. IEEE Transactions on Parallel and Distributed Systems (TPDS), 27(2):441–454, 2016. ➢ Palden Lama and Xiaobo Zhou. AROMA: Automated Resource Allocation and Configuration of MapReduce Environment in the Cloud. In Proceedings of the 9th international conference on Autonomic computing (ICAC), pages 63–72. ACM, 2012.
  • 97. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 97 References (4/6) ➢ Min Li, Liangzhao Zeng, Shicong Meng, Jian Tan, Li Zhang, Ali R Butt, and Nicholas Fuller. MROnline: MapReduce Online Performance Tuning. In Proceedings of the 23rd International Symposium on High- performance Parallel and Distributed Computing (HPDC), pages 165–176. ACM, 2014. ➢ Guangdeng Liao, Kushal Datta, and Theodore L Willke. Gunther: Search-based Auto-tuning of MapReduce. In Proceedings of the European Conference on Parallel Processing (Euro-Par), pages 406–419. Springer, 2013. ➢ Chao Liu, Deze Zeng, Hong Yao, Chengyu Hu, Xuesong Yan, and Yuanyuan Fan. MR-COF: A Genetic MapReduce Configuration Optimization Framework. In Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), pages 344–357. Springer, 2015. ➢ Jun Liu, Nishkam Ravi, Srimat Chakradhar, and Mahmut Kandemir. Panacea: Towards Holistic Optimization of MapReduce Applications. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (CGO), pages 33–43. ACM, 2012. ➢ Yang Liu, Maozhen Li, Nasullah Khalid Alham, and Suhel Hammoud. HSim: A MapReduce Simulator in Enabling Cloud Computing. Future Generation Computer Systems, 29(1):300–308, 2013. ➢ Mumak: Map-Reduce Simulator, 2010. https://issues.apache.org/jira/browse/MAPREDUCE-728. ➢ Panagiotis Petridis, Anastasios Gounaris, and Jordi Torres. Spark Parameter Tuning via Trial-and-Error. In Proceedings of the INNS Conference on Big Data, pages 226–237. Springer, 2016. ➢ Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, and Chen Wang. MRTuner: a Toolkit to Enable Holistic Optimization for MapReduce Jobs. Proceedings of the VLDB Endowment, 7(13):1319–1330, 2014.
  • 98. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 98 References (5/6) ➢ Rekha Singhal and Praveen Singh. Performance Assurance Model for Applications on SPARK Platform. In Proceedings of the Technology Conference on Performance Evaluation and Benchmarking (TPCTC), pages 131–146. Springer, 2017. ➢ Fei Teng, Lei Yu, and Frederic Magoulès. SimMapReduce: A Simulator for Modeling MapReduce Framework. In Proceedings of the 5th FTRA International Conference on Multimedia and Ubiquitous Engineering (MUE), pages 277–282. IEEE, 2011. ➢ Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (SoCC), page 5. ACM, 2013. ➢ Shivaram Venkataraman, Zongheng Yang, Michael J Franklin, Benjamin Recht, and Ion Stoica. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. In Proceedings of the 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 363–378. USENIX Association, 2016. ➢ Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. ARIA: Automatic Resource Inference and Allocation for Mapreduce Environments. In Proceedings of the 8th ACM International Conference on Autonomic Computing (ICAC), ICAC ’11, pages 235–244, New York, NY, USA, 2011. ACM. ➢ Abhishek Verma, Ludmila Cherkasova, and Roy H Campbell. Play it Again, SimMR! In Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), pages 253–261. IEEE, 2011.
  • 99. VLDB 2019 Speedup Your Analytics: Automatic Parameter Tuning for Databases and Big Data Systems 99 References (6/6) ➢ Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. Resource Provisioning Framework for MapReduce Jobs with Performance Goals. In Proceedings of the ACM/IFIP/USENIX 12th International Middleware Conference, pages 165–186. Springer, 2011. ➢ Guanying Wang, Ali R Butt, Prashant Pandey, and Karan Gupta. A Simulation Approach to Evaluating Design Decisions in MapReduce Setups. In Proceedings of the 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), pages 1–11. IEEE, 2009. ➢ Guolu Wang, Jungang Xu, and Ben He. A Novel Method for Tuning Configuration Parameters of Spark based on Machine Learning. In Proceedings of the 2016 IEEE 18th International Conference on High Performance Computing and Communications (HPCC), pages 586–593. IEEE, 2016. ➢ Kewen Wang, Xuelian Lin, and Wenzhong Tang. Predator - An Experience Guided Configuration Optimizer for Hadoop MapReduce. In Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CloudCom), pages 419–426. IEEE, 2012. ➢ Dili Wu and Aniruddha Gokhale. A Self-tuning System based on Application Profiling and Performance Analysis for Optimizing Hadoop MapReduce Cluster Configuration. In Proceedings of the 20th International Conference on High Performance Computing (HiPC), pages 89–98. IEEE, 2013.