We ran the MapReduce cluster benchmark TeraSort under a derivative-free optimization (DFO) method that uses the job runtime as its objective function. Each iteration of the DFO method produces new values for the Hadoop configuration parameters. Because these parameters are set within the framework, we used the Chef server and client tools to apply them across the cluster and to ensure that the TeraSort application runs correctly.
2. Sequence
• Abstract
• Introduction
• Workload Analysis of Search Engines
• Benchmarking Methodology and Decisions
• Scalable Data Generation Tool
• Case Studies
• Conclusions
3. Introduction
• Implementation of the MapReduce cluster benchmark TeraSort driven by a DFO method.
• Each iteration of the DFO method produces new values for the Hadoop configuration parameters.
• Because these parameters are set within the framework, we need a tool that applies them across the cluster to ensure that the TeraSort application runs correctly.
• Tool used: Chef server and Chef client.
4. TeraSort Benchmark
TeraSort includes 3 MapReduce applications:
● TeraGen: generates the data.
● TeraSort: samples the input data and uses the samples with MapReduce to sort the data.
● TeraValidate: validates that the output is sorted.
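The three stages above can be sketched as a sequence of command lines. This is only an illustration: the jar name and the HDFS directory names are hypothetical placeholders, and the exact jar path depends on the Hadoop version installed. The row count relies on TeraGen writing fixed 100-byte rows.

```python
# Sketch: build the three command lines of one TeraSort benchmark run.
# EXAMPLES_JAR and the HDFS directories are hypothetical placeholders.
EXAMPLES_JAR = "hadoop-examples.jar"  # assumed name; varies by Hadoop version
ROW_BYTES = 100                       # TeraGen writes fixed 100-byte rows


def terasort_commands(total_bytes,
                      in_dir="/bench/tera-in",
                      out_dir="/bench/tera-out",
                      report_dir="/bench/tera-report"):
    """Return the TeraGen, TeraSort and TeraValidate commands in run order."""
    rows = total_bytes // ROW_BYTES
    base = ["hadoop", "jar", EXAMPLES_JAR]
    return [
        base + ["teragen", str(rows), in_dir],         # generate the input
        base + ["terasort", in_dir, out_dir],          # sample and sort it
        base + ["teravalidate", out_dir, report_dir],  # check the ordering
    ]


commands = terasort_commands(100 * 10**9)  # roughly 100 GB of input
for cmd in commands:
    print(" ".join(cmd))
```

Each command would be launched in turn (for example with `subprocess.run`), and only the TeraSort stage would normally be timed.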
5. DFO Method
• Derivative-free optimization (DFO) is a discipline in mathematical optimization.
• It refers either to problems for which derivative information is unavailable or to methods that do not use derivatives.
• The derivative of a function of a real variable measures how sensitive one quantity (the dependent variable) is to changes in another quantity (the independent variable) that determines it. E.g., the derivative of the position of a moving object with respect to time is the object's velocity.
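To make the idea of optimizing without derivatives concrete, here is a minimal sketch of a compass (coordinate) search. It is one of the simplest derivative-free methods and is not one of the algorithms used in this work; the function `black_box` is a made-up stand-in for an objective that can only be evaluated, such as a measured runtime.

```python
def compass_search(f, x0, step=1.0, tol=1e-6, max_evals=1000):
    """Minimize f using only function values (no derivatives)."""
    x = list(x0)
    fx = f(x)
    evals = 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta        # probe one coordinate direction
                ft = f(trial)
                evals += 1
                if ft < fx:              # accept the first improving probe
                    x, fx = trial, ft
                    improved = True
                    break
            if improved:
                break
        if not improved:
            step /= 2.0                  # no probe helped: refine the step
    return x, fx


def black_box(p):
    """Hypothetical objective; only its values are available."""
    return (p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2


best_x, best_f = compass_search(black_box, [0.0, 0.0])
print(best_x, best_f)
```

The search never asks for a gradient: it compares function values at nearby points and shrinks the step when no neighbour is better.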
6. Algorithm BOBYQA
• BOBYQA (Bound Optimization BY Quadratic Approximation) is
a numerical optimization algorithm by Michael J. D. Powell.
• BOBYQA is also the name of Powell's Fortran 77 implementation of the algorithm.
• BOBYQA solves bound-constrained optimization problems without using derivatives of the objective function, which makes it a derivative-free algorithm.
• The algorithm uses a trust-region method that forms quadratic models by interpolation. One new point is computed on each iteration, usually by solving a trust-region subproblem subject to the bound constraints.
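The idea of a quadratic interpolation model minimized inside a bounded trust region can be illustrated in one dimension. The sketch below is a toy, not Powell's algorithm: real BOBYQA works in n dimensions, reuses interpolation points between iterations and adapts the trust-region radius. The function name and the test objective are assumptions made for the example.

```python
def quadratic_model_step(f, center, radius, lo, hi):
    """One toy step: interpolate f by a quadratic and minimize the model
    over the trust region [center - radius, center + radius] clipped to
    the bounds [lo, hi]. Assumes the clipped interval has positive length."""
    a = max(lo, center - radius)
    b = min(hi, center + radius)
    xs = [a, 0.5 * (a + b), b]           # three interpolation points
    fs = [f(x) for x in xs]

    # Newton divided differences:
    # q(x) = fs[0] + d1*(x - xs[0]) + d2*(x - xs[0])*(x - xs[1])
    d1 = (fs[1] - fs[0]) / (xs[1] - xs[0])
    d12 = (fs[2] - fs[1]) / (xs[2] - xs[1])
    d2 = (d12 - d1) / (xs[2] - xs[0])

    def q(x):
        return fs[0] + d1 * (x - xs[0]) + d2 * (x - xs[0]) * (x - xs[1])

    candidates = [a, b]
    if d2 > 0:                           # convex model: use its vertex
        vertex = 0.5 * (xs[0] + xs[1] - d1 / d2)
        candidates.append(min(b, max(a, vertex)))
    return min(candidates, key=q)


def objective(x):
    """Hypothetical smooth objective with its minimum at x = 2."""
    return (x - 2.0) ** 2 + 1.0


free_step = quadratic_model_step(objective, 0.0, 5.0, -1.0, 4.0)
bounded_step = quadratic_model_step(objective, 0.0, 5.0, -1.0, 1.5)
print(free_step, bounded_step)
```

When the upper bound is below the unconstrained minimizer, the step lands on the bound, which is the behaviour the bound constraints are meant to enforce.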
7. Algorithm COBYLA
• Constrained Optimization BY Linear Approximation (COBYLA) is a numerical optimization method for constrained problems where the derivative of the objective function is not known.
• It was invented by Michael J. D. Powell while he was working for Westland Helicopters.
• COBYLA proceeds by iteratively approximating the actual constrained optimization problem with linear programming problems.
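The linear-approximation idea can be illustrated with a one-dimensional toy. The sketch below is not COBYLA itself: it linearizes the objective and one inequality constraint from two sample points and solves the resulting tiny linear program inside a fixed trust region, whereas the real algorithm works on a simplex in n dimensions and shrinks the region as it converges. All names and the test problem are assumptions made for the example.

```python
def linear_model_step(f, c, x0, rho):
    """One toy step for: minimize f(x) subject to c(x) >= 0.
    Builds linear models of f and c from x0 and x0 + rho, then minimizes
    the linear objective over the trust region [x0 - rho, x0 + rho]
    intersected with the linearized constraint."""
    x1 = x0 + rho                        # second vertex of the 1-D simplex
    gf = (f(x1) - f(x0)) / rho           # slope of the objective model
    gc = (c(x1) - c(x0)) / rho           # slope of the constraint model
    lo, hi = x0 - rho, x0 + rho

    # c(x0) + gc * (x - x0) >= 0 trims one side of the interval.
    if gc > 0:
        lo = max(lo, x0 - c(x0) / gc)
    elif gc < 0:
        hi = min(hi, x0 - c(x0) / gc)
    if lo > hi:                          # linearized problem is infeasible
        return x0
    # A linear objective attains its minimum at an end of the interval.
    return lo if gf > 0 else hi


def objective(x):
    return x * x                         # hypothetical objective


def constraint(x):
    return x - 1.0                       # feasible when x >= 1


x = 3.0
for _ in range(3):
    x = linear_model_step(objective, constraint, x, 1.0)
print(x)
```

Starting from 3.0, the iterate moves toward smaller objective values until the linearized constraint stops it at the boundary of the feasible set.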
8. Hadoop Environment
• A physical cluster with 29 nodes was used:
• 1 Hadoop master server (running the JobTracker and NameNode services)
• 28 Hadoop slaves (running the TaskTracker and DataNode services)
• 2 Gigabit Ethernet for connectivity between the 29 nodes
9. Hadoop Environment
• A front-end server provides access to the cluster. It is also configured as the Chef server and is used to organize the executions of the DFO TeraSort application. Its role is to synchronize the DFO runs and to update the Hadoop parameter settings on each iteration of the DFO TeraSort method.
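Updating the Hadoop settings on each iteration amounts to regenerating a configuration file in Hadoop's `<configuration>`/`<property>` XML format and distributing it to the nodes. The sketch below only renders such a file from a dictionary. The three property names are real Hadoop 1.x settings, but their selection and values here are illustrative assumptions, since the slides do not list which parameters were tuned; in a Chef setup the file would typically be produced by a cookbook template rather than by this function.

```python
import xml.etree.ElementTree as ET


def render_site_xml(params):
    """Render a dict of Hadoop properties as a *-site.xml document."""
    root = ET.Element("configuration")
    for name, value in sorted(params.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative values for one DFO iteration (not taken from the experiment).
iteration_params = {
    "io.sort.mb": 200,
    "io.sort.factor": 50,
    "mapred.reduce.tasks": 50,
}

xml_text = render_site_xml(iteration_params)
print(xml_text)
```

Sorting the property names keeps the generated file stable between iterations, so only changed values show up when two configurations are compared.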
10. Experiment Execution
• Nemesis, a server that is not part of the cluster, is used as the front end: it launches the TeraSort application, runs the DFO method and updates the Hadoop settings based on the method's output.
• The synchronization between the TeraSort executions, the Hadoop updates and the output of the DFO method is performed by the dfo_hadoop_terasort application, which runs on the front-end server.
• The dfo_hadoop_terasort application is supplied with a file that contains the initial values of the Hadoop configuration parameters, constraints that keep these values out of ranges that would give undesirable objective-function values, a tolerance value for the constraints and the maximum number of iterations.
• By processing the input file and interacting with the Hadoop cluster, the application discovers which parameter values have the greatest impact on speeding up the TeraSort application, and it outputs a file with the best configuration parameters found.
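The control loop described above can be sketched as follows. This is a hedged outline, not the actual dfo_hadoop_terasort code: the `spec` dictionary stands in for the input file (whose real format is not shown), `simulated_terasort` returns a made-up runtime instead of running a job on the cluster, and `propose` is a placeholder for the step where BOBYQA or COBYLA would choose the next point. Constraint tolerances are omitted for brevity.

```python
def clamp(config, bounds):
    """Keep every parameter inside its allowed [low, high] range."""
    return {k: min(max(v, bounds[k][0]), bounds[k][1])
            for k, v in config.items()}


def tune(spec, run_job, propose):
    """Run up to max_iterations jobs and return the fastest configuration."""
    best_cfg, best_time = None, float("inf")
    cfg = dict(spec["initial"])
    for iteration in range(spec["max_iterations"]):
        cfg = clamp(cfg, spec["bounds"])
        runtime = run_job(cfg)           # objective value: job runtime
        if runtime < best_time:
            best_cfg, best_time = dict(cfg), runtime
        cfg = propose(cfg, runtime, iteration)
    return best_cfg, best_time


def simulated_terasort(config):
    """Stand-in for a cluster run; returns a made-up runtime in seconds."""
    return 1000.0 + (config["io.sort.mb"] - 300) ** 2 / 100.0


def propose(config, runtime, iteration):
    """Placeholder for the DFO step: simply raise io.sort.mb by 50."""
    nxt = dict(config)
    nxt["io.sort.mb"] += 50
    return nxt


spec = {
    "initial": {"io.sort.mb": 100},
    "bounds": {"io.sort.mb": (50, 400)},
    "max_iterations": 8,
}
best_cfg, best_time = tune(spec, simulated_terasort, propose)
print(best_cfg, best_time)
```

In a real run, `run_job` would push the configuration to the cluster, launch TeraSort and return the measured wall-clock time, and the best configuration would be written to the output file.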
11. Experiment Execution
• The cluster was composed of 28 slave servers with two processors each, for a total of 56 available processing slots. It was decided to keep 10% of this total in reserve for tasks that had to be executed more than once because of failures. On this basis, we used about 100 gigabytes of data generated by Hadoop TeraGen.
• To compare how well the job execution time was optimized, two DFO methods, BOBYQA and COBYLA, were executed, with the aim of identifying which one best suits the TeraSort application provided by Hadoop ....
• Each algorithm was run twice, with 50 iterations, to identify at which point the executions converge to a better runtime.
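The slot arithmetic in the first bullet can be checked with a few lines. Rounding the 10% reserve up to a whole slot is an assumption made here; the slides do not say how the fraction was rounded.

```python
import math

SERVERS = 28
PROCESSORS_PER_SERVER = 2
RESERVE_PERCENT = 10

total_slots = SERVERS * PROCESSORS_PER_SERVER
# Assumption: the reserve is rounded up to a whole number of slots.
reserved_slots = math.ceil(total_slots * RESERVE_PERCENT / 100)
usable_slots = total_slots - reserved_slots

print(total_slots, reserved_slots, usable_slots)  # prints: 56 6 50
```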
16. Results
• We used the DFO method with the BOBYQA and COBYLA algorithms.
• The main difference between them is the variation in job execution time across the iterations of the dfo_hadoop_terasort application.
• This difference stems mainly from how each algorithm approximates the objective function from the sampled points: in quadratic form (BOBYQA) or in linear form (COBYLA).
25. [Figures: four plots of execution progress against number of iterations, titled "TeraSort 50 New using COBYLA", "TeraSort 50 B using COBYLA", "TeraSort 50 A using BOBYQA" and "TeraSort 50 C using BOBYQA".]
26. The main difference between the DFO method with BOBYQA and with COBYLA is the variation in job execution time across the iterations of the dfo_hadoop_terasort application. It stems mainly from how each algorithm approximates the objective function from the sampled points: in quadratic or linear form, respectively.
[Figure: "Difference between Algorithms COBYLA and BOBYQA", time in seconds per iteration for the series TeraSort 50 New/COBYLA, TeraSort 50 B/COBYLA, TeraSort 50 A/BOBYQA and TeraSort 50 C/BOBYQA.]
27. Conclusion
• The total execution time converges more stably with COBYLA, with fewer fluctuations than with the BOBYQA algorithm.
• The BOBYQA algorithm achieves an average speedup of 12% in the execution of the TeraSort application.
• The COBYLA algorithm achieves an average speedup of 21.15% over the initial Hadoop settings in the execution of the TeraSort application, a greater optimization than with the BOBYQA algorithm.
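The slides do not state how the speedup percentages were computed. One common reading, assumed here, is the relative reduction in runtime with respect to the run using the initial settings. The runtimes in the example are made-up numbers chosen only so that the arithmetic is easy to follow; they are not measurements from the experiment.

```python
def speedup_percent(baseline_seconds, tuned_seconds):
    """Relative runtime reduction, in percent (assumed definition)."""
    return 100.0 * (baseline_seconds - tuned_seconds) / baseline_seconds


# Hypothetical runtimes, not experimental data.
example = speedup_percent(2000.0, 1577.0)
print(round(example, 2))
```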