BENCHMARKS
Dr. Amit Kumar, Dept of CSE,
JUET, Guna
Benchmarks and Benchmarking
• Benchmarking is the process of comparing
one's business processes and
performance metrics to industry bests or
best practices from other industries.
• Benchmarking is used to measure
performance using a specific indicator
(cost per unit of measure, productivity per
unit of measure, cycle time of x per unit of
measure or defects per unit of measure)
resulting in a metric of performance that is
then compared to others.
Benchmarks and Benchmarking
• In computing, a benchmark is the act of
running a computer program, a set of
programs, or other operations, in order to
assess the relative performance of an
object, normally by running a number of
standard tests and trials against it.
• The term 'benchmark' is also commonly used to refer to the elaborately designed benchmarking programs themselves.
Benchmarks and Benchmarking
• Benchmarking is usually associated with
assessing performance characteristics of
computer hardware, for example, the
floating point operation performance of a
CPU, but there are circumstances when
the technique is also applicable to
software.
• Software benchmarks are, for example,
run against compilers or database
management systems.
Benchmarks and Benchmarking
• Benchmarks provide a method of
comparing the performance of various
subsystems across different chip/system
architectures.
• Test suites, by contrast, are intended to assess the correctness of software rather than its performance.
Benchmarks and Benchmarking
• In the 1970s, the concept of a benchmark
evolved beyond a technical term signifying
a reference point. The word migrated into
the lexicon of business, where it came to
signify the measurement process by which
to conduct comparisons.
Benchmarks and Benchmarking
• In the early 1980s, Xerox Corporation, a leader in benchmarking, defined it as the continuous process of measuring products, services, and practices against the toughest competitors.
Benchmarks and Benchmarking
• Benchmarks, in contrast to benchmarking, are measurements used to evaluate the performance of a function, operation, or business relative to others.
• In the electronics industry, for instance, a benchmark has long referred to an operating statistic that allows you to compare your own performance to that of another.
Purpose of Benchmark
• As computer architecture advanced, it became
more difficult to compare the performance of
various computer systems simply by looking at
their specifications.
• Therefore, tests were developed that allowed
comparison of different architectures.
• For example, Pentium 4 processors generally
operate at a higher clock frequency than Athlon
XP processors, which does not necessarily
translate to more computational power.
Purpose of Benchmark
• A slower processor, with regard to clock
frequency, can perform as well as a processor
operating at a higher frequency.
• Benchmarks are designed to mimic a particular
type of workload on a component or system.
• Synthetic benchmarks do this by specially
created programs that impose the workload on
the component.
• Application benchmarks run real-world programs
on the system.
Purpose of Benchmark
• While application benchmarks usually give a
much better measure of real-world performance
on a given system, synthetic benchmarks are
useful for testing individual components, like a
hard disk or networking device.
• Benchmarks are particularly important in CPU
design, giving processor architects the ability to
measure and make tradeoffs in
microarchitectural decisions.
Purpose of Benchmark
• For example, if a benchmark extracts the key
algorithms of an application, it will contain the
performance-sensitive aspects of that
application. Running this much smaller snippet
on a cycle-accurate simulator can give clues on
how to improve performance.
• Prior to 2000, computer and microprocessor
architects used SPEC to do this, although
SPEC's Unix-based benchmarks were quite
lengthy and thus unwieldy to use intact.
RELATION OF BENCHMARKS
WITH EMPIRICAL METHODS
• In many areas of computer science,
experiments are the primary means of
demonstrating the potential and value of
systems and techniques.
• Empirical methods for analyzing and
comparing systems and techniques are of
considerable interest to many CS
researchers.
RELATION OF BENCHMARKS
WITH EMPIRICAL METHODS
• The main evaluation criterion adopted in some fields, such as satisfiability testing (SAT), is empirical performance on shared benchmark problems.
• In the seminar “Future Directions in Software Engineering”, many issues were addressed; some of them were:
RELATION OF BENCHMARKS
WITH EMPIRICAL METHODS
• In the paper “Research Methodology in Software Engineering”, four methodologies were identified: the scientific method, the engineering method, the empirical method, and the analytical method.
• In the paper “We Need To Measure The Quality Of Our Work”, the author points out that “we as a community have no generally accepted methods or benchmarks for measuring and comparing the quality and utility of our research results”.
RELATION OF BENCHMARKS
WITH EMPIRICAL METHODS
Examples:
• IEEE Computer Society Workshop on
Empirical Evaluation of Computer Vision
Algorithms.
– A benchmark for graphics recognition
systems
• An empirical comparison of C, C++, Java,
Perl, Python, Rexx, and Tcl
TYPES OF BENCHMARKS
• Real programs. They have input, output,
and options that a user can select when
running the program.
Examples: compilers, word-processing software, CAD tools, users' application software (e.g., MIS), etc.
TYPES OF BENCHMARKS
• Microbenchmark -
– Designed to measure the performance of a
very small and specific piece of code.
• Kernels -
– contain key code, normally abstracted from an actual program
– popular kernels: the Livermore Loops and the Linpack benchmark (basic linear algebra subroutines written in FORTRAN)
– results are reported in MFLOPS (see the sketch below)
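As a hedged illustration (not taken from any official benchmark source), the kernel idea fits in a few lines of C: time the daxpy-style loop y(i) = y(i) + a*x(i) quoted later in the LINPACK entry and convert the operation count into a MFLOPS figure. The array size and repeat count below are arbitrary.

/* Minimal kernel-style microbenchmark sketch: time a daxpy-like loop and
 * report MFLOPS. Printing y[0] keeps the compiler from removing the loop
 * as dead code. */
#include <stdio.h>
#include <time.h>

#define N    100000
#define REPS 1000

int main(void)
{
    static double x[N], y[N];
    double a = 3.14159;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_t start = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            y[i] = y[i] + a * x[i];      /* 2 flops per inner iteration */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    double mflops = 2.0 * N * REPS / (secs * 1e6);
    printf("time = %.3f s, rate = %.1f MFLOPS (y[0] = %g)\n", secs, mflops, y[0]);
    return 0;
}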
TYPES OF BENCHMARKS
• Component Benchmark/ micro-benchmark –
– programs designed to measure performance of a
computer's basic components
– automatic detection of a computer's hardware parameters, such as the number of registers, cache size, and memory latency
• I/O benchmarks
• Database benchmarks: measure the throughput and response times of database management systems (DBMSs)
• Parallel benchmarks: used on machines with multiple
cores, processors or systems consisting of multiple
machines
TYPES OF BENCHMARKS
• Synthetic Benchmark –
o Procedure for building a synthetic benchmark (see the sketch below):
- take statistics of all types of operations from many application programs
- get the proportion of each operation
- write a program based on the proportions above
o Examples of synthetic benchmarks:
- Whetstone
- Dhrystone
o These were the first general-purpose, industry-standard computer benchmarks. They do not necessarily obtain high scores on modern pipelined computers.
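The three-step procedure above can be sketched in C. The operation categories and percentages below are invented purely for illustration; real synthetic benchmarks such as Whetstone and Dhrystone are built from far richer statistics.

/* Hedged sketch of the synthetic-benchmark idea: execute a fixed mix of
 * operation types in proportions taken (hypothetically) from measured
 * application statistics: 50% integer adds, 30% floating-point multiplies,
 * and the remaining 20% data-dependent branches. */
#include <stdio.h>

#define TOTAL_OPS 1000000L

int main(void)
{
    const int pct_int_add = 50, pct_fp_mul = 30;  /* rest: branchy ops */

    long   acc_i = 0;
    double acc_f = 1.0;

    for (long n = 0; n < TOTAL_OPS; n++) {
        long slot = n % 100;                /* pick an operation by proportion */
        if (slot < pct_int_add)
            acc_i += n;                     /* integer add */
        else if (slot < pct_int_add + pct_fp_mul)
            acc_f *= 1.0000001;             /* floating-point multiply */
        else
            acc_i += (acc_i & 1) ? 3 : 1;   /* data-dependent branch */
    }

    /* Print the results so the compiler cannot discard the loop as dead code. */
    printf("acc_i = %ld, acc_f = %f\n", acc_i, acc_f);
    return 0;
}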
TYPES OF BENCHMARKS
• Toy benchmarks. Typically 10 to 100 lines of code that produce a result the user already knows.
Examples: Sieve of Eratosthenes, Puzzle, and Quicksort (a C sketch of the sieve follows).
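For concreteness, here is a minimal C sieve of the kind the slide refers to; the limit is an arbitrary illustration value and this is not any standardized benchmark code.

/* Toy benchmark sketch: Sieve of Eratosthenes. Small, self-contained, and
 * it produces a result (the count of primes up to LIMIT) that the user can
 * already know, which is the defining property of a toy benchmark. */
#include <stdio.h>
#include <string.h>

#define LIMIT 8192

int main(void)
{
    char composite[LIMIT + 1];
    int  count = 0;

    memset(composite, 0, sizeof composite);
    for (int i = 2; i <= LIMIT; i++) {
        if (!composite[i]) {
            count++;
            for (int j = 2 * i; j <= LIMIT; j += i)
                composite[j] = 1;
        }
    }
    printf("%d primes up to %d\n", count, LIMIT);
    return 0;
}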
BENCHMARK SUITES
• A benchmark suite is a collection of benchmarks that tries to measure the performance of processors across a variety of applications.
• The advantage is that the weakness of
any one benchmark is lessened by the
presence of the other benchmarks.
• Some benchmarks of the suite are
kernels, but many are real programs.
BENCHMARK SUITES
Example: SPEC92 benchmark suite (20 programs)

Benchmark   Source    Lines of code   Description
Espresso    C         13,500          Minimize Boolean functions
Li          C          7,413          Lisp interpreter (9-queens problem)
Eqntott     C          3,376          Translate Boolean equations
Compress    C          1,503          Data compression
Sc          C          8,116          Computation in a spreadsheet
Gcc         C         83,589          GNU C compiler
Spice2g6    Fortran   18,476          Circuit simulation package
Doduc       Fortran    5,334          Simulation of a nuclear reactor
Mdljdp2     Fortran    4,458          Chemical application
Wave5       Fortran    7,628          Electromagnetic simulation
Tomcatv     Fortran      195          Mesh generation program
Ora         Fortran      535          Traces rays through an optical system
Alvinn      C            272          Simulation in neural networks
Ear         C          4,483          Inner ear model
…
COMPARING PERFORMANCE
                     Computer A   Computer B   Computer C
Program P1 (secs)          1           10           20
Program P2 (secs)       1000          100           20
Total time (secs)       1001          110           40

Execution times of two programs, and their total, on three machines
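Read directly from the table (a worked observation added here, not on the original slide), the two programs rank the machines in opposite ways:

\[ \frac{\text{Time}_B(P1)}{\text{Time}_A(P1)} = \frac{10}{1} = 10, \qquad \frac{\text{Time}_A(P2)}{\text{Time}_B(P2)} = \frac{1000}{100} = 10. \]

A is ten times faster than B on P1, while B is ten times faster than A on P2, so which machine is "faster" depends entirely on the workload mix; this motivates the summary measures on the following slides.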
CPU Performance Measures
TOTAL EXECUTION TIME:
An average of the execution times that tracks total execution time is the arithmetic mean:

\[ \text{Arithmetic mean} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i \]

where Time_i is the execution time of the i-th program of a total of n in the workload.

When performance is expressed as a rate, we use the harmonic mean:

\[ \text{Harmonic mean} = \frac{n}{\sum_{i=1}^{n} \dfrac{1}{\text{Rate}_i}} \]

where Rate_i = 1/Time_i is the execution rate of the i-th of the n programs in the workload. It is used when performance is measured in MIPS or MFLOPS.
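A small C sketch (added here for illustration, not from the original slides) computes both means for Computer B, using the P1/P2 times from the earlier table. The "rate" here is simply 1/time, i.e. program runs per second, standing in for MIPS or MFLOPS.

/* Hedged sketch: arithmetic mean of execution times and harmonic mean of
 * the corresponding rates for Computer B (P1 = 10 s, P2 = 100 s). */
#include <stdio.h>

int main(void)
{
    const double time[] = { 10.0, 100.0 };   /* Computer B: P1, P2 (seconds) */
    const int n = sizeof time / sizeof time[0];

    double sum_time = 0.0;      /* for the arithmetic mean of times   */
    double sum_inv_rate = 0.0;  /* for the harmonic mean of the rates */

    for (int i = 0; i < n; i++) {
        double rate = 1.0 / time[i];         /* Rate_i = 1 / Time_i */
        sum_time     += time[i];
        sum_inv_rate += 1.0 / rate;
    }

    printf("arithmetic mean time = %.2f s\n", sum_time / n);          /* 55.00 */
    printf("harmonic mean rate   = %.4f runs/s\n", n / sum_inv_rate); /* ~0.0182 */
    return 0;
}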
CPU Performance Measures
WEIGHTED EXECUTION TIME
A question arises: what is the proper mixture of programs for the workload? In the arithmetic mean we assume programs P1 and P2 are run equally often in the workload.
A weighted arithmetic mean is given by

\[ \text{Weighted arithmetic mean} = \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i \]

where Weight_i is the frequency of the i-th program in the workload and Time_i is the execution time of program i.
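As a worked check against the table on the next slide (using its equal weighting W1 = 0.5 for each program), the weighted mean time for Computer A is

\[ 0.5 \times 1\,\text{s} + 0.5 \times 1000\,\text{s} = 500.5\,\text{s}, \]

which matches the 500.50 entry in the W1 row.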
CPU Performance Measures
                        Comp A     Comp B     Comp C       W1       W2       W3
Program P1 (secs)            1         10         20       .50     .909     .999
Program P2 (secs)         1000        100         20       .50     .091     .001
Arithmetic mean: W1     500.50       55.0       20.0
Arithmetic mean: W2      91.91      18.19       20.0
Arithmetic mean: W3        2.0      10.09       20.0

Weighted arithmetic mean execution times using three weightings
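One way to read the weightings (an interpretation added here, not stated on the slide): W2 and W3 appear to be chosen so that the two programs contribute roughly equal time on Computer B and on Computer A respectively, e.g. 0.909 × 10 ≈ 0.091 × 100. Verifying one entry, the W3 mean for Computer B is

\[ 0.999 \times 10 + 0.001 \times 100 = 9.99 + 0.10 = 10.09\ \text{s}, \]

which matches the table.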
COMMON SYSTEM
BENCHMARKS
007 (OODBMS).
Designed to simulate a CAD/CAM environment.
Tests:
- Pointer traversals over cached data; disk resident data;
sparse traversals; and dense traversals
- Updates: indexed and unindexed object fields; repeated updates; sparse updates; updates of cached data; and creation and deletion of objects
- Queries: exact match lookup; ranges; collection scan;
path-join; ad-hoc join; and single-level make.
Originator: University of Wisconsin
Versions: Unknown
Availability of Source: Free from ftp.cs.wisc.edu:/007
Availability of Results: Free from ftp.cs.wisc.edu:/007
Entry Last Updated: Thursday April 15 15:08:07 1993
AIM
AIM Technology, Palo Alto. Two suites (III and V).
Suite III: simulation of applications (task- or device
specific)
- Task specific routines (word processing, database
management, accounting)
- Device specific routines (memory, disk, MFLOPs,
IOs)
- All measurements represent a percentage of VAX 11/780 performance (100%). In general, Suite III gives an overall indication of performance.
Suite V: measures throughput in a multitasking
workstation environment by testing:
- Incremental system loading
- Multiple aspects of system performance
The graphically displayed results plot the workload level versus time. Several different models characterize various user environments (financial, publishing, software engineering). The published reports are copyrighted.
Dhrystone
Short synthetic benchmark program intended to be representative of system
(integer) programming. Based on published statistics on use of programming
language features; see original publication in CACM 27,10 (Oct. 1984), 1013-1030.
Originally published in Ada, now mostly used in C. Version 2 (in C) published
in SIGPLAN Notices 23,8 (Aug. 1988), 49-62, together with measurement rules.
Version 1 is no longer recommended since state-of-the-art compilers can eliminate
too much "dead code" from the benchmark (However, quoted MIPS numbers are
often based on Version 1.)
Problems: Due to its small size (100 HLL statements, 1-1.5 KB code), the memory system outside the cache is not tested; compilers can too easily optimize for Dhrystone; and string operations are somewhat overrepresented.
Recommendation: Use it for controlled experiments only; don't blindly trust single
Dhrystone MIPS numbers quoted somewhere (as a rule, don't do this for any
benchmark).
Originator: Reinhold Weicker, Siemens Nixdorf
Versions in C: 1.0, 1.1, 2.0, 2.1 (final version, minor corrections compared with 2.0)
Khornerstone
Multipurpose benchmark used in various periodicals.
Originator: Workstation Labs
Versions: unknown
Availability of Source: not free
Availability of Results: UNIX Review
LINPACK
Kernel benchmark developed from the "LINPACK" package of linear algebra
routines. Originally written and commonly used in FORTRAN; a C version also
exists. Almost all of the benchmark's time is spent in a subroutine ("saxpy" in the single-precision version, "daxpy" in the double-precision version) doing the inner loop for frequent matrix operations: y(i) = y(i) + a * x(i). The standard version operates on 100x100 matrices; there are also versions for sizes 300x300 and 1000x1000, with different optimization rules.
Problems: Code is representative only for this type of computation. LINPACK
is easily vectorizable on most systems.
Originator: Jack Dongarra, Computer Science Department, University of Tennessee
MUSBUS
Designed by Ken J. McDonell at Monash University in Australia, MUSBUS is a very good benchmark of disk throughput and multi-user simulation.
It compiles the test programs, creates the directories and the workload for simulated users, and executes the simulation three times, measuring CPU and elapsed time.
The workload consists of 11 commands (cc, rm, ed, ls, cp, spell, cat, mkdir, export, chmod, and an nroff-like spooler) and 5 programs (syscall, randmem, hanoi, pipe, and fstime). This is a very complete test which gives a significant measurement of CPU speed, C compiler and UNIX quality, file system performance, multi-user capability, disk throughput, and memory management implementation.
Nhfsstone
A benchmark intended to measure the performance of file servers that follow
the NFS protocol. The work in this area continued within the LADDIS group
and finally within SPEC. The SPEC benchmark 097.LADDIS is intended to
replace Nhfsstone.
It is superior to Nhfsstone in several aspects (multi-client capability, less
client sensitivity).
SPEC
SPEC stands for Standard Performance Evaluation Corporation, a non-profit
organization whose goal is to "establish, maintain and endorse a standardized
set of relevant benchmarks that can be applied to the newest generation of
high performance computers" (from SPEC's bylaws). The SPEC benchmarks
and more information can be obtained from:
SPEC [Standard Performance Evaluation Corporation]
c/o NCGA [National Computer Graphics Association]
2722 Merrilee Drive
Suite 200
Fairfax, VA 22031
USA
The current SPEC benchmark suites are:
CINT92 (CPU intensive integer benchmarks)
CFP92 (CPU intensive floating point benchmarks)
SDM (UNIX Software Development Workloads)
SFS (System level file server (NFS) workload)
SSBA
The SSBA is the result of the studies of the AFUU (French Association of
UNIX Users) Benchmark Working Group. This group, consisting of some
30 active members of varied origins (universities, public and private research, manufacturers, end users), has assigned itself the task of assessing the
performance of data processing systems, collecting a maximum number
of tests available throughout the world, dissecting the codes and results,
discussing the utility, fixing versions, and supplying them with various
comments and procedures.
Sieve of Eratosthenes
An integer program that generates prime numbers using a method
known as the Sieve of Eratosthenes.
TPC
TPC-A is a standardization of the Debit/Credit benchmark which was first published
in DATAMATION in 1985. It is based on a single, simple, update-intensive
transaction which performs three updates and one insert across four tables.
Transactions originate from terminals, with a requirement of 100 bytes in and
200 bytes out. There is a fixed scaling between tps rate, terminals, and
database size. TPC-A requires an external RTE (remote terminal emulator) to
drive the SUT (system under test). TPC-C, a more complex order-entry benchmark, performs five kinds of transactions: entering a new order, delivering orders, posting customer payments, retrieving a customer's most recent order, and monitoring the inventory level of recently ordered items.
Whetstone
The first major synthetic benchmark program, intended to be representative of numerical (floating-point intensive) programming. Based on statistics
gathered at National Physical Laboratory in England, using an Algol 60 compiler
which translated Algol into instructions for the imaginary Whetstone machine.
The compilation system was named after the small town outside the City of
Leicester, England, where it was designed (Whetstone).
Problems: Due to the small size of its modules, the memory system outside the cache is not tested; compilers can too easily optimize for Whetstone; and mathematical library functions are overrepresented.
Originator: Brian Wichmann, NPL
Whetstone
One of the first and most popular benchmarks, the WHETSTONE was originally published in 1976 by Curnow and Wichmann in Algol and subsequently translated into FORTRAN. This synthetic mix of elementary Whetstone instructions is modeled with statistics from about 1000 scientific and engineering applications. The WHETSTONE is rather small and, due to its straightforward coding, may be prone to particular (and unintentional) treatment by intelligent compilers. It is very sensitive to the processing of transcendental and trigonometric functions, and completely dependent on a fast or additional mathematics coprocessor. The WHETSTONE is a good predictor for engineering and scientific applications.
SYSmark
SYSmark93 provides benchmarks that can be used to measure
performance of IBM PC-compatible hardware for the tasks users
perform on a regular basis. SYSmark93 benchmarks represent
the workloads of popular programs in such applications as word
processing, spreadsheets, database, desktop graphics, and
software development.
Stanford
A collection of C routines developed in 1988 at Stanford University
(J. Hennessy, P. Nye). Its two modules, Stanford Integer and Stanford
Floating Point, provide a baseline for comparisons between Reduced
Instruction Set (RISC) and Complex Instruction Set (CISC) processor
architectures.
Stanford Integer:
- Eight applications (integer matrix multiplication, sorting algorithm
[quick, bubble, tree], permutation, hanoi, 8 queens puzzle)
Stanford Floating Point:
- Two applications (Fast Fourier Transform [FFT] and matrix multiplication)
The characteristics of the programs vary, but most of them have array accesses. There seems to be no official publication (only a printing in a performance report), and there is no defined weighting of the results (Sun and MIPS compute the geometric mean).
Bonnie
This is a file system benchmark that attempts to study bottlenecks.
Specifically, these are the types of filesystem activity that have been observed to be bottlenecks in I/O-intensive applications, in particular the text database work done in connection with the New Oxford English Dictionary Project at the University of Waterloo. It performs
a series of tests on a file of known size. By default, that size is
100 Mb (but that's not enough - see below). For each test, Bonnie
reports the bytes processed per elapsed second, per CPU second,
and the percent CPU usage (user and system). In each case,
an attempt is made to keep optimizers from noticing it's all bogus.
The idea is to make sure that these are real transfers to/from user
space to the physical disk.
IOBENCH
IOBENCH is a multi-stream benchmark that uses a controlling process
(iobench) to start, coordinate, and measure a number of "user" processes
(iouser); the Makefile parameters used for the SPEC version of IOBENCH
cause ioserver to be built as a "do nothing" process.
IOZONE
This test writes an X MB sequential file in Y byte chunks, then rewinds it
and reads it back. [The size of the file should be big enough to factor out
the effect of any disk cache.] Finally, IOZONE deletes the temporary file.
The file is written (filling any cache buffers), and then read.
If the cache is >= X MB, then most if not all of the reads will be satisfied
from the cache. However, if the cache is <= .5X MB, then NONE of the
reads will be satisfied from the cache. This is because after the file is written, a .5X MB cache will contain the upper .5X MB of the test file, but we will start reading from the beginning of the file (data which is no longer in the cache).
In order for this to be a fair test, the length of the test file must be AT LEAST
2X the amount of disk cache memory for your system. If not, you are really
testing the speed at which your CPU can read blocks out of the cache
(not a fair test).
Byte
This famous test taken from Byte (1984), originally targeted at microcomputers,
is a benchmark suite similar in spirit to SPEC, except that it is smaller and
contains mostly things like "Sieve of Eratosthenes" and "Dhrystone".
If you are comparing different UNIX machines for performance, this gives
fairly good numbers.
Netperf
A networking performance benchmark/tool. Includes throughput (bandwidth)
and request/response (latency) tests for TCP and UDP using the
BSD sockets API, DLPI, UNIX Domain Sockets, the Fore ATM API,
and HP HiPPI Link Level Access.
See ftp://ftp.cup.hp.com/dist/networking/benchmarks and
ftp://sgi.com
Nettest
A network performance analysis tool developed at Cray.
TTCP
TTCP is a benchmarking tool for determining TCP and UDP performance
between two systems. TTCP times the transmission and reception of data
between two systems using the UDP or TCP protocols. It differs from
common "blast" tests, which tend to measure the remote Internet daemon (inetd)
as much as the network performance, and which usually do not allow
measurements at the remote end of a UDP transmission.
This program was created at the US Army Ballistic Research Laboratory (BRL).
CPU2
The CPU2 benchmark was invented by Digital Review
(now Digital News and Review). To quote DEC, describing DN&R's benchmark,
CPU2 "...is a floating point intensive series of FORTRAN programs and consists
of thirty-four separate tests. The benchmark is most relevant in predicting the
performance of engineering and scientific applications. Performance is
expressed as a multiple of MicroVAX II Units of Performance.
The CPU2 benchmark is available via anonymous ftp from
swedishchef.lerc.nasa.gov in the drlabs/cpu directory.
Get cpu2.unix.tar.Z for unix systems or cpu2.vms.tar.Z for VMS systems."
Hartstone
Hartstone is a benchmark for measuring various aspects of hard real-time systems, from the Software Engineering Institute at Carnegie Mellon.
PC Bench/WinBench/NetBench
PC Bench 9.0, WinBench 95 Version 1.0, Winstone 95 Version 1.0,
MacBench 2.0, NetBench 3.01, and ServerBench 2.0 are the current names
and versions of the benchmarks available from the Ziff-Davis Benchmark
Operation (ZDBOp).
Sim
An integer program that compares DNA segments for similarity.
Fhourstones
A small integer-only program that solves positions in the game of connect-4
using exhaustive search with a very large transposition table. Written in C.
Heapsort
An integer program that uses the "heap sort" method of sorting a random
array of long integers up to 2 MB in size.
Hanoi
An integer program that solves the Towers of Hanoi puzzle using recursive
function calls.
Flops C
Estimates MFLOPS rating for specific floating point add, subtract, multiply,
and divide (FADD, FSUB, FMUL, and FDIV) instruction mixes. Four distinct
MFLOPS ratings are provided based on the FDIV weightings from 25% to 0%
and using register-to-register operations. Works with both scalar and vector
machines.
C LINPACK
The LINPACK floating point program converted to C.
TFFTDP
This program performs FFTs using the Duhamel-Hollman method for FFTs
from 32 to 262,144 points in size.
Matrix Multiply (MM)
This program contains nine different algorithms for doing matrix
multiplication (500 X 500 standard size). Results illustrate the effects
of cache thrashing versus algorithm, machine, compiler, and compiler options.
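The cache-thrashing point can be demonstrated with nothing more than loop order. The sketch below is an illustration, not the MM benchmark itself: it multiplies 500x500 matrices (the size quoted above) with two loop orders, and the i-k-j variant typically runs noticeably faster because it walks B and C row-wise instead of striding down columns of B.

/* Loop-order sketch for matrix multiplication: same arithmetic, very
 * different cache behaviour. */
#include <stdio.h>
#include <time.h>

#define N 500
static double A[N][N], B[N][N], C[N][N];

static void mm_ijk(void)                     /* strides column-wise through B */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

static void mm_ikj(void)                     /* walks B and C row-wise */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double a = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}

static double timed(void (*f)(void))
{
    clock_t t0 = clock();
    f();
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    printf("i-j-k order: %.3f s\n", timed(mm_ijk));
    printf("i-k-j order: %.3f s\n", timed(mm_ikj));
    printf("check: C[0][0] = %g\n", C[0][0]);  /* 1000 with either order */
    return 0;
}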
BENCHMARKING PITFALLS?
• Optimization options on today's compilers can affect the results of benchmark tests (see the dead-code sketch below).
• Modification of the sources (public-domain software) produces different versions of the benchmark.
• Many benchmarks are one-dimensional in nature (they test only one aspect of a system); different aspects to test include CPU, I/O, file system, etc.
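A hedged C illustration of the first pitfall (and of the "dead code" remark in the Dhrystone entry): the timed loop's result is never used, so an optimizing compiler may legally delete the loop and the reported time collapses toward zero. The loop body and iteration count are arbitrary.

/* Dead-code pitfall sketch: with optimization enabled (e.g. -O2) the compiler
 * may remove the loop entirely because `sum` is never used afterwards, so the
 * benchmark appears to run in ~0 s. Using the result (see the commented-out
 * printf) defeats this particular optimization. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start = clock();

    double sum = 0.0;
    for (long i = 0; i < 100000000L; i++)
        sum += i * 0.5;                      /* result unused below: dead code */

    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("elapsed: %.3f s\n", secs);
    /* printf("sum = %f\n", sum);               <- keeps the loop alive */
    return 0;
}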
BENCHMARKING PITFALLS?
• A compiler can “recognize” a benchmark suite and substitute hand-optimized algorithms for the test.
RECOMMENDATIONS
• A user should determine which aspects of
system or component performance are to be
measured.
• Determine the best source of benchmark suites
or performance data (either public-domain or
licensed third-party packages).
• Ensure that all system hardware and OS parameters match as closely as possible across benchmark comparisons.
• Understand what specific benchmark tests
measure and what causes the results to vary.
BENCHMARKING RULES
(Example in Neural networks)
• describe and standardize ways of setting up experiments,
documenting these setups, measuring results, and documenting
these results (goal: maximize comparability of experimental results)
• Problem: name, address, version/variant.
• Training set, validation set, test set.
• Network: nodes, connections, activation functions.
• Initialization.
• Algorithm parameters and parameter adaptation rules.
• Termination, phase transition, and restarting criteria.
• Error function and its normalization on the results reported.
• Number of runs, rules for including or excluding runs in results
reported.