How to design software and APIs for parallelism on modern hardware
- Richard Parker, mathematician and freelance computer programmer in Cambridge, England
I see only three techniques capable of speeding up well-written but old software by a factor of ten or more:
- Using multiple cores
- Making good use of cache memory
- Using the full power of a single processor
So far, commercial software seems to have ignored recent changes in hardware. Often, performance does not really matter; but when it does, exploiting the hardware requires changes to the design and coding of the software.
I have been looking into the changes needed for several years now, and it is gradually emerging that one aspect sticks out as being more important than all the others.
If functions, subroutines, methods etc. do a large number of whatever they do, rather than just one, it becomes possible to write clever implementations that run much faster. Several examples of this will be discussed.
I suggest that software designed and written today should take this into account, making it easier to speed it up if and when it becomes necessary to do so.
1. Software + Babies
How to design software and APIs for parallelism on modern hardware
Richard Parker
Mathematician and freelance computer programmer in Cambridge, England
Cologne, 20.01.2016
2. Maths research 2011-2014.
• My original (hobby) project was to work with matrices mod 2, obsessively fast.
• I treated even a 5% speedup as compulsory.
• The established programs were highly optimized but up to 30 years old, and had been in use for many years.
• Mine were ~300 times faster in the end.
3. I learned a lot of tricks.
• I tried to consider every possible trick.
• I had to understand how all the parts of the (x86) microprocessor worked.
• And Intel AVX-512F when it arrives.
• I have been blogging at meataxe64 (on WordPress) if anyone is interested.
4. Is this useful commercially?
• I felt there must be situations where performance improvements of x10 or x100 would be useful in the real world.
• I have recently been applying these ideas in the context of the ArangoDB database.
• My conclusion is that there are three main software considerations which can have a major (x10) influence on performance.
5. 3 pillars of performance.
• Multi-Core. If you use a lot of cores, you can get a lot more done per second.
• Use Cache. Fetches from real memory take x100 as long as from cache. Real-memory bandwidth matters too.
• Single-thread parallelism. Even one processor can do many things at once.
6. Multi-Core
• 16 cores can do a lot more work per second than one core.
• Using them is not always easy, and the results are often disappointing . . .
• because the memory system often gets saturated, perhaps even with 4 cores, and using more cores doesn't help.
7. Make better use of cache.
• This often requires a new algorithm.
• This work is still in its infancy, but it is gradually becoming mainstream.
• Two algorithms with the same “complexity”, such as quicksort and heapsort, can easily differ by a factor of 10 because of cache.
8. Single Thread Parallelism
• This Haswell processor is designed to issue four instructions every clock cycle.
• Three of those four can be “vector” instructions, operating on the 256-bit vector registers.
• The result is that, roughly speaking, the processor can do 20-40 similar things in the same time it takes to do one.
9. Higher level implications.
• In a sense, these are all low-level tricks.
• But to make good use of all these low-level tricks, the high-level software must also change . . . and one change is critical:
• Any subroutine call should do a batch of stuff, not just one thing.
10. Fred Brooks.
• In “The Mythical Man-Month”, Brooks wrote that "nine women can't make a baby in one month".
• He was talking about writing programs, not running them, but the same principle is important, and getting more important, for program execution.
11. Ask for more Babies.
• In its simplest form, don't write
for(i=0;i<20;i++) MakeOneBaby()
• instead write
MakeSomeBabies(20)
• The first version takes 15 years to execute.
• It must surely be clear that the second version can be done faster, depending on the resources available. :-)
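The contrast above can be sketched in C with a toy operation (squaring, as a stand-in for MakeOneBaby; the names square_one and square_some are hypothetical):

```c
#include <stddef.h>

/* One-at-a-time interface: the caller owns the loop, so each call
 * sees only a single element. */
double square_one(double x) { return x * x; }

/* "Babies" interface: the implementation owns the loop, giving it
 * room to vectorize, unroll, or hoist shared setup out of the loop. */
void square_some(size_t n, const double *x, double *out)
{
    for (size_t i = 0; i < n; i++)  /* a compiler can vectorize this */
        out[i] = x[i] * x[i];
}
```

The caller writes square_some(20, x, out) instead of a loop of 20 square_one calls, and the batch routine is then free to become much faster later without any caller changing.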
12. Example - Cosine
• The computation of the cosine of an angle takes about 50 clock cycles.
double cosine(double ang)
• The “babies” version can manage about one cosine every 5 clock cycles once it gets going. Ten times faster.
void cosines(int ct, double * ang, double * a)
13. Why is Babies cosine faster?
• 4 identical operations can be done in the vector registers at the same speed as one. This gives an immediate factor of 4.
• The hardware can start another cosine rather than wait for an intermediate result.
• Duplicated work (in this case loading the coefficients of the series expansion) can be done just once.
14. Even if count is 1 . . .
• With just a little care, the babies version with a count of 1 is no slower than the non-babies version.
• For example, the routine could start by testing whether the count is 1, and branching to new code if the count is not 1.
• The (not-taken) branch will be correctly predicted if the count always is 1, and usually no time at all will be lost.
15. Throughput, not Latency.
• for(i=0;i<20;i++) makeonebaby()
This depends on the Latency - time from starting one to finishing that one.
• makesomebabies(20)
This depends on the Throughput - number that can be done on average in unit time.
16. Latency constant, throughput increasing.
• We seem to have hit the laws of physics with latency. It is quite hard to get electronics to (say) do a double-precision multiply and get the answer faster than ~2 nSec.
• But there is little to prevent it doing a lot at once. Indeed it can do forty since Sandy Bridge.
17. Think of x86 as many units.
• Each single x86 core . . .
• Has a huge number of execution units just waiting for the chance to do something useful for the program.
• If you can use a lot more of them at once, you can get a lot more done per unit time!
18. Not just arithmetic
• For example, fetches from L1 cache take two clock cycles . . .
• But fetching 32 bytes is no slower than fetching 8 (or fewer), and you can issue two of these every clock cycle . . .
• So the throughput of loading 8-byte data is sixteen times greater than the latency alone would suggest.
19. Much the same with memory
• The fetch of a (64-byte) cache-line on this computer has a 180 clock-cycle latency.
• But it can usually do three at once . . .
• and then you have 180 clock cycles to be getting on with other work that does not depend on those fetches.
• If, that is, you have something to do.
20. Example - database updates.
• When applying updates to a database, particularly if it is distant (i.e. has a high ping-time), it can be important to batch them up and send off the whole batch at once.
• This is an example of the “babies” concept that is already mainstream - DTO (Data Transfer Objects).
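A minimal sketch of the batching pattern, with hypothetical names (queue_update, send_batch) and a counter standing in for the real network send, so the round-trip saving is visible:

```c
#include <stddef.h>

#define BATCH_MAX 64

/* Updates accumulate here instead of each paying a full round-trip. */
struct batch {
    const char *updates[BATCH_MAX];
    size_t count;
    size_t flushes;   /* round-trips actually made */
};

/* Stand-in for the real network send: one round-trip per call.
 * In reality this would serialize b->count updates and transmit them. */
static void send_batch(struct batch *b)
{
    if (b->count == 0) return;
    b->flushes++;
    b->count = 0;
}

/* Callers queue updates one by one; the batch goes out only when
 * full (or when explicitly flushed), not on every call. */
static void queue_update(struct batch *b, const char *upd)
{
    b->updates[b->count++] = upd;
    if (b->count == BATCH_MAX)
        send_batch(b);
}
```

With a 64-entry batch, 130 queued updates cost three round-trips (two automatic flushes plus one final one) instead of 130.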
21. Example - heap.
• A heap allows you to put a key/value pair in, and take out the one with the smallest key.
• I could tell you about a version that runs considerably faster. You need to put a few pairs in, and take a few pairs (sorted) out at each step.
• But even if you don't know (yet) how to do that, it is sensible to use the “babies” interface anyway.
• Perhaps you'll think of something later.
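A sketch of such a “babies” heap interface in C (keys only, for brevity; heap_put_some and heap_pop_some are hypothetical names). The implementation here just loops over an ordinary binary min-heap; the point is that the batch interface leaves room for a cleverer implementation later without changing any caller.

```c
#include <stddef.h>

#define HEAP_MAX 1024

struct heap { int key[HEAP_MAX]; size_t n; };

static void sift_up(struct heap *h, size_t i)
{
    while (i > 0 && h->key[(i - 1) / 2] > h->key[i]) {
        int t = h->key[i];
        h->key[i] = h->key[(i - 1) / 2];
        h->key[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static void sift_down(struct heap *h, size_t i)
{
    for (;;) {
        size_t c = 2 * i + 1;
        if (c >= h->n) return;
        if (c + 1 < h->n && h->key[c + 1] < h->key[c]) c++;
        if (h->key[i] <= h->key[c]) return;
        int t = h->key[i]; h->key[i] = h->key[c]; h->key[c] = t;
        i = c;
    }
}

/* "Babies" insert: put a few keys in per call. */
void heap_put_some(struct heap *h, size_t ct, const int *keys)
{
    for (size_t i = 0; i < ct; i++) {
        h->key[h->n] = keys[i];
        sift_up(h, h->n++);
    }
}

/* "Babies" extract: take the ct smallest keys out, in sorted order. */
void heap_pop_some(struct heap *h, size_t ct, int *out)
{
    for (size_t i = 0; i < ct; i++) {
        out[i] = h->key[0];
        h->key[0] = h->key[--h->n];
        sift_down(h, 0);
    }
}
```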
22. Example - Hash table
• Here the dead time is waiting for a memory fetch.
• It should be clear what the interface should be . . .
• You look-up (insert, delete, whatever) many things in one call. Give it a chance to be clever and fast.
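A sketch of a batched lookup over a toy open-addressing table. The two-pass shape is the point: the first pass starts all the memory fetches (here with the GCC/Clang __builtin_prefetch builtin), so the long cache-miss latencies overlap instead of being paid one after another. All names are hypothetical.

```c
#include <stddef.h>

#define TABLE_SIZE 1024          /* power of two */

struct entry { int key; int val; int used; };

/* Toy multiplicative hash (Knuth's constant), masked to the table. */
static size_t slot_of(int key)
{
    return (size_t)((unsigned)key * 2654435761u) & (TABLE_SIZE - 1);
}

/* Simple linear-probing insert, for building the table. */
void insert_one(struct entry *tab, int key, int val)
{
    size_t s = slot_of(key);
    while (tab[s].used) s = (s + 1) & (TABLE_SIZE - 1);
    tab[s].key = key; tab[s].val = val; tab[s].used = 1;
}

/* "Babies" lookup: out[i] = value for keys[i], or -1 if absent. */
void lookup_some(const struct entry *tab, size_t ct,
                 const int *keys, int *out)
{
    /* First pass: start all the memory fetches at once. */
    for (size_t i = 0; i < ct; i++)
        __builtin_prefetch(&tab[slot_of(keys[i])]);

    /* Second pass: the lines are (hopefully) already in cache. */
    for (size_t i = 0; i < ct; i++) {
        size_t s = slot_of(keys[i]);
        out[i] = -1;
        while (tab[s].used) {               /* linear probing */
            if (tab[s].key == keys[i]) { out[i] = tab[s].val; break; }
            s = (s + 1) & (TABLE_SIZE - 1);
        }
    }
}
```

The one-at-a-time interface cannot do this: by the time it knows the next key, it has already stalled on the previous fetch.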
23. Example - GeoIndex
• I have very recently looked at how one can find the nearest points in a GeoIndex.
• There are several slowish steps - memory fetch, trig functions, distance computations, sorting the results etc.
• Every one of these steps can benefit greatly by handling multiple points rather than just one.
24. Compilers good without babies.
• If one implements, say, the cosine function (one-at-a-time interface), the code is dependency-bound anyway. The exact algorithm wanted must be carefully coded to minimize the dependency chain, but if this is done in (say) C, it will run well.
• An assembler programmer will be hard pressed to get more than a few percent improvement.
25. “Babies” changes everything.
• If you want to write a “babies” version of cosine, it is an uphill struggle to get a compiler to generate good code.
• The use of vector instructions, the partial loop unrolling, the use of conditional moves rather than branches etc. etc. etc. make the challenge too hard for even a modern compiler.
26. Compilers may get better.
• Once programmers are asking more of their compilers (by writing babies versions) this may give a wake-up call.
• I would love to see a language which helps me write top-performance code!
• As of today, however, the answer is often to write the “leaf” babies routines in assembler code.
27. “Ninja Assembler”
• Today's assemblers have changed little in 50 years . . .
• But it is not hard to imagine a language designed for writing high-performance x86 code, but still offering many of the luxuries of a high-level language.
• One issue is portability. Portability is a big obstacle to making best use of the machine you actually have!
28. Portability problems.
• Laying out data for the vector instructions.
• Aligning parts of a data structure.
• Specializing types for vector use.
• When is a branch unpredictable?
• What is the width of the vector registers?
• When may a data fetch miss cache?
• Cramming data into a single cache-line.
• At the 5% level there are many more.
29. The 80/20 rule.
• Even for “peak performance” code, ~80% of the time is spent in ~20% of the code.
• We accept writing that 20% in assembler . . .
• But this determines the data layout.
• So it is very hard, at the moment anyway, to write the other 80% in a high-level language.
• Mainly because of portability!
30. Strategic view
• As of today, it seems to me that a properly programmed x86 can outperform anything.
• What I am less clear about is whether the software community can afford the extra time and cost needed to program it properly.
• My aim is to try to reduce that cost.
31. Summary
• Try to use “babies” interfaces at all times - even if you can't immediately see how or why you might want to make it faster.
• As of today, a small, assembler, babies routine can make a big difference to the speed of a whole system.
• And we can hope for better software development tools in the future.