CALCULATING DEDEKIND NUMBERS WITH A 6-NODE
SUPERCOMPUTER
KULAK, KU Leuven, Kortrijk, Belgium
Daan Seynaeve, Max Dekoninck
Computer Science
E-mail: daan.seynaeve@student.kuleuven.be, max.dekoninck@student.kuleuven.be
Abstract
This paper introduces a parallel version of an existing algorithm to compute Dedekind numbers. By
their nature, these numbers form a very rapidly increasing sequence, of which the exact values are only
known for the first eight numbers. Test results of this algorithm on a 6-node supercomputer built with
Raspberry Pi circuit boards are presented. These results show that even when using a fairly simple
approach, the speedup is significant.
Introduction
Dedekind numbers, named after Richard
Dedekind, have proven notoriously challenging
to calculate. However, the calculation of the eighth
Dedekind number within reasonable time bounds [1]
has led us to believe that it is worth examining how
well existing procedures can be parallelized, especially
with an eye on an attempt at calculating the next, at
the time of writing unknown, 9th number. To this
end, we interconnected several Raspberry Pi circuit
boards, which are a very cheap, but less powerful
alternative to full-blown desktop computers. We start
off with an informal description of the problem space,
followed by a quick tour of the specific algorithm we
used. We avoid going into much detail and rather
suggest [?] for a more rigorous definition of the
problem space and [1] for the algorithm. We explain
how we used message passing to divide the workload
in a balanced manner and present an updated, parallel
version of the algorithm. Afterwards we elaborate on
the hardware we used (Raspberry Pi Model B) and
our network set-up. We provide a step-by-step
walkthrough of our set-up, which we hope will allow
the reader to recreate the network. Lastly, we report
on our results, which, while not that impressive in
terms of raw speed, are a clear indicator that it pays
off to execute the mentioned algorithm in parallel.
I. Problem and Definitions
The problem was first posed by Richard
Dedekind [2] and there are a number of different ways
in which it can be formulated. One of them is the
following: consider a set N of elements, labeled
1, 2, ..., n for simplicity. If we look at all the possible
subsets of this set and select a number of them, such
a selection can have a particular property: if a set is
selected, no subset of it may be selected. Throughout
this paper, we refer to this kind of selection as
an Anti-Monotonic Function. An example for n =
3 is given in Figure 1, where the selected sets are
highlighted. The Dedekind number is the numerical
answer to the question: how many of these Anti-
Monotonic Functions are there? With AMF(n) we
denote the set of Anti-Monotonic Functions with n
elements. Using this notation, the n-th Dedekind
number is given by #AMF(n).
Figure 1 – An Anti-Monotonic Function for n = 3: the Hasse diagram of all subsets of {1,2,3}, with the selected sets highlighted
Within AMF(n), two relations are particularly
important. One is a partial order, which we denote
with ≤:
Assume α, β ∈ AMF(n); then α ≤ β if every set of
α is a subset of some set in β.
The other one is an equivalence relation:
Assume α, β ∈ AMF(n); then α is equivalent to β if
there exists a permutation π of the elements of N
such that β = {π(X) | X ∈ α}.
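Because the definition is so compact, a brute-force check is possible for very small n and helps to build intuition. The C++ sketch below is our own illustration and not the algorithm discussed in this paper: it represents each subset of N as a bitmask, each selection of subsets as a bitmask over those bitmasks, and rejects every selection in which a selected set contains another selected set. For n = 3 it prints 20, in line with the known start of the sequence 2, 3, 6, 20, 168, 7581, ...

#include <cstdint>
#include <iostream>

// Brute-force count of Anti-Monotonic Functions (illustration only):
// every subset of N = {1,...,n} is a bitmask of n bits, and every
// selection of subsets is a bitmask of 2^n bits. Feasible for n <= 4,
// since there are 2^(2^n) selections to check.
int main() {
    const int n = 3;
    const int numSubsets = 1 << n;                    // 2^n subsets of N
    std::uint64_t count = 0;
    for (std::uint64_t sel = 0; sel < (1ULL << numSubsets); ++sel) {
        bool antiMonotonic = true;
        for (int a = 0; a < numSubsets && antiMonotonic; ++a) {
            if (!((sel >> a) & 1)) continue;          // subset a not selected
            for (int b = 0; b < numSubsets; ++b) {
                if (b == a || !((sel >> b) & 1)) continue;
                if ((a & b) == a) {                   // a is a proper subset of b
                    antiMonotonic = false;
                    break;
                }
            }
        }
        if (antiMonotonic) ++count;
    }
    std::cout << "#AMF(" << n << ") = " << count << std::endl;
    return 0;
}

The 2^(2^n) possible selections make this approach hopeless beyond n = 4, which is exactly why the interval-based algorithm of the next section is needed.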
Algorithm 1 Calculating the (n + 2)-th Dedekind number by enumerating AMF(n)
Require: n ∈ N, n ≥ 0, N = {1, 2, ..., n}
function PC2-Dedekind(n + 2)
Step 1:
Compute the set R of AMF's that are unique under permutation of the elements, and calculate the sizes of the
corresponding equivalence classes as Count : R → N.
For each element r of R, calculate the size of the interval [∅, r] as L-Size : R → N and the size of the
interval [r, {N}] as R-Size : R → N.
Step 2:
sum ← 0
for r0 ∈ AMF(n) do
partialsum ← 0
for rN ∈ R, rN ≤ r0 do
partialsum ← partialsum + Count(rN ) ∗ P-Coeff(r0, rN ) ∗ L-Size(rN )
end for
sum ← sum +(partialsum ∗ R-Size(r0))
end for
return sum
end function
II. The Algorithm
The algorithm for calculating #AMF(m) for
some m is shown above as Algorithm 1. Once again,
we refer to [?] and [1] for a more precise discussion
of this algorithm and the underlying concepts, such
as the intervals and P-Coefficients. We did, however,
include it for comparison with the parallel version
we introduce later on. The algorithm works as
follows: in the first step, equivalence classes for
AMF(m − 2) are generated, each represented by one
of their elements. The equivalence relation used is the
one discussed earlier. For each representative, the size
of the corresponding equivalence class is calculated,
as well as the left and right interval sizes. In the
second step, this information is used to calculate the
size of AMF(m) with a double sum. The outer one
sums over AMF(m − 2), while the inner one sums over
the elements of R. An implementation of this algorithm
exists in Java. It performs well, being able
to calculate #AMF(8) in just 40 hours. It is worth
noting that the second step is computationally the
most time-consuming, because it consists of calculating
and summing a large number of terms. For m = 7,
#AMF(m − 2) = 7581 and #R = 210, so we have to
sum 7581 ∗ 210 = 1,592,010 terms. However, these
terms can be calculated individually and summed
together in groups. For the remainder of this paper,
we focus on doing exactly that.
III. Message Passing Interface
One of the simplest approaches to parallel
computation is the master-slave paradigm. The main
computation branch, called the master, generates
subproblems and orders the slaves to solve these.
While there exist a great number of other paradigms,
we have chosen this simple one because we use a
relatively small, closely monitored cluster. To
communicate between the nodes, we use MPI [4], which is
a well-defined standard interface for interaction between
different computers on a network via message passing.
Although it gives the programmer a library with
communication functions, the connection relies on
SSH; therefore both an IP address and a password are
needed to connect one computer to another. A
limitation is that MPI bindings are readily available only
for C-family languages, Fortran and (through third-party
libraries) Python.
The first thing that is done in an MPI-based
application is initializing MPI and checking what
processing power is available. This usually means the
number of nodes in a network or, when executed on
a single machine, the number of available cores.
Because this can differ from time to time, a machine file
is included. The machine file contains all the IP
addresses and the number of cores available at each
address. When the application is executed, MPI checks
which addresses in the machine file are accessible.
Because of this, we have written the parallelized version
of the algorithm with a general number of processors
in mind.
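As an illustration of this start-up phase, the minimal C++/MPI program below (ours, not taken from the implementation described in this paper) initializes MPI and queries the rank of the current process and the total number of processes that were made available through the machine file:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                  // must be called before any other MPI call

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // id of this process: 0 .. size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes

    if (rank == 0)
        std::cout << "running on " << size << " process(es)" << std::endl;

    MPI_Finalize();
    return 0;
}

The rank is what allows each process to decide, without any communication, which part of the work is its own.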
Algorithm 2 MPI-embedded version of Step 2 of Algorithm 1
Require: q = rank of the process, m = number of processes
Let ρ0, ..., ρs−1 denote the elements of R, thus s = #R
Step 2:
sum ← 0
master ← 0
for r0 ∈ AMF(n) do
partialsum ← 0
i ← q
while i < s do
if ρi ≤ r0 then
partialsum ← partialsum + Count(ρi) ∗ P-Coeff(r0, ρi) ∗ L-Size(ρi)
end if
i ← i + m
end while
sum ← sum +(partialsum ∗ R-Size(r0))
end for
Collect:
if q = master then
for slave ∈ {1, ..., m − 1} do
MPI_Recv(received, slave)
sum ← sum + received
end for
else (slave)
MPI_Send(sum, master)
end if
IV. MPI-Embedded Algorithm
We previously noted that the second step of
Algorithm 1 is computationally the heaviest. For this
reason, we decided to focus on this step, which is
essentially one big for-loop. The common solution would
be to split the range into several parts, have each
Raspberry Pi calculate a partial sum, and then add all
of these together. For a balanced division of the
workload, we require parts that do not differ much in size.
In the outer loop we enumerate AMF(n), which is
done by iterating over the interval [∅, {N}]. Decomposing
this interval directly into parts is possible [1],
but it is hard to generate equally sized parts. Another
option is to let the master enumerate m elements
of AMF(n), have each of the m processors
evaluate the inner loop for one element, and continue
with enumerating the next m. This would introduce
communication overhead, however, as the #AMF(n)
elements would have to be transmitted. To avoid this,
each processor could enumerate all the elements of
AMF(n), but only evaluate once every m times.
Instead of focusing on the outer loop, we decided to
parallelize the inner loop. This is a much simpler
task, since the range is the set R, which we have
calculated in step 1. The drawback is that some work
is performed multiple times, like the enumeration of
AMF(n).
The question that remains is how to divide R.
Perhaps the simplest solution is to cut the set into m
parts, have one Pi evaluate the first #R/m elements,
another one the next #R/m ones, and so on. In our
opinion, a more elegant solution is to interleave the
evaluations, as illustrated in Figure 2.
Figure 2 – Division of R: the elements are dealt out to the m processes in an interleaved fashion (the remainder goes to the first processes)
This has the added benefit that every process
evaluates all kinds of elements of R, both from
the beginning and the end. Should a particular group of
elements be easier to calculate, for example the ones
at the beginning, our approach would arguably
produce a more balanced division of the workload. One
can of course imagine specific kinds of clustering for
which our approach performs very badly, but in most of
those cases adding or removing a processor from the
network would solve the problem. After each Raspberry
Pi has calculated its part of the sum, it sends it
to the master, which is responsible for adding the
partial sums together into the final result: #AMF(n + 2).
The master has to wait for the consecutive responses
from the slaves, so the total running time is the
maximum of the individual running times. This leads to
Algorithm 2. In practice, our approach yields good
results in terms of balancing.
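A stripped-down C++/MPI sketch of this scheme is shown below. It is our illustration of the structure of Algorithm 2 rather than the actual implementation: the dummy terms stand in for the Count · P-Coeff · L-Size products, s plays the role of #R, and the collection at the master mirrors the Collect step of Algorithm 2.

#include <mpi.h>
#include <cstdint>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, procs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    const long long s = 210;                 // plays the role of #R
    const int master = 0;

    // Interleaved division: process q handles indices q, q + m, q + 2m, ...
    unsigned long long sum = 0;
    for (long long i = rank; i < s; i += procs)
        sum += (unsigned long long)((i + 1) * (i + 1));   // dummy term

    if (rank == master) {
        // Collect: the master adds the partial sums of all slaves to its own.
        for (int slave = 1; slave < procs; ++slave) {
            unsigned long long received = 0;
            MPI_Recv(&received, 1, MPI_UNSIGNED_LONG_LONG, slave, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += received;
        }
        std::cout << "total = " << sum << std::endl;
    } else {
        MPI_Send(&sum, 1, MPI_UNSIGNED_LONG_LONG, master, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The same collection could also be written as a single MPI_Reduce call; we keep the explicit Send/Recv loop here because it matches Algorithm 2 one-to-one.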
V. Configuration
The Raspberry Pi [3] is our platform of choice
for running the aforementioned algorithm. This mini
computer has a lot on board, see Figure 3. The CPU
runs at a frequency of 700 MHz, which does not give
us much processing power. In order to gain some
extra performance we overclocked all the Pi's to
800 MHz. The Pi's dynamically change their clock
speed, depending on the load: while idling, for
example, the clock speed is reduced back to 700 MHz.
There are limitations regarding memory: we need to
make sure our algorithm does not take more than the
512 MB of RAM available. Each Pi has an SD card
attached to it, which serves as internal storage, much
like the hard drive in a desktop. For more information
regarding our set-up, see Appendix A, which contains
an in-depth manual for recreating the complete process
of building the supercomputer.
Figure 3 – Raspberry Pi
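As a side note of ours (the paper does not spell this out): on Raspbian the overclock can be applied through the raspi-config tool, which writes the chosen ARM frequency to /boot/config.txt, for example:

# /boot/config.txt (excerpt) -- raise the maximum ARM clock from 700 MHz to 800 MHz
arm_freq=800

The dynamic scaling back to 700 MHz when idle, mentioned above, is handled by the operating system's CPU frequency scaling and needs no extra configuration.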
The set-up consists of six Raspberry Pi's
connected to a switch. We configure one Pi and make
an image, which we then clone to the SD cards of the
other Pi's. We use Raspbian, a lightweight operating
system optimized for the Raspberry Pi and based on
the Linux distribution Debian. With an internet
connection we install MPI using the Advanced Packaging
Tool. Finally we give the Pi a static IP address. Once
the first Pi is configured we can easily clone the image
to the SD cards of the other Pi's, which vastly reduces
the time needed to configure all of them. The last task
remaining is changing the static IP addresses on the
other Pi's and renaming them. Once this is all done
we can start running the algorithm. An example of
the network can be seen in Figure 4.
Figure 4 – Network: six Raspberry Pi's (Robb, Jon, Arya, Sansa, Brandon and Rickon), each running at ~800 MHz, with static IP addresses 100.0.0.11 up to 100.0.0.16, connected through a single switch
VI. Results
We were able to calculate values of #AMF(n)
up until n = 7. We mentioned before that, in the
literature, #AMF(8) has been calculated with the same
algorithm we used. We did not succeed in doing so,
and we suspect that there are multiple reasons for
this. Foremost, we implemented our version in C++
to be able to use MPI for message passing. Although
our code is an almost direct translation of the existing
Java implementation, it runs slower on the same
hardware. To calculate #AMF(7), the Java version takes
on average 5.4 seconds on a MacBook Pro, whereas
the C++ version takes about 13.2 seconds on the
same machine. It has been argued that languages of
the C family are in general faster than Java. However,
because we had no prior knowledge of C++
when we started this project, we suspect that our
code contains a lot of beginner mistakes, which might
be causing the decrease in performance. Secondly,
we used Raspberry Pi circuit boards to execute the
parallel algorithm, which are a lot less powerful than
the average laptop or desktop computer. A single
Raspberry Pi needs on average 119.92 seconds to
calculate #AMF(7), which is about 9 times slower than
our MacBook Pro.
Raspberry Pi's:      1        2        3        4        5        6
n = 5 (s):      0.0297   0.0426   0.0415   0.0404   0.0400   0.0400
n = 6 (s):       0.394    0.292    0.268    0.256    0.249    0.238
n = 7 (s):       119.9     76.9     61.8     54.1     50.1     46.6
Figure 5 – Running Times (with MPI): average running time in seconds as a function of the number of Raspberry Pi's
#Pi = 5           #Pi = 6
Average = 6.833   Average = 7.167
      7                 8
      9                 7
     10                10
      4                 7
      1                 4
     10                 5
Table 1 – Balance for n = 7: T(last) − T(first) (seconds)
More interesting, however, is what happens
when more Pi's are added to the network. In Figure
5, we present measurements of the running times
for #AMF(5), #AMF(6) and #AMF(7) as a function
of the number of Raspberry Pi's used. The indicated
values are averages over multiple runs. A metric
commonly used in the field of parallel computing is the
speedup. It measures the relative performance
increase and is defined as follows:

Sp = T1 / Tp    (1)

where T1 is the total running time on a single
processor and Tp is the running time on p processors.
When we use 2 Raspberry Pi's to calculate #AMF(7),
for example, the running time decreases to 76.9
seconds, which corresponds to a speedup S2 of about
1.56. When we increase p, the speedup Sp increases
as well. If we use all six Pi's, it takes only 46.6
seconds (S6 ≈ 2.57) to calculate #AMF(7). Note
that there is a lower bound on the running time,
because a portion of the work has to be performed by
every node in the network. This includes the first
step of the algorithm, which on average takes about
4.31 seconds for n = 7, and the enumeration of
AMF(n − 2). For #AMF(6), we find similar results
but with a greater variance in the measurements. For
#AMF(5), however, we note that the algorithm is
faster when executed on a single node. We suspect
that this is because it takes more time to send a
message over the network than to calculate #AMF(5).
We stated that we achieve good results in terms
of balancing. To measure this we have used the
following metric:

T(last) − T(first)    (2)

where T(last) is the highest individual running time and
T(first) the lowest individual running time. Results for
n = 7 can be found in Table 1. Both of the averages
are about 15% of the total running time, which
we consider a good result.
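For completeness, the sketch below shows one way such per-node timings can be obtained (our illustration, not the code used for Table 1): every process measures its own running time with MPI_Wtime, and the master extracts the extremes with two MPI_Reduce calls. The do_work body is only a placeholder for the per-process part of Step 2.

#include <mpi.h>
#include <cstdio>

// Placeholder workload; in the real application this would be the
// per-process part of Step 2 of Algorithm 2.
static void do_work() {
    for (volatile long i = 0; i < 10000000L; ++i) {}
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    do_work();
    double elapsed = MPI_Wtime() - start;

    double tFirst = 0.0, tLast = 0.0;
    MPI_Reduce(&elapsed, &tFirst, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tLast, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("balance T(last) - T(first) = %.3f s\n", tLast - tFirst);

    MPI_Finalize();
    return 0;
}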
VII. Conclusion
We introduced a parallel version of an existing
algorithm. We divided the workload in a fairly simple
manner, but nevertheless obtained some interesting
results executing an implementation on a home-made
supercluster built from cheap materials. We conclude
that the algorithm presented in [1] for calculating
Dedekind numbers allows for easy parallelisation,
as was expected. Secondly, we believe we have
shown that the enterprise of building a small-scale
supercomputer with cheap components is a feasible and
rewarding one. That said, we did go through some
frustration configuring everything, and hope that our
effort of writing the process down will prove useful to
people engaging in future, similar projects.
VIII. Further research
Given the simplicity of the approach used,
further investigation into how to increase the speedup
even further may prove useful. Right now a lot of work
still has to be performed by every node in the network.
Because everything was constructed with MPI and
a general number of processors in mind, our
implementation remains entirely functional when executed
on different hardware, provided that it is properly
configured. Other than that, we suggest a refactoring
of the code. We suspect that the running times could
be drastically improved if the code were handed to
an experienced C++ programmer. Our code can be
found at: http://www.student.kuleuven.be/
~r0296224/RPI_AMF_2014/.
IX. Acknowledgements
We would like to thank Patrick De Causmaecker
and Stefan De Wannemacker for their assistance and
guidance. Furthermore, we would like to thank the
Raspberry Pi Foundation for making these kinds of
projects possible.
References
[1] Patrick De Causmaecker and Stefan De Wannemacker.
Decomposition of intervals in the space
of anti-monotonic functions. In Proceedings of
the Tenth International Conference on Concept
Lattices and Their Applications (CLA 2013), La
Rochelle, France, October 15-18, 2013, volume
1062, pages 57-67, 2013.
[2] Richard Dedekind. Über Zerlegungen
von Zahlen durch ihre grössten gemeinsamen
Theiler. Bücherei der Technischen Hochschule
Braunschweig, 1897.
[3] Raspberry Pi Foundation. http://www.
raspberrypi.org, 2009.
[4] MPICH. Message passing interface. http://
www.mpich.org, 1992.
[5] Simon J. Cox, James T. Cox, Richard P. Boardman,
Steven J. Johnston, Mark Scott, and Neil S. O'Brien.
Iridis-pi: a low-cost, compact demonstration cluster.
Springer US, 2013.
A. Manual
This is an in-depth guide which explains how we set up the Message Passing Interface on multiple Raspberry
Pi's. No prior knowledge is required to follow the step-by-step guide. Those who are not interested
in the technical details can safely skip this part.
1. First of all, get an operating system for your Raspberry Pi. Head to http://www.raspberrypi.
org/downloads/ and grab one. Preferably Raspbian, but any other Linux distribution will work.
2. Once the image with a fresh operating system is downloaded, you need to install it on the
SD card. This can be done with a tool for making a bootable SD card, like PiWriter
(OS X) or Win32DiskImager (Windows).
3. After the image is installed, put the SD card into the Pi and boot it. Now access to the internet is
required to install updates and other packages. There are a few different approaches.
• Use an HDMI-capable television or monitor and a keyboard. Plug the network cable from the
router directly into the Pi.
• If a wireless connection is present, it can be used to share your internet connection over the LAN port
of your computer. To do so on OS X, go to: System Preferences → Sharing → Internet Sharing.
Make sure the Wi-Fi is shared to the correct port. After the internet is shared, connect to the Pi
with ssh. This way a keyboard and a screen are optional.
How to connect with SSH.
(a) First of all, find the IP address of the Pi. If internet sharing is used, we can look in the routing
table to see which IP address the router has given to the Pi. Use the command netstat -rn -f inet
to get access to the routing table. Now take a look at the bridge and gateway of the Pi. In
Figure 6, the IP address of the Pi can easily be found, namely 192.168.2.2.
(b) With the IP address, connect to the Pi using the command:
ssh -X pi@192.168.2.2. The -X flag is optional and only needed if you plan to use
the GUI of Raspbian. It allows you to start an lxsession; in order to do so, make sure Xterm is
installed.
(c) If the connection is successful, you receive a prompt asking for the password. The default
password is raspberry. On Windows, follow the same procedure using PuTTY to establish the
ssh connection.
Figure 6 – Routing table
4. Now we have access to our Pi, which has an internet connection. Install updates and MPI. In order
to do so, type sudo apt-get update and sudo apt-get upgrade in the console. This gives the
latest version of everything installed on the Pi. Next, simply type sudo apt-get install mpich2
to install MPI on the Pi.
5. It is best to give the Pi a static IP address, to avoid extra difficulties. Changing the IP configuration
can be accomplished in /etc/network/interfaces. Open the file interfaces with
your favourite text editor and administrator rights, and change it into something like Listing 1.
auto lo
iface lo inet loopback

iface eth0 inet static
address The IP address you would like to give to the Pi
netmask The netmask
broadcast broadcast IP address
gateway The gateway

allow-hotplug wlan0
iface wlan0 inet manual
wpa-roam /etc/wpa_supplicant/wpa_supplicant.conf
iface default inet dhcp
Listing 1 – Interfaces
6. (optional) The name of the Pi can also be changed. To do so, modify the files hosts and
hostname found in /etc.
7. Right now one Pi is completely configured. If several Pi's need to be configured, a lot of time
can be saved by making a clone of this Pi and installing it on the other SD cards. The code in listing 2
shows how to make an image of a fully configured SD card. There are also tools available online
for making an image of an SD card.
Maxs-MacBook-Pro:~ maxdekoninck$ mkdir /raspberry-pi
Maxs-MacBook-Pro:~ maxdekoninck$ mkdir /raspberry-pi/backups
Maxs-MacBook-Pro:~ maxdekoninck$ cd /raspberry-pi/backups
Maxs-MacBook-Pro:backups maxdekoninck$ df -h
Maxs-MacBook-Pro:backups maxdekoninck$ diskutil unmount /dev/disk1s1
Maxs-MacBook-Pro:backups maxdekoninck$ dd if=/dev/rdisk1 of=/raspberry-pi/backups/wheezy-todaysdate-
backup.img bs=1m
Maxs-MacBook-Pro:backups maxdekoninck$ diskutil eject /dev/rdisk1
Listing 2 – Making an image
8. At this point the image contains a clone of the first Pi. The same procedure as described in step 2 can
be used to install the image.
9. The only thing left to do is changing the static IP and (optionally) the name. This is described in
steps 5 and 6.
10. Once all the Pi's are configured and connected to the switch, it is possible to decide which Pi will be
the master. MPI relies on ssh, which means that for each connection you would get a prompt asking for
the user and the password. The intent is to skip these prompts by setting up ssh keys. To generate an
ssh key pair and store it in a file, proceed as in listing 3. Afterwards the master's public key needs to be
added to each Pi, as described in listing 4.
pi@robb ~ $ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/pi/.ssh/id_rsa): (nothing)
/home/pi/.ssh/id_rsa already exists.
Overwrite (y/n)? (y)
Enter passphrase (empty for no passphrase): (nothing)
Enter same passphrase again: (nothing)
Your identification has been saved in /home/pi/.ssh/id_rsa.
Your public key has been saved in /home/pi/.ssh/id_rsa.pub.
The key fingerprint is:
c5:85:58:cd:53:93:3f:06:38:bf:f2:59:33:c5:07:2f pi@robb
The key’s randomart image is:
+--[ RSA 2048]----+
| o.+o.o. |
| ...++.o. |
| o o..= |
| . .Eo*|
| S o.+|
| . . + |
| o o o|
| o |
| |
+-----------------+
Listing 3 – Make ssh key
pi@robb ~ $ ssh pi@100.0.0.12 mkdir -p .ssh
pi@100.0.0.12's password: (raspberry)
pi@robb ~ $ cat .ssh/id_rsa.pub | ssh pi@100.0.0.12 'cat >> .ssh/authorized_keys'
pi@100.0.0.12's password: (raspberry)
Listing 4 – Add ssh key
11. Finally, MPI can be tested on the system.
(a) Start off by making a machine file. This file contains all the IP addresses of the computers we
want to use MPI on. An example can be found in Table 2.
100.0.0.11:1 = robb
100.0.0.12:1 = jon
100.0.0.13:1 = sansa
100.0.0.14:1 = arya
100.0.0.15:1 = brandon
100.0.0.16:1 = rickon
Table 2 – machine file
(b) Next up, add the source code to each node. This can easily be achieved with the command in listing 5.
pi@robb ~/temp $ scp -r folder pi@IP:destination
Listing 5 – Copy
(c) Now each Pi has the code and we have the machine file. Finally, compile and run the code as
shown in listing 6. The -std=c++0x flag allows us to use range-based for loops. Compiler
optimization can be enabled by adding the -O3 flag, which results in a faster run time but a slower compile time.
pi@robb ~/temp $ mpicxx cppfile -std=c++0x -o outputfile
pi@robb ~/temp $ sudo mpiexec -np 2 -machinefile machinefile ./outputfile
Listing 6 – Compile and run
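Before launching the real application it can be reassuring to run a tiny test program first (our suggestion, not part of the original workflow). It prints the rank and the host name of every process, so one can verify that the job really spans all the Pi's listed in the machine file:

#include <mpi.h>
#include <cstdio>

// Minimal cluster test: each process reports its rank and the machine it runs on.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0, len = 0;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    std::printf("process %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

Compiled with mpicxx and launched with mpiexec -np 6 -machinefile machinefile, every node name should appear exactly once in the output.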