CALCULATING DEDEKIND NUMBERS WITH A 6-NODE
SUPERCOMPUTER
KULAK, KU Leuven, Kortrijk, Belgium
Daan Seynaeve, Max Dekoninck
Computer Science
E-mail: daan.seynaeve@student.kuleuven.be, max.dekoninck@student.kuleuven.be
Abstract
This paper introduces a parallel version of an existing algorithm to compute Dedekind numbers. By
their nature, these numbers form a very rapidly increasing sequence, of which the exact values are only
known for the first eight numbers. Test results of this algorithm on a 6-node supercomputer built with
Raspberry Pi circuit boards are presented. These results show that even when using a fairly simple
approach, the speedup is significant.
Introduction
Dedekind numbers, named after Richard
Dedekind, have proven notoriously challenging
to calculate. However, the calculation of the eighth
Dedekind number within reasonable time bounds [1]
has led us to believe that it is worth examining how
well existing procedures can be parallelized, especially
with an eye on an attempt at calculating the next, at
the time of writing unknown, 9th number. To this
end, we interconnected several Raspberry Pi circuit
boards, which are a very cheap, but less powerful
alternative to full-blown desktop computers. We start
off with an informal description of the problem space,
followed by a quick tour of the specific algorithm we
used. We avoid going into much detail and rather
suggest [?] for a more rigorous definition of the
problem space and [1] for the algorithm. We explain
how we used message passing to divide the workload
in a balanced manner and present an updated, parallel
version of the algorithm. Afterwards we elaborate on
the hardware we used (Raspberry Pi Model B) and
our network set-up. We provide a step-by-step
walkthrough of our set-up, which we hope will allow
the reader to recreate the network. Lastly, we report
on our results, which, while not that impressive in
terms of raw speed, are a clear indicator that it pays
off to execute the mentioned algorithm in parallel.
I. Problem and Definitions
The problem was first posed by Richard
Dedekind [2] and there are a number of different ways
in which it can be formulated. One of them is the
following: consider a set N of elements, labeled
1, 2, ..., n for simplicity. If we look at all the possible
subsets of this set and select a number of them, such
a selection can have a particular property: if a set is
selected, no subset of it may be selected. Throughout
this paper, we refer to this kind of selection as
an Anti-Monotonic Function. An example for n =
3 is given in Figure 1, where the selected sets are
highlighted. The Dedekind number is the numerical
answer to the question: how many of these Anti-
Monotonic Functions are there? With AMF(n) we
denote the set of Anti-Monotonic Functions with n
elements. Using this notation, the n-th Dedekind
number is given by #AMF(n).
Figure 1 – An Anti-Monotonic Function for n = 3: the Hasse diagram of all subsets of {1,2,3}, with the selected sets highlighted
Within AMF(n), two relations are particularly
important. One is a partial order, which we denote
with ≤:
Assume α, β ∈ AMF(n); then α ≤ β if every set of
α is a subset of some set in β.
The other one is an equivalence relation:
Assume α, β ∈ AMF(n); then α is equivalent to β if
there exists a permutation π of the elements of N
such that β = {π(X) | X ∈ α}.
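Because the definition is so compact, a brute-force check is possible for very small n and helps to build intuition. The C++ sketch below is our own illustration and not the algorithm discussed in this paper: it represents each subset of N as a bitmask, each selection of subsets as a bitmask over those bitmasks, and rejects every selection in which a selected set contains another selected set. For n = 3 it prints 20, in line with the known start of the sequence 2, 3, 6, 20, 168, 7581, ...

#include <cstdint>
#include <iostream>

// Brute-force count of Anti-Monotonic Functions (illustration only):
// every subset of N = {1,...,n} is a bitmask of n bits, and every
// selection of subsets is a bitmask of 2^n bits. Feasible for n <= 4,
// since there are 2^(2^n) selections to check.
int main() {
    const int n = 3;
    const int numSubsets = 1 << n;                    // 2^n subsets of N
    std::uint64_t count = 0;
    for (std::uint64_t sel = 0; sel < (1ULL << numSubsets); ++sel) {
        bool antiMonotonic = true;
        for (int a = 0; a < numSubsets && antiMonotonic; ++a) {
            if (!((sel >> a) & 1)) continue;          // subset a not selected
            for (int b = 0; b < numSubsets; ++b) {
                if (b == a || !((sel >> b) & 1)) continue;
                if ((a & b) == a) {                   // a is a proper subset of b
                    antiMonotonic = false;
                    break;
                }
            }
        }
        if (antiMonotonic) ++count;
    }
    std::cout << "#AMF(" << n << ") = " << count << std::endl;
    return 0;
}

The 2^(2^n) possible selections make this approach hopeless beyond n = 4, which is exactly why the interval-based algorithm of the next section is needed.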
Algorithm 1 Calculating the (n + 2)-th Dedekind number by enumerating AMF(n)
Require: n ∈ N, n ≥ 0, N = {1, 2, ..., n}
function PC2-Dedekind(n + 2)
Step 1:
Compute the set R of AMF's that are unique under permutation of the elements, and calculate the sizes of the
corresponding equivalence classes as Count : R → N.
For each element r of R, calculate the size of the interval [∅, r] as L-Size : R → N and the size of the
interval [r, {N}] as R-Size : R → N.
Step 2:
sum ← 0
for r0 ∈ AMF(n) do
partialsum ← 0
for rN ∈ R, rN ≤ r0 do
partialsum ← partialsum + Count(rN ) ∗ P-Coeff(r0, rN ) ∗ L-Size(rN )
end for
sum ← sum +(partialsum ∗ R-Size(r0))
end for
return sum
end function
II. The Algorithm
The algorithm for calculating #AMF(m) for
some m is shown above as Algorithm 1. Once again,
we refer to [?] and [1] for a more precise discussion
of this algorithm and the underlying concepts, such
as the intervals and P-Coefficients. We did, however,
include it for comparison with the parallel version
we introduce later on. The algorithm works as
follows: in the first step, equivalence classes for
AMF(m − 2) are generated, each represented by one
of their elements. The equivalence relation used is the
one discussed earlier. For each representative, the size
of the corresponding equivalence class is calculated,
as well as the left and right interval sizes. In the
second step, this information is used to calculate the
size of AMF(m) with a double sum. The outer one
sums over AMF(m − 2), while the inner one sums over
the elements of R. An implementation of this algorithm
exists in Java. It performs well, being able
to calculate #AMF(8) in just 40 hours. It is worth
noting that the second step is computationally the
most time-consuming, because it consists of calculating
and summing a large number of terms. For m = 7,
#AMF(m − 2) = 7581 and #R = 210, so we have to
sum 7581 ∗ 210 = 1,592,010 terms. However, these
terms can be calculated individually and summed
together in groups. For the remainder of this paper,
we focus on doing exactly that.
III. Message Passing Interface
One of the simplest approaches to parallel
computation is the master-slave paradigm. The main
computation branch, called the master, generates
subproblems and orders the slaves to solve these.
While there exist a great number of other paradigms,
we have chosen this simple one because we use a
relatively small, closely monitored cluster. To
communicate between the nodes, we use MPI [4], which is
a well-defined standard interface for interaction between
different computers on a network via message passing.
Although it gives the programmer a library with
communication functions, the connection relies on
SSH; therefore both an IP address and a password are
needed to connect one computer to another. A
limitation is that MPI bindings are readily available only
for C-family languages, Fortran and (through third-party
libraries) Python.
The first thing that is done in an MPI-based
application is initializing MPI and checking what
processing power is available. This usually means the
number of nodes in a network or, when executed on
a single machine, the number of available cores.
Because this can differ from time to time, a machine file
is included. The machine file contains all the IP
addresses and the number of cores available at each
address. When the application is executed, MPI checks
which addresses in the machine file are accessible.
Because of this, we have written the parallelized version
of the algorithm with a general number of processors
in mind.
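As an illustration of this start-up phase, the minimal C++/MPI program below (ours, not taken from the implementation described in this paper) initializes MPI and queries the rank of the current process and the total number of processes that were made available through the machine file:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                  // must be called before any other MPI call

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // id of this process: 0 .. size-1
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes

    if (rank == 0)
        std::cout << "running on " << size << " process(es)" << std::endl;

    MPI_Finalize();
    return 0;
}

The rank is what allows each process to decide, without any communication, which part of the work is its own.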
Algorithm 2 MPI-embedded version of Step 2 of Algorithm 1
Require: q = rank of the process, m = number of processes
Let ρ0, ..., ρs−1 denote the elements of R, thus s = #R
Step 2:
sum ← 0
master ← 0
for r0 ∈ AMF(n) do
partialsum ← 0
i ← q
while i < s do
if ρi ≤ r0 then
partialsum ← partialsum + Count(ρi) ∗ P-Coeff(r0, ρi) ∗ L-Size(ρi)
end if
i ← i + m
end while
sum ← sum +(partialsum ∗ R-Size(r0))
end for
Collect:
if q = master then
for slave ∈ {1, ..., m − 1} do
MPI_Recv(received, slave)
sum ← sum + received
end for
else (slave)
MPI_Send(sum, master)
end if
IV. MPI-Embedded Algorithm
We previously noted that the second step of
Algorithm 1 is computationally the heaviest. For this
reason, we decided to focus on this step, which is
essentially one big for-loop. The common solution would
be to split the range into several parts, have each
Raspberry Pi calculate a partial sum, and then add all
of these together. For a balanced division of the
workload, we require parts that do not differ much in size.
In the outer loop we enumerate AMF(n), which is
done by iterating over the interval [∅, {N}]. Decomposing
this interval directly into parts is possible [1],
but it is hard to generate equally sized parts. Another
option is to let the master enumerate m elements
of AMF(n), have each of the m processors
evaluate the inner loop for one element, and continue
with enumerating the next m. This would introduce
communication overhead, however, as the #AMF(n)
elements would have to be transmitted. To avoid this,
each processor could enumerate all the elements of
AMF(n), but only evaluate once every m times.
Instead of focusing on the outer loop, we decided to
parallelize the inner loop. This is a much simpler
task, since the range is the set R, which we have
calculated in step 1. The drawback is that some work
is performed multiple times, like the enumeration of
AMF(n).
The question that remains is how to divide R.
Perhaps the simplest solution is to cut the set into m
parts, have one Pi evaluate the first #R/m elements,
another one the next #R/m ones, and so on. In our
opinion, a more elegant solution is to interleave the
evaluations, as illustrated in Figure 2.
Figure 2 – Division of R: the elements are dealt out to the m processes in an interleaved fashion (the remainder goes to the first processes)
This has the added benefit that every process
evaluates all kinds of elements of R, both from
the beginning and the end. Should a particular group of
elements be easier to calculate, for example the ones
at the beginning, our approach would arguably
produce a more balanced division of the workload. One
can of course imagine specific kinds of clustering for
which our approach performs very badly, but in most of
those cases adding or removing a processor from the
network would solve the problem. After each Raspberry
Pi has calculated its part of the sum, it sends it
to the master, which is responsible for adding the
partial sums together into the final result: #AMF(n + 2).
The master has to wait for the consecutive responses
from the slaves, so the total running time is the
maximum of the individual running times. This leads to
Algorithm 2. In practice, our approach yields good
results in terms of balancing.
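A stripped-down C++/MPI sketch of this scheme is shown below. It is our illustration of the structure of Algorithm 2 rather than the actual implementation: the dummy terms stand in for the Count · P-Coeff · L-Size products, s plays the role of #R, and the collection at the master mirrors the Collect step of Algorithm 2.

#include <mpi.h>
#include <cstdint>
#include <iostream>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, procs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    const long long s = 210;                 // plays the role of #R
    const int master = 0;

    // Interleaved division: process q handles indices q, q + m, q + 2m, ...
    unsigned long long sum = 0;
    for (long long i = rank; i < s; i += procs)
        sum += (unsigned long long)((i + 1) * (i + 1));   // dummy term

    if (rank == master) {
        // Collect: the master adds the partial sums of all slaves to its own.
        for (int slave = 1; slave < procs; ++slave) {
            unsigned long long received = 0;
            MPI_Recv(&received, 1, MPI_UNSIGNED_LONG_LONG, slave, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += received;
        }
        std::cout << "total = " << sum << std::endl;
    } else {
        MPI_Send(&sum, 1, MPI_UNSIGNED_LONG_LONG, master, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The same collection could also be written as a single MPI_Reduce call; we keep the explicit Send/Recv loop here because it matches Algorithm 2 one-to-one.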
V. Configuration
The Raspberry Pi [3] is our platform of choice
for running the aforementioned algorithm. This mini
computer has a lot on board, see Figure 3. The CPU
runs at a frequency of 700 MHz, which does not give
us much processing power. In order to gain some
extra performance we overclocked all the Pi's to
800 MHz. The Pi's dynamically change their clock
speed, depending on the load: while idling, for
example, the clock speed is reduced back to 700 MHz.
There are limitations regarding memory: we need to
make sure our algorithm does not take more than the
512 MB of RAM available. Each Pi has an SD card
attached to it, which serves as internal storage, much
like the hard drive in a desktop. For more information
regarding our set-up, see Appendix A, which contains
an in-depth manual for recreating the complete process
of building the supercomputer.
Figure 3 – Raspberry Pi
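As a side note of ours (the paper does not spell this out): on Raspbian the overclock can be applied through the raspi-config tool, which writes the chosen ARM frequency to /boot/config.txt, for example:

# /boot/config.txt (excerpt) -- raise the maximum ARM clock from 700 MHz to 800 MHz
arm_freq=800

The dynamic scaling back to 700 MHz when idle, mentioned above, is handled by the operating system's CPU frequency scaling and needs no extra configuration.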
The set-up consists of six Raspberry Pi's
connected to a switch. We configure one Pi and make
an image, which we then clone to the SD cards of the
other Pi's. We use Raspbian, a lightweight operating
system optimized for the Raspberry Pi and based on
the Linux distribution Debian. With an internet
connection we install MPI using the Advanced Packaging
Tool. Finally we give the Pi a static IP address. Once
the first Pi is configured we can easily clone the image
to the SD cards of the other Pi's, which vastly reduces
the time needed to configure all of them. The last task
remaining is changing the static IP addresses on the
other Pi's and renaming them. Once this is all done
we can start running the algorithm. An example of
the network can be seen in Figure 4.
Figure 4 – Network: six Raspberry Pi's (Robb, Jon, Arya, Sansa, Brandon and Rickon), each running at ~800 MHz, with static IP addresses 100.0.0.11 up to 100.0.0.16, connected through a single switch
VI. Results
We were able to calculate values of #AMF(n)
up until n = 7. We mentioned before that, in the
literature, #AMF(8) has been calculated with the same
algorithm we used. We did not succeed in doing so,
and we suspect that there are multiple reasons for
this. Foremost, we implemented our version in C++
to be able to use MPI for message passing. Although
our code is an almost direct translation of the existing
Java implementation, it runs slower on the same
hardware. To calculate #AMF(7), the Java version takes
on average 5.4 seconds on a MacBook Pro, whereas
the C++ version takes about 13.2 seconds on the
same machine. It has been argued that languages of
the C family are in general faster than Java. However,
because we had no prior knowledge of C++
when we started this project, we suspect that our
code contains a lot of beginner mistakes, which might
be causing the decrease in performance. Secondly,
we used Raspberry Pi circuit boards to execute the
parallel algorithm, which are a lot less powerful than
the average laptop or desktop computer. A single
Raspberry Pi needs on average 119.92 seconds to
calculate #AMF(7), which is about 9 times slower than
our MacBook Pro.
Raspberry Pi's:      1        2        3        4        5        6
n = 5 (s):      0.0297   0.0426   0.0415   0.0404   0.0400   0.0400
n = 6 (s):       0.394    0.292    0.268    0.256    0.249    0.238
n = 7 (s):       119.9     76.9     61.8     54.1     50.1     46.6
Figure 5 – Running Times (with MPI): average running time in seconds as a function of the number of Raspberry Pi's
#Pi = 5           #Pi = 6
Average = 6.833   Average = 7.167
      7                 8
      9                 7
     10                10
      4                 7
      1                 4
     10                 5
Table 1 – Balance for n = 7: T(last) − T(first) (seconds)
More interesting, however, is what happens
when more Pi's are added to the network. In Figure
5, we present measurements of the running times
for #AMF(5), #AMF(6) and #AMF(7) as a function
of the number of Raspberry Pi's used. The indicated
values are averages over multiple runs. A metric
commonly used in the field of parallel computing is the
speedup. It measures the relative performance
increase and is defined as follows:

Sp = T1 / Tp    (1)

where T1 is the total running time on a single
processor and Tp is the running time on p processors.
When we use 2 Raspberry Pi's to calculate #AMF(7),
for example, the running time decreases to 76.9
seconds, which corresponds to a speedup S2 of about
1.56. When we increase p, the speedup Sp increases
as well. If we use all six Pi's, it takes only 46.6
seconds (S6 ≈ 2.57) to calculate #AMF(7). Note
that there is a lower bound on the running time,
because a portion of the work has to be performed by
every node in the network. This includes the first
step of the algorithm, which on average takes about
4.31 seconds for n = 7, and the enumeration of
AMF(n − 2). For #AMF(6), we find similar results
but with a greater variance in the measurements. For
#AMF(5), however, we note that the algorithm is
faster when executed on a single node. We suspect
that this is because it takes more time to send a
message over the network than to calculate #AMF(5).
We stated that we achieve good results in terms
of balancing. To measure this we have used the
following metric:

T(last) − T(first)    (2)

where T(last) is the highest individual running time and
T(first) the lowest individual running time. Results for
n = 7 can be found in Table 1. Both of the averages
are about 15% of the total running time, which
we consider a good result.
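For completeness, the sketch below shows one way such per-node timings can be obtained (our illustration, not the code used for Table 1): every process measures its own running time with MPI_Wtime, and the master extracts the extremes with two MPI_Reduce calls. The do_work body is only a placeholder for the per-process part of Step 2.

#include <mpi.h>
#include <cstdio>

// Placeholder workload; in the real application this would be the
// per-process part of Step 2 of Algorithm 2.
static void do_work() {
    for (volatile long i = 0; i < 10000000L; ++i) {}
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    do_work();
    double elapsed = MPI_Wtime() - start;

    double tFirst = 0.0, tLast = 0.0;
    MPI_Reduce(&elapsed, &tFirst, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&elapsed, &tLast, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("balance T(last) - T(first) = %.3f s\n", tLast - tFirst);

    MPI_Finalize();
    return 0;
}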
VII. Conclusion
We introduced a parallel version of an existing
algorithm. We divided the workload in a fairly simple
manner, but nevertheless obtained some interesting
results executing an implementation on a home-made
supercluster built from cheap materials. We conclude
that the algorithm presented in [1] for calculating
Dedekind numbers allows for easy parallelisation,
as was expected. Secondly, we believe we have
shown that the enterprise of building a small-scale
supercomputer with cheap components is a feasible and
rewarding one. That said, we did go through some
frustration configuring everything, and hope that our
effort of writing the process down will prove useful to
people engaging in future, similar projects.
VIII. Further research
Given the simplicity of the approach used,
further investigation into how to increase the speedup
even further may prove useful. Right now a lot of work
still has to be performed by every node in the network.
Because everything was constructed with MPI and
a general number of processors in mind, our
implementation remains entirely functional when executed
on different hardware, provided that it is properly
configured. Other than that, we suggest a refactoring
of the code. We suspect that the running times could
be drastically improved if the code were handed to
an experienced C++ programmer. Our code can be
found at: http://www.student.kuleuven.be/
~r0296224/RPI_AMF_2014/.
IX. Acknowledgements
We would like to thank Patrick De Causmaecker
and Stefan De Wannemacker for their assistance and
guidance. Furthermore, we would like to thank the
Raspberry Pi Foundation for making these kinds of
projects possible.
References
[1] Patrick De Causmaecker and Stefan De Wannemacker.
Decomposition of intervals in the space
of anti-monotonic functions. In Proceedings of
the Tenth International Conference on Concept
Lattices and Their Applications (CLA 2013), La
Rochelle, France, October 15-18, 2013, volume
1062, pages 57-67, 2013.
[2] Richard Dedekind. Über Zerlegungen
von Zahlen durch ihre grössten gemeinsamen
Theiler. Bücherei der Technischen Hochschule
Braunschweig, 1897.
[3] Raspberry Pi Foundation. http://www.
raspberrypi.org, 2009.
[4] MPICH. Message passing interface. http://
www.mpich.org, 1992.
[5] Simon J. Cox, James T. Cox, Richard P. Boardman,
Steven J. Johnston, Mark Scott, and Neil S. O'Brien.
Iridis-pi: a low-cost, compact demonstration cluster.
Springer US, 2013.
A. Manual
This is an in-depth guide which explains how we set up the Message Passing Interface on multiple Raspberry
Pi's. No prior knowledge is required to follow the step-by-step guide. Those who are not interested
in the technical details can safely skip this part.
1. First of all, get an operating system for your Raspberry Pi. Head to http://www.raspberrypi.
org/downloads/ and grab one. Preferably Raspbian, but any other Linux distribution will work.
2. Once the image with a fresh operating system is downloaded, you need to install it on the
SD card. This can be done with a tool for making a bootable SD card, like PiWriter
(OS X) or Win32DiskImager (Windows).
3. After the image is installed, put the SD card into the Pi and boot it. Now access to the internet is
required to install updates and other packages. There are a few different approaches.
• Use an HDMI-capable television or monitor and a keyboard. Plug the network cable from the
router directly into the Pi.
• If a wireless connection is present, it can be used to share your internet connection over the LAN port
of your computer. To do so on OS X, go to: System Preferences → Sharing → Internet Sharing.
Make sure the Wi-Fi is shared to the correct port. After the internet is shared, connect to the Pi
with ssh. This way a keyboard and a screen are optional.
How to connect with SSH.
(a) First of all, find the IP address of the Pi. If internet sharing is used, we can look in the routing
table to see which IP address the router has given to the Pi. Use the command netstat -rn -f inet
to get access to the routing table. Now take a look at the bridge and gateway of the Pi. In
Figure 6, the IP address of the Pi can easily be found, namely 192.168.2.2.
(b) With the IP address, connect to the Pi using the command:
ssh -X pi@192.168.2.2. The -X flag is optional and only needed if you plan to use
the GUI of Raspbian. It allows you to start an lxsession; in order to do so, make sure Xterm is
installed.
(c) If the connection is successful, you receive a prompt asking for the password. The default
password is raspberry. On Windows, follow the same procedure using PuTTY to establish the
ssh connection.
Figure 6 – Routing table
4. Now we have access to our Pi, which has an internet connection. Install updates and MPI. In order
to do so, type sudo apt-get update and sudo apt-get upgrade in the console. This gives the
latest version of everything installed on the Pi. Next, simply type sudo apt-get install mpich2
to install MPI on the Pi.
5. It is best to give the Pi a static IP address, to avoid extra difficulties. Changing the IP configuration
can be accomplished in /etc/network/interfaces. Open the file interfaces with
your favourite text editor and administrator rights, and change it into something like Listing 1.
auto lo
iface lo inet loopback

iface eth0 inet static
address The IP address you would like to give to the Pi
netmask The netmask
broadcast broadcast IP address
gateway The gateway

allow-hotplug wlan0
iface wlan0 inet manual
wpa-roam /etc/wpa_supplicant/wpa_supplicant.conf
iface default inet dhcp
Listing 1 – Interfaces
6. (optional) The name of the Pi can also be changed. To do so, modify the files hosts and
hostname found in /etc.
7. Right now one Pi is completely configured. If several Pi's need to be configured, a lot of time
can be saved by making a clone of this Pi and installing it on the other SD cards. The code in listing 2
shows how to make an image of a fully configured SD card. There are also tools available online
for making an image of an SD card.
Maxs-MacBook-Pro:~ maxdekoninck$ mkdir /raspberry-pi
Maxs-MacBook-Pro:~ maxdekoninck$ mkdir /raspberry-pi/backups
Maxs-MacBook-Pro:~ maxdekoninck$ cd /raspberry-pi/backups
Maxs-MacBook-Pro:backups maxdekoninck$ df -h
Maxs-MacBook-Pro:backups maxdekoninck$ diskutil unmount /dev/disk1s1
Maxs-MacBook-Pro:backups maxdekoninck$ dd if=/dev/rdisk1 of=/raspberry-pi/backups/wheezy-todaysdate-
backup.img bs=1m
Maxs-MacBook-Pro:backups maxdekoninck$ diskutil eject /dev/rdisk1
Listing 2 – Making an image
8. At this point the image contains a clone of the first Pi. The same procedure as described in step 2 can
be used to install the image.
9. The only thing left to do is changing the static IP and (optionally) the name. This is described in
steps 5 and 6.
10. Once all the Pi's are configured and connected to the switch, it is possible to decide which Pi will be
the master. MPI relies on ssh, which means that for each connection you would get a prompt asking for
the user and the password. The intent is to skip these prompts by setting up ssh keys. To generate an
ssh key pair and store it in a file, proceed as in listing 3. Afterwards the master's public key needs to be
added to each Pi, as described in listing 4.
pi@robb ~ $ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/pi/.ssh/id_rsa): (nothing)
/home/pi/.ssh/id_rsa already exists.
Overwrite (y/n)? (y)
Enter passphrase (empty for no passphrase): (nothing)
Enter same passphrase again: (nothing)
Your identification has been saved in /home/pi/.ssh/id_rsa.
Your public key has been saved in /home/pi/.ssh/id_rsa.pub.
The key fingerprint is:
c5:85:58:cd:53:93:3f:06:38:bf:f2:59:33:c5:07:2f pi@robb
The key’s randomart image is:
+--[ RSA 2048]----+
| o.+o.o. |
| ...++.o. |
| o o..= |
| . .Eo*|
| S o.+|
| . . + |
| o o o|
| o |
| |
+-----------------+
Listing 3 – Make ssh key
pi@robb ~ $ ssh pi@100.0.0.12 mkdir -p .ssh
pi@100.0.0.12's password: (raspberry)
pi@robb ~ $ cat .ssh/id_rsa.pub | ssh pi@100.0.0.12 'cat >> .ssh/authorized_keys'
pi@100.0.0.12's password: (raspberry)
Listing 4 – Add ssh key
11. Finally, MPI can be tested on the system.
(a) Start off by making a machine file. This file contains all the IP addresses of the computers we
want to use MPI on. An example can be found in Table 2.
100.0.0.11:1 = robb
100.0.0.12:1 = jon
100.0.0.13:1 = sansa
100.0.0.14:1 = arya
100.0.0.15:1 = brandon
100.0.0.16:1 = rickon
Table 2 – machine file
(b) Next up, add the source code to each node. This can easily be achieved with the command in listing 5.
pi@robb ~/temp $ scp -r folder pi@IP:destination
Listing 5 – Copy
(c) Now each Pi has the code and we have the machine file. Finally, compile and run the code as
shown in listing 6. The -std=c++0x flag allows us to use range-based for loops. Compiler
optimization can be enabled by adding the -O3 flag, which results in a faster run time but a slower compile time.
pi@robb ~/temp $ mpicxx cppfile -std=c++0x -o outputfile
pi@robb ~/temp $ sudo mpiexec -np 2 -machinefile machinefile ./outputfile
Listing 6 – Compile and run
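Before launching the real application it can be reassuring to run a tiny test program first (our suggestion, not part of the original workflow). It prints the rank and the host name of every process, so one can verify that the job really spans all the Pi's listed in the machine file:

#include <mpi.h>
#include <cstdio>

// Minimal cluster test: each process reports its rank and the machine it runs on.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0, len = 0;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    std::printf("process %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

Compiled with mpicxx and launched with mpiexec -np 6 -machinefile machinefile, every node name should appear exactly once in the output.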