STUDENT DECLARATION
I, Gábor Apagyi, declare that I have created this diploma thesis without any unauthorized help, using only the specified sources (literature, tools, etc.). Every part that I have taken from other sources, either verbatim or rephrased, is clearly marked with a reference to the source.
I give permission to BME VIK to publish the basic data of this work (author(s), title, abstract in English and Hungarian, year of creation, name of consultants) in an electronic form accessible to everyone, and the full text of the work through the intranet of the university. I declare that the submitted and the electronic versions are identical. The text of theses classified with the permission of the Dean becomes accessible only after three years.
Dated: Budapest, May 12, 2015
...…………………………………………….
Gábor Apagyi
HALLGATÓI NYILATKOZAT
I, the undersigned Apagyi Gábor, graduating student, declare that I have prepared this diploma thesis myself, without unauthorized help, using only the specified sources (literature, tools, etc.). Every part that I have taken from other sources, either verbatim or rephrased with the same meaning, is clearly marked with a reference to the source.
I consent that the basic data of this work (author(s), title, abstract in English and Hungarian, year of creation, name of consultant(s)) be published by BME VIK in a publicly accessible electronic form, and that the full text of the work be made available through the internal network of the university (or to authenticated users). I declare that the submitted work and its electronic version are identical. In the case of theses classified with the permission of the Dean, the text of the thesis becomes accessible only after three years.
Dated: Budapest, May 12, 2015
...…………………………………………….
Apagyi Gábor
Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics
Department of Automation and Applied Informatics
Gábor Apagyi
ANALYSIS OF EXECUTABLE
GRAPH MODEL
CONSULTANTS
Dr. Gergely Mezei
Ferenc Nasztanovics (Morgan Stanley)
Loránd Szöllősi (Morgan Stanley)
BUDAPEST, 2015
Table of contents
Abstract
Összefoglaló
1 Introduction
1.1 The structure
1.2 The domain
1.3 Motivation
2 Theoretical background
2.1 Graph theory
2.1.1 History
2.1.2 Definition
2.1.3 Important properties of graphs
2.1.4 Algorithms
2.1.5 Executable graphs
2.2 Distributed environments
2.2.1 Software solutions
2.2.2 Advantages and drawbacks
2.3 Scheduling
2.3.1 Cost of scheduling
2.3.2 Type of schedulers
2.3.3 Algorithms
2.4 Compilers
2.4.1 Compilation process
2.4.2 Compiler optimization
2.4.3 Higher level languages
2.5 Stack programming
2.5.1 Reverse Polish Notation
2.5.2 Forth
3 Design and implementation
3.1 Business problem
3.1.1 Morgan Stanley
3.1.2 Pricing
3.2 Modelling the problem
3.2.1 Graph representation
3.3 Building up a graph
3.3.1 Quadratic equation
3.3.2 IProcessable interface
3.3.3 Node types
3.3.4 Edge types
3.3.5 Graph object
3.3.6 Example graph
3.4 Simple execution models
3.4.1 Recursion based execution
3.4.2 BFS based execution
3.4.3 Simulating stack execution
3.4.4 Disadvantages of high level modelling
3.5 Execution model
3.5.1 Tokens
3.5.2 Execution
3.6 Compiling
3.6.1 Transforming IProcessable objects
3.6.2 Order of compiling
3.7 Distribution of tasks
3.7.1 Random scheduler
3.7.2 Greedy scheduler
3.7.3 Separation of tasks
3.8 Improving the execution model
3.8.1 Sync and wait operators
3.8.2 Modifications in the compiler
3.8.3 Execution logic modifications
3.9 Visualization
3.10 Measuring methods
3.10.1 Time measurement
3.10.2 Measuring memory usage of the execution
4 Testing and conclusions
4.1 Test environment
Abstract
Our environment changes rapidly these days: today's technologies will be obsolete tomorrow, meadows become metropolises within years, and new islands are even built in the seas. This rapid lifestyle is reflected in everyday life as well. Fast-food restaurants, fast cars, and speed fitness are common words today; in every corner of our life we can feel the rapidity. In the financial sector, agility is even more pronounced. People open and close bank accounts online in about 10 minutes, or buy stocks from many different companies in a flash. To remain competitive in such an environment, banks like Morgan Stanley have only one choice: be the quickest player on the market. Morgan Stanley realized this fact and puts serious effort into its research activity.
Pricing the assets available on the market is a very complex task, and it is also very important to do it quickly. Calculating a meaningful price one second before the other participants of the market means a clear business advantage. My thesis is intended to analyze the possibility of modelling the pricing process using graphs and to find an effective way to execute the constructed model.
As a very first step, I started analyzing graph structures and algorithms, distributed system architectures, scheduling algorithms, and possible execution models. My focus was caught by the execution models, as I saw improvement possibilities in the usual implementations and innovation possibilities, for example executing the graphs on FPGAs.
During my work I wanted to keep the expressiveness of the original reference model while also improving the performance. This resulted in an expressive reference model, an effective execution model, and a compiler which transforms one model into the other.
In the testing phase, I measured a 20% performance improvement on average in a non-distributed environment compared to the original reference model. In a distributed environment, using advanced schedulers, the improvement compared to the original model is even more outstanding.
At the end of my thesis, I list a couple of development opportunities which would make the performance even better and the system more user-friendly.
Összefoglaló
Our environment is changing at an extraordinary pace these days: today's advanced technologies become obsolete by tomorrow, cities are built where meadows stood within years, and islands even rise out of the sea. This accelerating lifestyle is noticeable in our everyday life as well. Fast-food restaurant, fast car, speed fitness: common words nowadays. In the banking sector this fast rhythm is even more apparent. People open or close accounts within 10 minutes, or buy stocks in moments. For a bank such as Morgan Stanley to remain competitive in such an environment, there is only one option: it must be the fastest player on the market. Morgan Stanley recognized this fact and started serious research activity.
Pricing the products on the markets is a very complex task, and the speed of the process is also a very important factor. Knowing the price of a product one second earlier than the other participants of the market means a serious business advantage. The goal of my thesis is to examine the possibility of modelling pricing with graphs and to develop an efficient execution method for these models.
As a first step, I examined the graph data structure and its algorithms, distributed system architectures, scheduling algorithms, and possible execution models. Among these topics, I found improvement and innovation possibilities in the execution models, for example running the graph on an FPGA.
During my work I considered it important to preserve the expressiveness of the original graph representation while increasing the performance of the system. In the end I split the implementation into three parts: a highly expressive model, an efficient execution model, and a compiler which provides the bridge between the two models.
Testing showed an average performance improvement of 20% in a single-processor environment compared to the original reference implementation. In a multi-processor environment, using an advanced scheduler, the performance improvement was even more striking.
At the end of my thesis I list several development opportunities whose implementation would further increase the performance of the system and provide a better user experience.
1 Introduction
This introductory chapter gives a short overview of the topic, the structure, and the goals of this thesis.
1.1 The structure
The thesis is arranged into chapters which guide the reader from the theoretical background to the implemented solution and its tests. The first chapter is the introduction; it clarifies and details the scope of my task. The second chapter provides the theoretical background and describes the principles and methods used in the thesis. The design and implementation chapter gives insight into the implementation and explains some of the most important design decisions. Chapter four evaluates my solution and tests its performance. The fifth chapter proposes improvements which could further enhance the results.
1.2 The domain
Morgan Stanley[1][2] is one of the biggest investment banks in the world. In their everyday work, they use many different incoming data feeds. If processing these feeds is quick enough, they can help in making decisions based on the actual situation, or at least on the most up-to-date view of the market. When processing speed exceeds human limitations, speeding up the systems is not necessary anymore. Or is it? The answer is definitely: "Yes, it is." It is essential to boost the speed of processing. When the human factor is left out of the equation, whole systems can be built relying on the data streams to lift trading to a higher level. These new systems are called automated trading platforms, and recently they have become more and more important.
As all market data is available to every participant of the market, the number of possible ways to gain advantages over the competitors is very limited. Banks can either fine-tune their systems to boost performance or develop new algorithms and techniques to make better decisions.
1.3 Motivation
Let us assume that we have a mathematical model of a market in which we can express connections between products and their prices. Moreover, assume we have some other solution that expresses the same problem in another way. If we compare these models, there are two important properties to check. The first is how accurate the model is: how well it describes the real market. The second is how fast new product prices can be obtained when the incoming data feed related to a particular product changes.
From Morgan Stanley's point of view, having the most advanced solution (both in terms of accuracy and performance) means a direct advantage on the market. Having realized how much they depend on innovation and new technologies, they started to invest serious effort in research projects. My work compares some possible solutions, tries to maximize the performance of one of them, and also tries to come up with some guidelines about future possibilities.
2 Theoretical background
This chapter is intended to give the reader a better understanding of the terms, laws, and theorems used. I provide the amount of information necessary to understand my work, and I also mark the points which can serve as good starting points for extending my solution.
2.1 Graph theory
As this thesis is all about graphs, the first important step is to understand what a graph is, why graphs are important, and how we can leverage them in our everyday life. The following sections guide us through the parts of graph theory relevant to this work.
2.1.1 History
Understanding a complex problem is often much easier with a small illustration. The "Seven Bridges of Königsberg" problem[3] is one of this kind.
In the 18th century, Leonhard Euler was asked by the citizens of Königsberg the following question: "Is it possible to visit every mainland part of the city while crossing each bridge once, and only once?"
2.1.1 – Map of Königsberg
As you can see on the map, the city is crossed by the river Pregel, and the mainland parts of the city are therefore connected by seven bridges. While trying to answer this question, Euler essentially founded graph theory. Let us take a look at his solution.
Solution of Königsberg problem
First, we need to make our problem domain smaller. We do not need to see the houses of the city, nor the river, nor the bridges themselves. Every mainland part is represented by a simple dot and every bridge by a line. Here comes the above-mentioned principle about illustration: try to redraw the map using only dots and lines. The result should look like figure 2.1.1.1.
2.1.1.1 – Redrawn map of Königsberg
It is much easier to analyze the model without the unimportant details. As Euler observed, if we want to walk through all the bridges (edges), we have to "get into" and "get out of" a mainland part (dot, node), except at the first and the last node, if we do not want to arrive back at our starting point. Let us call the number of edges crossing a node the degree of the node. Euler's observation implies that each node's degree must be even in our map (except the degrees of the first and the last node, if they are not the same).
If we calculate the degrees of all the nodes, we see that each node has an odd degree. In real life this result means that after a while, when we enter a mainland part, we will not have any unused route out of it. This determines the answer to the original question: there is no way to take a walk in Königsberg that visits every mainland part while crossing every bridge, but crossing each of them exactly once. A walk in a graph in which each edge is visited once and only once is called an Eulerian path, or Euler walk in his honor. The above-mentioned observation about node degrees is a necessary and sufficient condition for the existence of Eulerian paths.
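Euler's degree condition translates directly into code. The following Python fragment is an illustrative sketch of my own (it is not part of the thesis implementation, and the encoding of the graph as a list of edge pairs is my choice); it assumes the graph is connected, since connectivity must be checked separately:

```python
from collections import Counter

def has_euler_path(edges):
    """Degree condition for an Eulerian path in a connected undirected
    multigraph: at most two vertices may have an odd degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    return odd in (0, 2)

# The seven bridges of Königsberg: A is the island, B and C the two
# river banks, D the eastern part of the city.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]
print(has_euler_path(bridges))  # all four land masses have odd degree -> False
```

Running the check on the seven bridges reproduces Euler's negative answer, while a simple path graph (two odd-degree endpoints) passes the condition.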
2.1.2 Definition
First, let us define what a graph is:[4]
A graph is an ordered pair G = (V, E) comprising a set of vertices or nodes (V) and a set of edges (E).
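The definition above can be written down almost verbatim. The following Python sketch (illustrative only; the names V, E, and adjacency are my own and not taken from the thesis implementation) stores a small directed graph as the pair of sets from the definition and derives an adjacency view from it:

```python
# A minimal directed graph G = (V, E) as Python sets.
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (2, 4), (3, 4)}

def adjacency(V, E):
    """Build an adjacency list: node -> set of direct successors."""
    adj = {v: set() for v in V}
    for u, w in E:
        adj[u].add(w)
    return adj

print(adjacency(V, E)[1])  # node 1 has edges to nodes 2 and 3
```

The adjacency view is what the search algorithms later in this chapter operate on, since it answers "which nodes are directly reachable from here" in constant time.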
2.1.3 Important properties of graphs
2.1.3.1 Edge direction
Graphs can be grouped by the direction of their edges. An edge E is directed if
E = (v₁, v₂); E₁ = (v₂, v₁); E ≠ E₁; v₁, v₂ ∈ V
and it is undirected if
E = (v₁, v₂); E₁ = (v₂, v₁); E = E₁; v₁, v₂ ∈ V
A graph is considered an undirected graph if it does not contain any directed edges; otherwise it is referred to as a directed graph. A directed graph can be created from an undirected graph by replacing every edge with two new edges which connect the same nodes but point in opposite directions. This conversion can always be done without modifying the meaning of the graph. However, conversion in the reverse direction can modify the underlying meaning of the model.
2.1.3.1 – Converting undirected graphs to directed graphs and vice-versa
In this thesis we mostly use directed graphs, as they express connections between stock prices in a natural way if we imagine a directed edge as an arrow. If an arrow (edge) points from one stock price (node) to another, it means that changes in the source stock price have an impact on the destination stock price.
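The undirected-to-directed conversion described above is mechanical. As an illustrative sketch (my own helper, not part of the thesis implementation), each undirected edge {u, v} is replaced by the two directed edges (u, v) and (v, u):

```python
def to_directed(undirected_edges):
    """Replace every undirected edge {u, v} by the two directed
    edges (u, v) and (v, u), preserving the graph's meaning."""
    directed = set()
    for u, v in undirected_edges:
        directed.add((u, v))
        directed.add((v, u))
    return directed

print(to_directed({("price_A", "price_B")}))
```

Note that the reverse operation (merging opposite edge pairs into one undirected edge) would silently discard any one-way dependency, which is exactly why the text warns that the reverse conversion can change the model's meaning.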
2.1.3.2 Cyclic/acyclic property
A graph is considered cyclic if there exists a sequence of nodes where each pair of consecutive nodes is connected by an edge and the last node of the sequence equals the first node of the sequence. Formally:
X = (v₁, v₂, …, vₙ); (vᵢ, vᵢ₊₁) ∈ E for 1 ≤ i < n; v₁ = vₙ; vᵢ ∈ V
X is called a circle of the graph. A graph is acyclic if it does not contain any
circles.
2.1.3.2 – Graph with circle on the left, without circle on the right
2.1.3.3 DAG property
DAG[5] stands for Directed Acyclic Graph. From the previous definitions the meaning of this property is straightforward: a DAG is a graph which is directed and does not contain any circles. This class of graphs is very important, since many problems are considered solvable only if the underlying graph is a DAG.
Let us suppose we have a collection of tasks that must be ordered into a sequence (and suppose such an ordering exists), together with a collection of constraints, i.e. rules stating that a task must be completed before another task is started. Constraints can be expressed by directed edges between the nodes representing the tasks: the source node of an edge must be completed before the destination node of the edge.
2.1.3.3 – Directed Acyclic Graph
As the source node of an edge must be completed earlier than the destination node, let us start from a randomly selected node n₁ and assume the graph contains a circle through it, so that after a while we reach the starting node again. Let us name nᵢ the last node from which we return to n₁. This implies that n₁ must be completed before nᵢ, as we started from n₁ and reached nᵢ. It also implies that nᵢ must be completed before n₁, as there is an edge directed from nᵢ to n₁. This is an obvious contradiction, which means the graph must not contain a circle.
Given a set of nodes (V) and a set of edges (E), a topological order is a sequence v₁, v₂, v₃, …, vₙ containing every node exactly once, where
for every edge (vᵢ, vⱼ) ∈ E: i < j; 1 ≤ i, j ≤ |V|; vᵢ, vⱼ ∈ V
It can be proved that a topological order exists if and only if the graph is a DAG. It is also a fact that the topological order is not always unique, which means that several orderings can exist for a given graph. When it comes to scheduling, the differences between the orderings give us the opportunity to optimize our solution based on several criteria (e.g. running time, cost, finishing time).
2.1.4 Algorithms
Graphs are powerful structures for representing real life problems such as
simplified stock pricing. For formalized problems it is easier to create general solutions.
In this section, several algorithms are elaborated.
2.1.4.1 BFS
Breadth First Search (BFS) is a method to find a particular node in a graph starting from a given point, or to prove that the node is not reachable from the given starting point.
The algorithm starts by visiting the neighbors directly reachable from the starting point. Let us say the starting point is on level 0. In this case, level 1 contains all nodes directly reachable from the starting point, level 2 contains the nodes which can be directly reached from level 1 nodes, and so on. Generally, level N contains the nodes which are reachable from level N-1 nodes but do not belong to any lower level.
The algorithm stops when it finds the required node or when there are no more reachable, unlabeled nodes. It is possible that the algorithm does not process all the nodes: only nodes reachable from the starting point are processed. This fact gives us tools to identify properties such as reachability from a given node, or to decide whether the graph is fully connected.
There are two important things to know about this labeling procedure. Firstly, the algorithm processes all available nodes on level N, which basically means identifying the nodes on level N+1, before processing any nodes on a higher level. Secondly, the algorithm only goes forward, which means it skips all nodes which are already labeled with a lower number.
2.1.4.1 – BFS ordering – each node on level N gets processed before going onto level N+1, order of
nodes on a particular level is not defined
The original algorithm does not specify which node to process next, thus it can be chosen freely. The choice can be fully random or can use various data to make better decisions. The simplest way is to queue the nodes: when a new node is identified for processing, it is put at the end of the queue, and the first element of the queue is processed next.
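The queue-based variant described above can be sketched in a few lines. This is an illustrative Python sketch (the function name bfs and the adjacency-dictionary encoding are my own, not taken from the thesis implementation):

```python
from collections import deque

def bfs(adj, start, target):
    """Queue-based BFS: return True if `target` is reachable from
    `start` in the directed graph given as an adjacency dict."""
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()       # take the first element of the queue
        if node == target:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in visited:   # skip already-labeled nodes
                visited.add(nxt)
                queue.append(nxt)    # new nodes go to the end of the queue
    return False

adj = {"s": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(bfs(adj, "s", "c"))  # True: c is reachable via s -> a -> c
```

Appending new nodes at the tail and taking work from the head is exactly what produces the level-by-level processing order: all of level N leaves the queue before any node of level N+1.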
From a higher perspective, level numbering can be viewed as a very simple heuristic. If we use problem-specific knowledge, there are wiser ways to shepherd the order of processing; however, a more specific heuristic can easily suggest reordering the processing across levels, even if this leads to an algorithm which is not a BFS anymore. For example, consider route search on a map, which can easily be traced back to a search in a directed graph. There it is wiser to use air distance rather than leveling, but as mentioned earlier, this conflicts with leveling: it is more useful to process nodes with a lower air distance to the destination even if these nodes are on higher levels. However, the air-distance heuristic can be wrong and lead the algorithm into a dead end where we are really close to the destination but cannot reach it.
As we saw, there are trade-offs when designing or using algorithms. BFS can be easily implemented and used in various situations (detecting partitions in graphs, traversing all the nodes, leveling problems, etc.), but it can also be an ineffective mechanism, as illustrated by the map example.
2.1.4.2 DFS
Depth First Search (DFS) is really similar to BFS. The difference between the two algorithms lies in the order of processing. In BFS we use levels to identify the group of nodes to process next. In DFS we can also use the level term, but the applied rules are different.
Let us suppose we have a starting point and we know the level numbers of all the nodes (level numbers are specific to the starting point). DFS picks a node from the directly reachable nodes (level 1). However, rather than picking the next node from level 1, the algorithm picks the next node from the set of nodes directly reachable (on level 2) from the previously selected level 1 node. If there are no more reachable next-level nodes, the algorithm jumps back and selects the next node from a previous level.
The exit criteria are the same as in the BFS algorithm: the algorithm stops if it finds the searched node or there are no more unprocessed nodes. It is also possible that the algorithm does not process every node, as the graph is not necessarily fully connected, thus not every node is reachable from the given starting node.
This means that all the graph nodes reachable through a level 1 node are processed before the next node on level 1. The name of the algorithm comes from this fact: it searches deep inside the graph, while BFS first checks the closest nodes and slowly grows the checked area.
From the implementation point of view, DFS can also be implemented easily with a queue. In this case, new nodes are put at the beginning of the queue, and the next node is also picked from the beginning of the queue (i.e. the queue is used as a stack).
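Mirroring the BFS sketch above, only the insertion side of the queue changes. This illustrative Python sketch (again my own naming, not the thesis code) pushes new nodes to the front, turning the queue into a stack:

```python
from collections import deque

def dfs(adj, start, target):
    """DFS using the same deque as the BFS sketch, but new nodes are
    put at the front, so the search dives deep before backtracking."""
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()          # take from the beginning
        if node == target:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.appendleft(nxt)   # new nodes also go to the beginning
    return False

adj = {"s": ["a", "b"], "a": ["c"], "b": [], "c": []}
print(dfs(adj, "s", "c"))  # True: the search dives s -> ... -> c
```

The single changed line (append versus appendleft) is the whole difference between the two traversal orders, which is why the text calls the algorithms "really similar".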
2.1.4.2 – DFS order with random selection. Level 3 node gets processed before lower level nodes are
finished.
Note that the previously mentioned modification of BFS with the air-distance heuristic silently turned it into a DFS: the heuristic is used when deciding which path to follow forward.
2.1.4.3 Traversal using search algorithms
There are cases when we want to apply a function to every node of a graph rather than just find one particular node.
Technically, when processing a node, we can apply any function to it; for instance, we can print data from the node or accumulate its value into a global variable. The only problem is to ensure that we have visited all the nodes. With a small modification, the described BFS and DFS algorithms can do this for us.
Remember that in both cases the algorithm has exit criteria: finding a node which fulfills the search criteria, or reaching the end of the graph. If we ensure that the search predicate never evaluates to true, or simply skip checking the search criteria, the algorithms are guaranteed not to stop before reaching the end of the graph.
Actually, we still have an issue with this approach: it is not ensured that every node is reachable from the starting point. This is not guaranteed even in a randomly selected undirected graph, since graphs can have separate partitions. The solution is to check whether we have processed all the nodes of the graph when we reach an endpoint (for example by counting the processed nodes and comparing the count to the total number of nodes in the graph). If the graph still contains unprocessed nodes, we randomly select one of them and restart the algorithm using the selected node as the starting point.
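The restart idea above can be sketched directly. In this illustrative Python fragment (my own helper, not the thesis code; I track visited nodes in a set instead of counting, which is equivalent for the restart check), the outer loop restarts a BFS from any node not yet seen, so every partition gets traversed:

```python
from collections import deque

def traverse_all(adj, visit):
    """Apply `visit` to every node of the graph, restarting the
    traversal from an unvisited node whenever a partition is exhausted."""
    nodes = set(adj)
    for targets in adj.values():
        nodes.update(targets)        # include nodes that only appear as targets
    seen = set()
    for start in nodes:
        if start in seen:
            continue                 # already covered by an earlier restart
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            visit(node)              # the applied function: print, accumulate, ...
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)

visited_order = []
traverse_all({1: [2], 3: []}, visited_order.append)  # graph with two partitions
print(sorted(visited_order))  # [1, 2, 3]
```

Without the outer restart loop, node 3 in the example would never be visited, since it sits in a separate partition from node 1.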
2.1.4.4 Topological ordering
As we saw earlier, a topological order carries important and meaningful properties such as the DAG property. The following algorithm gives a simple way to generate a topological ordering if one exists.
We describe the algorithm using scheduling as an example. The initial setup is the usual one: we have a directed graph where the nodes are the tasks and the directed edges express precedence between the tasks.
Firstly, the algorithm selects all the nodes which have no outgoing edges. These nodes will be scheduled last, since no other task needs to come after them. If no such nodes exist, then no topological order exists, which means the graph contains at least one directed circle. We then leave these nodes out of the graph and repeat the previous step, again and again, until no nodes remain in the graph.
2.1.4.4 – Execution flow of topological ordering algorithm
It is easy to prove that the algorithm gives a topological order of the graph, and it can be implemented efficiently. As mentioned earlier, the topological order is not unique: given the result of this algorithm, multiple topological orders can be created if we know which nodes belong to which iteration. Nodes added to the schedule in the same iteration can be shuffled, as they do not influence each other's execution.
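The sink-removal algorithm above can be written down directly. This illustrative Python sketch (my own naming; it sorts each iteration's sinks only to make the output deterministic, which is one of the many valid shufflings mentioned above) prepends each iteration's sinks, since nodes removed earlier must come later in the schedule:

```python
def topological_order(V, E):
    """Topological order by repeatedly removing sink nodes (nodes with
    no outgoing edges), as described above. Raises on a directed circle."""
    remaining = set(V)
    edges = set(E)
    order = []
    while remaining:
        # nodes with no outgoing edges can safely be scheduled last
        sinks = {v for v in remaining
                 if not any(u == v for (u, _) in edges)}
        if not sinks:
            raise ValueError("no topological order: directed circle found")
        order = sorted(sinks) + order        # earlier iterations end up later
        remaining -= sinks
        edges = {(u, w) for (u, w) in edges if w not in sinks}
    return order

print(topological_order({1, 2, 3, 4}, {(1, 2), (1, 3), (2, 4), (3, 4)}))
# [1, 2, 3, 4]
```

Running it on the diamond graph 1 -> {2, 3} -> 4 first removes node 4, then nodes 2 and 3 together (they may be swapped freely), and finally node 1, while a two-node circle triggers the error branch.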
2.1.5 Executable graphs
Executable graphs extend ordinary DAGs by storing a function inside each node and edge. In general, these functions can perform many operations, from logging to changing the element's value based on some criteria. Executing the graph means visiting all the elements of the graph and running their stored functions.
Execution starts at the nodes which are marked as inputs. When all the nodes have run, we create a collection from the values of the nodes which are marked as outputs. This collection is the result of the execution.
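As a minimal illustration of this idea (the class and function names below are hypothetical, not taken from any concrete framework), a node can store a function that computes its value from the values of its dependencies:

```python
class Node:
    """Hypothetical executable-graph node: stores a function which
    computes the node's value from the values of its dependencies."""
    def __init__(self, name, func, deps=()):
        self.name, self.func, self.deps = name, func, list(deps)
        self.value = None

def execute(nodes, outputs):
    """Run every node once its dependencies have run; collect the
    values of the nodes marked as outputs."""
    done = set()
    while len(done) < len(nodes):
        for node in nodes:
            if node.name not in done and all(d.name in done for d in node.deps):
                node.value = node.func(*(d.value for d in node.deps))
                done.add(node.name)
    return {n.name: n.value for n in nodes if n.name in outputs}

# input nodes carry constants; the inner node combines its dependencies
a = Node("a", lambda: 1)
b = Node("b", lambda: 2)
s = Node("sum", lambda x, y: x + y, deps=[a, b])
print(execute([a, b, s], outputs={"sum"}))  # {'sum': 3}
```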
Execution raises interesting questions. Consider a huge graph with multiple inputs and outputs. What happens if an input changes? Do we really want to recalculate the whole graph, or is it possible to recalculate only the affected parts? The starting point of the execution is fixed, but the execution flow can take very different shapes depending on the algorithm used. In this paper, I search for a proper way to optimize the execution of this type of graph. By the end of this paper, the reader will have a good understanding of the possibilities, challenges and pitfalls of the problem.
2.2 Distributed environments
Computer science has been changing at a very high pace since the first computer was turned on. According to Moore's law[6], computing capacity doubles roughly every two years. This law has held up well over the last couple of decades.
This also means a shift in the problems of computer science, alongside the development of the underlying hardware. The first computers were giant, monolithic machines which executed programs sequentially: programmers had to explicitly declare the order of the commands to be executed. In addition, back in those days the machines were exclusively reserved for the program under execution.
This may sound strange, since today users listen to music while chatting on the internet and editing a document in a text editor, probably on a device which fits into their pocket. This example frames the shift very well: in the old days, a programmer knew the program would run on a dedicated machine and the interpreter would execute the commands in a well-defined order. In the era of the "internet of things", this model has changed dramatically. Multiuser systems, multitasking, parallel programming, remote execution and asynchronous execution are concepts which today's developers need to be aware of and deeply understand in order to develop high-end, modern applications and leverage hardware capabilities.
2.2.1 Software solutions
It is usually hard to clearly separate hardware and software solutions, as each requires support from the other. From an IT perspective, the 21st century is characterized mainly by the internet, mobile and, recently, the cloud boom[7]. Hardware development continues as well, but as hardware capabilities are now sufficient for most everyday usage, software engineers' focus has shifted to serving other purposes too, such as user experience, scalability and reliability. Altered, or at least extended, goals require new approaches.
2.2.1.1 Parallel programming
While in most cases a mobile device has more computing resources than the first mainframe computers had, these devices are also used to solve more complex problems. To leverage hardware capabilities, programmers need to be aware of parallel programming[8]. In the sequential model there exists an exact order of the commands, defined by the developers and optimized by modern compilers. When the execution of two commands does not affect each other, a natural way to reduce the execution time is to use the available free processors to compute distinct parts of a complex expression.
2.2.2.1.1 – Parallel execution of (1+2)*(3+4) with one and two CPUs
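The figure's example can be sketched with Python's standard thread pool: the two additions are independent, so two workers can run them simultaneously before the final join (a toy illustration, assuming the standard `concurrent.futures` module):

```python
from concurrent.futures import ThreadPoolExecutor

def add(x, y):
    return x + y

# (1 + 2) and (3 + 4) do not depend on each other, so two workers can
# compute them at the same time; the multiplication is the join point.
with ThreadPoolExecutor(max_workers=2) as pool:
    left = pool.submit(add, 1, 2)
    right = pool.submit(add, 3, 4)
    result = left.result() * right.result()
print(result)  # 21
```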
While multitasking does not require a paradigm change among developers, parallel programming does. Designing parallel algorithms is a very creative, intuitive and innovative process. While best practices exist, there is no universal way to design parallel algorithms or to convert sequential algorithms into parallel ones. However, in most cases, developers get a huge payback for the effort invested in designing the parallel algorithms. Depending on the level of parallelization, there is a fairly simple way to estimate the profit: given the sequential execution time 𝑇 and 𝑛 processors, a simple theoretical upper bound on the obtainable speed-up is 𝑇 − 𝑇/𝑛, since the parallel execution time cannot drop below 𝑇/𝑛. In practice, this bound cannot be exceeded or even reached, and usually, even with the wisest design, we cannot get near this limit; however, we can influence the order of magnitude of the execution time. The first reason for this limitation is that parallel programming introduces synchronization and governing overhead. The second is that the parts of a parallel-executed problem need to be joined at a certain point: even if we could completely parallelize a problem, reaching the end of an execution line causes some processors to be idle while others calculate the joined results. In real-life problems, only parts of the algorithms can be parallelized, so during the execution, parallel and sequential parts follow each other; the more alternating parts, the more synchronization overhead is required. This explains well why parallel programming introduces serious overhead. It is worth mentioning that we can counteract some of the overhead by overlapping execution cycles: overlapping means we start to execute the next cycle before the current cycle finishes, to leverage the idle time of the resources.
2.2.2.1.2 – Alternating between sequential and parallel execution
2.2.1.2 Threading
While parallel programming leverages multiple CPUs, threading aims to run programs in parallel on a single CPU. The concept is very similar to multitasking: multitasking is the idea, threading is the implementation.
In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system.[9]
Threads could be defined in other ways as well; however, this definition carries much additional information about threads.
The first important fact is that a thread is not a program or source code, but the execution of them. Threads exist only at execution time, which implies that debugging multithreaded programs is hard, since static analysis is much harder or even impossible to perform. It also means that a big execution flow can be broken into smaller parts, manually or semi-automatically. The concept of threads raises the need for schedulers, which we will discuss in more detail in the following chapter. As the definition above shows, schedulers and threads are integral parts of every modern operating system: threading makes it possible to execute multiple programs on a single CPU or to execute a single program on multiple CPUs.
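A minimal sketch of the concept using Python's standard `threading` module: several independently scheduled execution flows run the same code and synchronize on a shared result list (the worker and its data are made up for illustration):

```python
import threading

lock = threading.Lock()
results = []

def worker(name, items):
    # each thread is an independently scheduled sequence of instructions
    total = sum(items)
    with lock:                 # synchronization protects the shared list
        results.append((name, total))

threads = [threading.Thread(target=worker,
                            args=(f"t{i}", range(i * 10, i * 10 + 5)))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()                   # wait until the scheduler has run them all
print(sorted(results))  # [('t0', 10), ('t1', 60), ('t2', 110)]
```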
Going deeper into the hardware and analyzing how threading works, it becomes clear that threading can be very expensive: beyond a certain number of threads, creating new ones will seriously degrade the performance of the system. This is caused mainly by context switches. When a CPU executes a program, it loads the program and its data into memory, populates the required registers, positions the PC (Program Counter) and then steps through the instructions. But what happens if, in the middle of the program, the CPU is preempted and needs to start executing another program, knowing that we want to finish the interrupted program later? The CPU has to save the current execution state, load the environment of the new program (which is probably a previously preempted program) and execute it. This is called a context switch.
It is very important to keep the number of context switches at a reasonable level. The time the CPU spends on changing between contexts is the overhead of threading. If the computer has only one CPU installed, this overhead is the price of multitasking, which of course means users cannot leverage 100% of the computing capabilities. Moreover, we cannot expect a boost in execution time (though we can expect one in throughput). When using multiple CPUs, the advantages of threading are easier to notice: besides the ability to multitask, we can also gain a serious performance boost.
2.2.1.3 Remote execution
Mobile devices and personal computers can carry huge computing and storage capabilities, but there are cases when it makes sense to separate concerns: to store sensitive data in replicated and safe data centers, to execute complex calculations in the cloud and to interact with the user through personal devices.
In this approach, the actual footprint of the system is determined only at runtime and can be changed dynamically. It also requires communication between the parts of the system. Vigilant readers may spot that this model is a scaled-up version of an everyday computer: these concerns are also separated in personal computers, but the distance between the parts is much smaller. If a user in the EU wants to access a file stored in the US, a request must be sent to the data center and the data must be transferred over at least 4-5 thousand kilometers. One challenge is to overcome this issue and provide a solution competitive with storing the data locally. Nowadays internet access is fast enough to leverage the advantages of the described scenario and mask the latency. There are some serious advantages as well: data centers provide 24/7 support, insurance, replication, competent professionals, high standards and almost unlimited storage capacity which can be easily extended based on user needs.
Speaking of clouds, the advantages are probably not clear to everybody at first sight. People usually run multiple programs at the same time. Consider not just everyday users but also professionals: creative people, such as 3D artists and image editors, run complex algorithms which can take days even on a high-end computer; or, to stick with the topic, building and compiling complex computer programs can take minutes, hours or even days. For these scenarios, cloud computing is a viable solution. But what is the cloud? By definition, the cloud is a pool of resources which can be dynamically allocated to users based on their needs. If developers want to build and compile a complex program, they simply allocate the resources they need and run their task. Everybody pays for the allocated resources. In 3D modelling, rendering an image can take days on limited resources; in the cloud, rendering time can be influenced by the amount of allocated resources. If you need to demonstrate the current state of your scene quickly, you can allocate 10 times more resources than usual and demonstrate your results in hours. You have to pay more, but only for the period when you use the resources. This means a reduction in costs and, at the same time, a serious improvement in productivity.
Is it worth investing a serious amount of energy, time and money to make remote execution possible? Fortunately, these technologies are designed to be transparent to the user, and as much as possible to the developers as well. Remote execution requires complex underlying systems, but companies like Google[10], Microsoft[11], Amazon[12], etc. offer these systems out of the box. This does not mean developers never have to modify their code to leverage these technologies, but it means they do not have to be aware of how these functionalities are implemented. Usually they get an API and build their systems against it, while behind the scenes remote calls, remote storage and remote calculations are used.
2.2.2.3 – Remote procedure calling architecture
The idea is to abstract calls away and replace the actual implementation with an interface which exposes the same public methods, properties, etc. Developers see the same methods and properties, and they call and use them the same way as they usually do. But behind the scenes, the actual implementation of the interface simply packs the request into a package and sends it to an execution server, which can be in the cloud. The execution server instantiates the original implementation and calls the method with the parameters of the original call. Once the execution is finished, the result travels back to the client-side stub and the original calling object sees the result. The beauty of this idea is that neither the calling object nor the serving object needs to know that the call is made remotely. Moreover, if the design of the original code satisfies certain criteria, support for remote procedure calls or remote storage can be injected automatically.
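The proxy mechanism described above can be sketched as follows. This is a toy, in-process illustration with hypothetical class names; a real system would send the packed request over a network:

```python
import json

class Calculator:
    """The real implementation, which lives on the server side."""
    def add(self, x, y):
        return x + y

class Server:
    """Hypothetical execution server: unpacks a request, instantiates
    the real class and invokes the requested method."""
    def handle(self, payload):
        req = json.loads(payload)
        result = getattr(Calculator(), req["method"])(*req["args"])
        return json.dumps({"result": result})

class CalculatorProxy:
    """Client-side stub: same public interface, but it packs the call
    into a message instead of computing locally."""
    def __init__(self, server):
        self.server = server
    def add(self, x, y):
        payload = json.dumps({"method": "add", "args": [x, y]})
        return json.loads(self.server.handle(payload))["result"]

calc = CalculatorProxy(Server())   # calling code is unchanged
print(calc.add(2, 3))              # 5
```

Note that the caller uses `calc.add` exactly as it would use the local `Calculator.add`; the packing and unpacking are invisible to both sides.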
2.2.1.4 Asynchronous calls
Imagine a scenario where you have only one CPU and you build a system which leverages remote procedure calls and remote storage, and you also want to support multitasking via threading. Let us say you want to calculate and show the aggregated risk for your whole business. This requires loading data from your remote storage, executing some complex calculations on it and populating a fancy chart with the results. You have run the business for 5 years, you have 10 gigabytes of historical data and the calculation takes 10 minutes in the cloud. During the calculation, you want to switch to the client communication module inside your application and reply to some user feedback. So you kick off the calculation process and try to switch to the other module, but the application is frozen.
This is an unfortunately very common use-case. Developers often forget that remote execution does not solve every problem. In this case, you started a process which takes minutes to complete due to the remote storage and complex calculations, and the developers of your application were not aware of threading and asynchronous methodology. When you click the button which starts your calculations, the caller is blocked until the results arrive: the user is blocked for at least 10 minutes. This is called a synchronous method call.
Threading can be a solution: when you start the process, the application creates a new thread and executes the call on it. This means that the main thread, which in almost every case is the UI thread, will not be blocked.
An alternative solution is the asynchronous call[13], which basically does the same but abstracts the details away. Programming languages usually contain keywords to mark a call as asynchronous. Behind the scenes, the compiler replaces the keyword with a wrapper around the call, starts and initializes a new thread and executes the method on it.
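A minimal sketch of the idea using Python's `asyncio`, where the keywords are `async`/`await`; the short sleep stands in for the long remote risk calculation, and the function names are hypothetical:

```python
import asyncio

async def aggregate_risk():
    # stands in for the 10-minute remote calculation
    await asyncio.sleep(0.05)
    return 42

async def main():
    # kick off the calculation without blocking this coroutine
    task = asyncio.create_task(aggregate_risk())
    handled = "replied to user feedback"   # keep working meanwhile
    result = await task                    # join point: results arrive
    return handled, result

handled, result = asyncio.run(main())
print(result)  # 42
```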
Nowadays it is a very important requirement for modern applications to be smooth and reactive: if an application hangs for more than one second, users think something went wrong. The asynchronous notion is very similar to parallel programming, as your program does something in parallel while the results become available from another function call. Thinking about remote execution and clouds, it absolutely makes sense to use your local resources for other purposes while the remote execution runs.
2.2.2 Advantages and drawbacks
As we saw above, it is worth the effort to invest in distributed environments. In this section, I summarize the advantages and drawbacks of distributed systems and try to highlight when to use these techniques.
Nowadays, multiuser systems do not mean the same as in the past: a single PC is a multiuser system in the sense of the old definition, while today's multiuser systems are huge, globally or at least widely available systems serving thousands of users. When building such systems, we cannot avoid distributing concerns. Scalability and reliability usually come with a wisely designed distributed system: if more users are interested, the support team adds a new server to serve the higher number of requests. This model is also more reliable, since if one server goes down, others can step up and serve the broken server's requests, ensuring that the support team has enough time to replace the broken one without users noticing that the original server was down.
Distributed systems can naturally boost the performance of several algorithms. Although there are limitations, the advantages are clear: with some investment into a distributed algorithm, the order of magnitude of the execution time can be seriously improved.
Distributed systems can provide a more fluent and transparent experience. It is also more rational to let professionals take care of our data storage, while we focus on developing our business logic and cloud professionals take care of the execution background of our solution.
On the other hand, using these technologies can lead to overcomplicated source code, algorithms and systems. It is a common mistake to fine-tune an algorithm for parallel execution which will be executed only a few times. Developers need to learn how to leverage the advantages while minimizing the drawbacks. For example, it is worth fine-tuning an algorithm which is called every five seconds during the application lifetime and currently takes 4 seconds to complete. But it is not worth doing so for an initialization part of the software which runs only once a day, even if it takes ten minutes to execute; and if it is that important to start the application quickly, we can write a script which launches it before the user gets into the office.
There are cases when parallelization seems really necessary and developers put serious effort into the design, yet at the end of the day it turns out that the solution is wrong, that the problem cannot be parallelized efficiently, or that the effort costs more time, resources or money than the actual profit.
It is also worth mentioning that implementation, testing and debugging are much more complicated than in the case of a single, standalone, non-distributed application. Moreover, maintaining distributed systems requires professionals who really understand how the system works, communication between vendors to solve problems, etc. These factors highly influence the costs of a distributed system.
As a rule of thumb, we should think before deciding to build distributed systems. Developers should mindfully analyze the requirements, think about the edge cases, the advantages as well as the drawbacks, and decide wisely.
2.3 Scheduling
Scheduling is the process of defining the order of tasks[14]. People use scheduling all the time: recipes define the order in which you put ingredients into the meal, a bus schedule defines when a particular bus arrives at the station, etc. In general, scheduling regulates access to a set of resources by setting up rules.
Before the dawn of parallel programming, programmers defined the order of execution implicitly in their source code: if a command precedes another one in the source, its execution will also precede the other one. The main goal of parallel programming is to allow unrelated code blocks to execute simultaneously. In order to achieve this, programmers have to explicitly design the code and mark the parallel parts of the system. This model is useful, but leaves the responsibility for deciding when to parallelize code in the hands of developers; this means freedom, but can also lead to errors or missed opportunities. It would be good to take over the responsibility of marking parallel blocks automatically and just allow developers to mark the blocks where they forbid parallel execution. Scheduling can provide this functionality from one aspect.
A good order of tasks can seriously improve the execution time, while a bad one can seriously degrade the performance. A badly designed schedule has a huge impact on the throughput of the system.
2.3 – Various execution times for different schedules
Figure 2.3 illustrates the case when two processing units are working on a set of tasks. The longest task is executed first, which stalls the other processor, because the second unit depends on a task which has not yet been calculated by the first one. Switching the order of tasks on the first processing unit dramatically lowers the finishing time of the job.
2.3.1 Cost of scheduling
As the previous example showed, scheduling can have a huge impact on performance. It is also important how we define the actual schedule, how we measure how good the actual order of tasks is, and how quickly we can generate an acceptable schedule. The advantages of scheduling strongly depend on these factors.
2.3.1.1 Defining the schedule
Schedulers are usually functions which answer the very basic question each processing unit asks: "Which task shall I run next?"
There are a number of possible ways to come up with an answer; in the Algorithms part of this section, I describe some of the available algorithms. For example, the simplest algorithm is the random scheduler: it selects a task randomly from the available ones and does not consider any optimization. The greedy scheduler has one dedicated metric to optimize and greedily selects the best available task for it. Many more advanced algorithms exist, like the genetic algorithm, which tries to find the best solution using evolution.
2.3.1.2 Measure the goodness of the schedule
The goodness of a schedule is very subjective. While in a typical backend system a good schedule may take hours to complete, in a frontend system we usually consider an application good if the response time is under one second. But time is only one aspect: others may focus much more on energy consumption, disk space consumption or the cost of the execution.
We can compare the performance (based on any measurable criterion) of the old and the new implementation, and calculate how well the new implementation performs compared to the previous one. But this does not really say anything about the best solution. Let us say we use overall execution time as the metric for goodness. If we use 𝑁 processors, the theoretical minimum is 𝑇/𝑁, where 𝑇 is the sum of the execution times of all the subtasks. It can be proved that this minimum is not always reachable, but it acts well as a lower bound for the goodness.
2.3.1.3 Generate the schedule
Another aspect is the creation of the schedule. Even if it is possible to create the best schedule for a given problem, it is not obvious that we want to reach it.
A random scheduler can be implemented easily and can quickly select the next task to run. In contrast, a genetic algorithm requires much more time to create a schedule. The complexity of the generator algorithm highly depends on the actual problem we try to solve. If we have a graph with 5 nodes and 6 edges, it is easy to implement a method that checks all available combinations and chooses the best one. If we have a bigger graph, say 100 nodes and 500 edges, the complexity of the tester and the required time skyrocket: a rough estimate for the first case is 5! = 120 possible combinations, while the second case has 100! ≈ 9.33 ∗ 10^157 combinations. That is far too many to check exhaustively.
Usually schedulers allow us to search for approximate, sub-optimal results, which are within defined bounds of the optimal solution. We can limit the time for the scheduler and hope that the result is good enough.
2.3.2 Type of schedulers
Schedulers can be grouped by many properties.[15] In this section, I describe one possible grouping, which focuses mainly on the role of the scheduler throughout the lifetime of the program.
2.3.2.1 Online schedulers
Online schedulers actively participate in shaping the execution flow during the lifetime of the program. Their role is to monitor the status of the system, update their metrics accordingly and make scheduling decisions based on the latest available state.
Online schedulers do not have a predefined order of tasks, and they can adapt to new situations that occur during the execution. They usually try to make decisions as good as possible, but their goal is not to reach the optimum; it is to be as flexible as possible.
Online schedulers are mainly used in interactive systems and wherever the scheduling algorithm allows, and needs, to be applied multiple times. The algorithms usually aim to be simple: round robin, time slicing, etc. This type of scheduler is also important when the system is modified during the execution.
Operating systems are a good place for online schedulers. Computers have multiple CPUs with multiple cores, many processes try to access the resources, and the scheduler usually runs each time a new request comes in, or periodically, to schedule the running tasks. Good schedulers avoid the starvation of processes, provide fairness and support prioritization.
2.3.2.2 Offline schedulers
Offline schedulers usually run only once in the lifetime of the program, or once per execution period. They are mainly used when the problem is complex, the domain of the problem does not change during the execution period, and the scheduling algorithm requires a serious amount of time or resources to complete. They are also a good choice when the execution is periodic.
Compared to their online counterparts, they are less flexible, as they cannot react to changes in the environment during the execution period. In exchange, they usually aim for better decisions than online schedulers: they analyze the model of the problem and create a suboptimal schedule.
An example of their usage is a car factory: in this case, the scheduler is the production engineer who sets up the layout of the factory and defines the subtasks of creating a new car. In computer science, an example could be the field of executable graphs. These graphs are usually used more than once, as their creation is very resource-intensive, and they do not really change between executions. The scheduler can create a schedule for the graph, and the executor can use this information at execution time.
2.3.3 Algorithms
Algorithms used in schedulers are specialized for a given problem, but they all follow some common schemes. The following algorithms can be found in several products on the market.
Note that each algorithm takes a list of tasks and figures out an order for the execution, so we can consider all of them offline schedulers. For simplicity, each algorithm description uses overall execution time as the key property to optimize and works on a DAG. Let us suppose we have 𝑁 processing units.
2.3.3.1 Random scheduler
The random scheduler is the simplest to implement. It takes the input nodes of the graph and marks them as available to process, as they do not have any dependencies. It stores the marked nodes in a queue (FIFO container). The algorithm simulates one run of the graph and creates a mapping between the nodes and the processing units.
At the beginning of the algorithm, each processing unit is available. The algorithm takes the first available node from the queue and randomly selects a processing unit for it. The scheduler then checks the nodes which depend on the scheduled one: if one of them becomes available due to the processing of the node, it is pushed into the queue; otherwise the scheduler notes that one dependency of the node is ready. The next step is to schedule the next available node from the queue. The algorithm stops when all the nodes are scheduled.
This algorithm is not very clever, as random selection is not optimal: the worst-case scenario is to schedule all the nodes to the same processing unit and leave the others idle. It does not consider the possible parallelization opportunities either. One improvement to support parallelization is to select the next node from the available queue randomly as well, which creates the possibility of executing unrelated subgraphs simultaneously.
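A possible sketch of the random scheduler, including the mentioned improvement of also picking the next available node randomly (the data representation and function name are my own assumptions):

```python
import random

def random_schedule(deps, n_units, seed=0):
    """deps maps node -> set of prerequisite nodes. Returns a random
    node -> processing-unit mapping; a node is only scheduled once all
    of its prerequisites have been scheduled."""
    rng = random.Random(seed)
    remaining = {n: set(d) for n, d in deps.items()}
    available = [n for n, d in remaining.items() if not d]  # input nodes
    assignment = {}
    while available:
        # improvement from the text: pick the next available node randomly
        node = available.pop(rng.randrange(len(available)))
        assignment[node] = rng.randrange(n_units)    # random unit
        for other, d in remaining.items():           # release dependents
            d.discard(node)
            if other not in assignment and not d and other not in available:
                available.append(other)
    return assignment

sched = random_schedule({'a': set(), 'b': {'a'}, 'c': {'a'}}, n_units=2)
print(sorted(sched))  # ['a', 'b', 'c']
```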
2.3.3.2 Greedy scheduler
While the random scheduler only uses the simulation to visit all the nodes, the greedy scheduler tries to leverage the simulation of the execution more.[16] It also maintains a list of available nodes and adds the input nodes to the available queue in its initialization phase.
The main difference is in how the next node to schedule is found. Let us define a variable 𝑇, which keeps track of the elapsed time in the schedule. At the beginning of the simulation, the scheduler selects 𝑁 nodes and distributes them among the processing units. This distribution can be random or based on various criteria: some implementations distribute based on the required execution time (shortest job first, longest job first) or on how long the node has been waiting to execute. The selection of the processing unit is no longer random. At a given point in time, only 𝑁 nodes can run, as we have a limited number of processing units; if we have unscheduled but available nodes and no available processing units, we increment the variable 𝑇. The simulation also has to know, or at least have a meaningful estimate of, how long the execution of a particular node takes. Based on this information and the starting time of a node, the simulator can calculate when a processor will become free again. Incrementing 𝑇 for a while will therefore eventually make a processor available again, so we can schedule the remaining nodes. This iteration continues until all the nodes are scheduled.
This method is better than the random scheduler, as it tries to balance the usage of processing resources. It makes parallel execution possible in the sense that independent graph parts can be executed simultaneously. It can also be implemented efficiently.
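One possible reading of this greedy simulation, using a shortest-job-first pick and assigning each node to the unit that becomes free first. This is an interpretation sketch under my own assumptions, not the thesis's exact algorithm; instead of stepping 𝑇 unit by unit, it jumps the clock forward by tracking when each unit frees up:

```python
def greedy_schedule(duration, deps, n_units):
    """Simulate one run of the graph. deps maps node -> prerequisites.
    Returns the makespan and a node -> (unit, start time) plan."""
    remaining = {n: set(d) for n, d in deps.items()}
    available = [n for n, d in remaining.items() if not d]
    free_at = [0.0] * n_units      # time when each unit becomes free
    finish, plan = {}, {}
    while available:
        available.sort(key=lambda n: duration[n])    # shortest job first
        node = available.pop(0)
        unit = min(range(n_units), key=lambda u: free_at[u])
        # a node cannot start before its unit is free and its deps finished
        start = max(free_at[unit],
                    max((finish[d] for d in deps[node]), default=0.0))
        free_at[unit] = start + duration[node]
        finish[node] = free_at[unit]
        plan[node] = (unit, start)
        for other, d in remaining.items():           # release dependents
            d.discard(node)
            if other not in finish and not d and other not in available:
                available.append(other)
    return max(finish.values()), plan

makespan, plan = greedy_schedule({'a': 1, 'b': 2, 'c': 1},
                                 {'a': set(), 'b': set(), 'c': {'a', 'b'}},
                                 n_units=2)
print(makespan)  # 3
```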
2.3.3.3 Critical path scheduler
This type of scheduler originates from the critical path.[17] A prerequisite is to know the execution time of each node in the graph.
If we look at a schedule, we can usually reorder some tasks without modifying the finishing time of the schedule, but there are tasks which cause delay in the execution if they are executed later than expected. The critical path is the list of tasks which delay the whole schedule if started later than expected. The scheduler focuses on the critical path and ensures that an available processor exists when a new critical element becomes available.
To identify the critical path, we have to calculate some metrics for each node. Let the EST (Earliest Starting Time) of the input nodes be zero. Every subsequent node should be started as soon as it becomes ready, so its EST is the maximum of the EETs (Earliest Ending Times) of its dependency nodes; the EET can be calculated from the EST and the execution time of the node. Another important metric is the LST (Latest Starting Time). Let the LST of the output nodes equal their EST. The LST of a preceding node is calculated from the LSTs of its successor nodes: subtract the node's execution time from the LST of each successor and take the minimum of the calculated values. The MDT (Maximum Delay Time) is calculated from the EST and LST: 𝑀𝐷𝑇 = 𝐿𝑆𝑇 − 𝐸𝑆𝑇. If the MDT of a node is zero, a delay in its start results in a delay of the whole schedule.
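The EST/EET/LST/MDT computation can be sketched directly from these definitions. A minimal illustration; the graph representation is my own assumption, and the LST rule follows the convention stated above (LST of a node = minimum over successors of the successor's LST minus the node's own duration):

```python
def critical_metrics(duration, deps):
    """Return node -> (EST, LST, MDT). deps maps node -> prerequisites.
    Nodes with MDT == 0 form the critical path."""
    succ = {n: [] for n in deps}
    for n, ds in deps.items():
        for d in ds:
            succ[d].append(n)
    # simple topological order by repeatedly taking ready nodes
    order, done, pending = [], set(), set(deps)
    while pending:
        ready = [n for n in pending if set(deps[n]) <= done]
        if not ready:
            raise ValueError("graph contains a directed cycle")
        order.extend(ready)
        done |= set(ready)
        pending -= set(ready)
    est, eet, lst = {}, {}, {}
    for n in order:                       # forward pass: EST and EET
        est[n] = max((eet[d] for d in deps[n]), default=0)
        eet[n] = est[n] + duration[n]
    for n in reversed(order):             # backward pass: LST
        if not succ[n]:
            lst[n] = est[n]               # output nodes: LST equals EST
        else:
            lst[n] = min(lst[v] for v in succ[n]) - duration[n]
    return {n: (est[n], lst[n], lst[n] - est[n]) for n in deps}

# a (3 units) and b (1 unit) both feed c (2 units):
# a is critical (MDT 0), b can slip by 2 without delaying the schedule
metrics = critical_metrics({'a': 3, 'b': 1, 'c': 2},
                           {'a': set(), 'b': set(), 'c': {'a', 'b'}})
print(metrics['a'], metrics['b'], metrics['c'])
# (0, 0, 0) (0, 2, 2) (3, 3, 0)
```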
This class of schedulers tries to satisfy the starting time requirements, but if we restrict the number of processing units, it is possible that we must introduce some delay to ensure the correct order of the execution. This scheduler gives very good results and allows us to leverage parallel execution. Moreover, it helps to calculate how many processing units are needed to solve the problem effectively, and it can be visualized easily using Gantt diagrams.
2.3.3.4 Genetic algorithm scheduler
The genetic algorithm scheduler uses a very different approach to create a
schedule: it tries to model evolution.[18] Genetic algorithms use genetics, reproduction
and natural selection to converge to an optimal solution.
The previous algorithms tried to create the solution in one step. A genetic
algorithm initially creates a set of solutions, usually randomly, which is called the initial
population, and improves it iteratively. In our case, a solution consists of a mapping
between the nodes and the processing units, and it can also include an order of the nodes.
A single solution is considered a genome. The goal of the algorithm is to create the best
genome.
How can we measure the goodness of a genome? In genetic algorithm terms,
goodness is called the fitness of the genome. Just as the stronger beats the weaker in
evolution, the more fit a genome is, the more likely it survives into the next generation.
In our example, we can compare genomes by comparing the execution times of the
schedules they encode: a lower execution time means a higher fitness value.
Once we have the initial population, the genomes can start reproducing. This
process is called crossover. When two genomes are paired, they can be combined into a
new genome. The offspring contains a random proportion of the information stored in its
parents. We select one random crossover point, where we split the parent genomes. In
practice, if we store the mapping in an array, crossover means taking the first part of the
array from the first parent and the second part from the second parent, and creating a new
array from them. If we apply this method to the array in which we store the order of the
nodes, we have to be very careful to ensure that the resulting order stays consistent with
our restrictions. After the offspring is born, the algorithm applies some random mutation.
This is an important step, as it helps the algorithm escape local maxima and minima and
adds more variation, which provides the chance to evolve. A mutation can improve but
also degrade the fitness of the genome.
This reproduction cycle continues until the population reaches a predefined size.
Once this happens, natural selection is performed among the genomes. Fitter genomes
make it into the next generation, while less fit genomes drop out of the population. The
selection process has to ensure that a predefined number of genomes survives the natural
selection. The new generation then starts the described cycle over: it reproduces for a
while, and then natural selection happens again.
Every generation is going to be at least as fit as the previous one. The algorithm
stops after a predefined number of generations, after a predefined time limit or when the
generation reaches a predefined acceptable fitness.
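The cycle described above can be sketched as follows. This is a minimal illustration, not the thesis scheduler: the genome maps tasks to processing units, the fitness ignores task dependencies for brevity and simply measures the makespan, and all names are my own assumptions.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// A genome maps each task index to a processing unit.
using Genome = std::vector<int>;

// Fitness measure (lower is fitter): the busiest unit's total time.
// For brevity this ignores the dependency constraints of the real problem.
int makespan(const Genome& g, const std::vector<int>& cost, int units) {
    std::vector<int> load(units, 0);
    for (std::size_t i = 0; i < g.size(); ++i) load[g[i]] += cost[i];
    return *std::max_element(load.begin(), load.end());
}

Genome evolve(const std::vector<int>& cost, int units, int popSize,
              int generations, std::mt19937& rng) {
    const int n = static_cast<int>(cost.size());
    std::uniform_int_distribution<int> unit(0, units - 1);
    std::uniform_int_distribution<int> gene(0, n - 1);
    std::uniform_int_distribution<int> pick(0, popSize - 1);
    auto fitter = [&](const Genome& a, const Genome& b) {
        return makespan(a, cost, units) < makespan(b, cost, units);
    };
    // Initial population: random task-to-unit mappings.
    std::vector<Genome> pop(popSize, Genome(n));
    for (auto& g : pop)
        for (auto& u : g) u = unit(rng);
    for (int g = 0; g < generations; ++g) {
        std::vector<Genome> next = pop;   // parents survive until selection
        // Reproduction: single-point crossover plus a one-gene mutation.
        while (next.size() < static_cast<std::size_t>(2 * popSize)) {
            const Genome& a = pop[pick(rng)];
            const Genome& b = pop[pick(rng)];
            const int cut = gene(rng);
            Genome child(a.begin(), a.begin() + cut);
            child.insert(child.end(), b.begin() + cut, b.end());
            child[gene(rng)] = unit(rng); // random mutation
            next.push_back(std::move(child));
        }
        // Natural selection: only the popSize fittest genomes survive.
        std::sort(next.begin(), next.end(), fitter);
        next.resize(popSize);
        pop = std::move(next);
    }
    std::sort(pop.begin(), pop.end(), fitter);
    return pop.front();                   // the best genome found
}
```

The knobs mentioned below (population size, mutation rate, number of generations, selection threshold) appear here directly as parameters and distributions.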
Compared to the previous algorithms, the genetic algorithm is the most advanced
one, but it also requires the most resources and takes a long time to complete. It is very
flexible: developers have a lot of choices to influence the algorithm. One can modify the
randomness of the mixing or the pairing process, the randomness of the mutation and its
magnitude, the maximum size of the population, the threshold in the selection algorithm
and the number of generations.
It has pitfalls as well, for example the population can get stuck in a local optimum,
and creating the schedule has considerable resource and time requirements.
2.4 Compilers
When the computer is about to execute a program, it needs information it can
understand. The set of instructions the CPU understands is called the instruction set of
the CPU. A program is executable on a given CPU if there exists a mapping between the
language of the program and the instruction set of the CPU. Compilers act as a bridge
between the source and the destination set of symbols.
2.4.1 Compilation process
During the compilation process, the source code is transformed into machine
code executable by the CPU. This flow consists of several components. The following
picture illustrates a possible process for C++ code compilation.[19][20]
2.4.1 – C++ compilation process
Developers need some tools to make the development and maintenance process
easier. They can write comments into the source code, or they can split the source code
into smaller parts, for example for clarity. Usually the preprocessor is in charge of
supporting these features. It removes comments, reassembles the divided parts by
copying them together and removes unnecessary clutter from the code which is important
only for the developers. The preprocessor can also take care of specific language features
which are only syntactic sugar. The output is clean, compile-ready code.
The compiler transforms the input into the desired output language. In our case,
the compiler takes C++ code and compiles it into assembly language. It is important to
notice that the process needs to know some parameters of the final system which will run
the program. The main advantage of using higher level languages is that by introducing
one more abstraction level, we define a general mapping between higher level constructs
and their lower level implementation. The program can focus on its own purpose, while
compiler vendors take care of the compilation process for the different platforms.
At the end of the process, the assembler and the linker take the compiled files
and create an executable binary by setting the memory layout of the program, linking the
other libraries and transforming everything into binary code, which is now platform
dependent.
2.4.2 Compiler optimization
During compilation, modern compilers apply various optimizations.
There is a huge number of available optimization options for the most widely
used free C++ compiler.[21] Let us take a look at some examples, to see the choices a
developer has when optimizing the compilation process or the result.
-fno-inline – tells the compiler not to expand any functions inline
-fno-function-cse – makes each instruction that calls a constant function
contain the function’s address explicitly
-fsplit-wide-types – if a value occupies multiple registers, split it apart and
handle the parts independently
-fdevirtualize – try to convert calls to virtual functions to direct calls
It is good to be aware of the actual optimization mechanisms, as they can bring
serious improvements but can also cause serious problems. For example, compilers
usually eliminate variables which are never assigned in the code. However, in hardware
programming, if the value of a variable can be set from the hardware side as well,
eliminating the variable can cause unexpected behavior. To turn this feature off locally,
we can mark the variable as volatile.
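A minimal, runnable illustration of the qualifier follows; the hardware register is simulated here by an ordinary variable, while with a real memory-mapped register the qualifier would be essential.

```cpp
// Illustrative sketch: 'volatile' tells the compiler that every read and
// write of the variable is observable, so these accesses must not be
// cached or eliminated. The register is only simulated here, but each
// loop iteration still performs a real load and store.
int pollCount(volatile int& reg) {
    int reads = 0;
    while (reg != 0) {   // without volatile this load could be hoisted
        reg = reg - 1;   // simulates the hardware clearing the flag
        ++reads;
    }
    return reads;
}
```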
2.4.3 Higher level languages
At the dawn of programming language evolution, the creators of new languages
usually wanted to keep the possibility of reaching the hardware easily if needed – C and
C++ are very good examples of this. Later, as recurring tasks were discovered, people
started to create programming languages which fitted their needs better. Just as Assembly
is quite expressive in hardware level programming, functional programming languages
like Scala[22] or F#[23] are more expressive in their own fields. We can call these
languages higher level or domain specific languages[24].
Domain specific languages have multiple advantages. The most important one is
how natural a program can be in a well-designed domain specific language. Returning to
the previous example: in an Assembly program you mostly use registers and basic
mathematical operators, while in C++ you call a predefined method of a virtual console.
Behind the scenes, there is a serious chance that with a good compiler both programs
compile to similar byte code. No doubt we could write thousands of compilers, one for
every programming language, each compiling to byte code, but in fact compilers of
domain specific languages usually compile to the most similar language which already
has an existing compiler, rather than reinventing the wheel.
Another strong argument for using DSLs is the following: developers usually
create software based on someone’s requirements. A trader at a big bank presumably will
not be able to describe problems in an Assembly-like environment. Traders speak about
bonds, trades, yields and other financial terms, not about registers, jumps and functions.
If developers had a set of tools in which they could express their users’ needs and then
compile them into binary executables, the development process would be more effective
and less expensive. There are many promising results in the field of very high level
DSLs, but we are far from being able to solve every problem this way.[25]
While development time and cost can be reduced dramatically by using DSLs, the
effort put into developing the DSL itself is huge. At least one member of the developer
team has to be a domain specialist in order to identify the features that need to be
supported. The set of candidate features can be huge, and we have to filter wisely what to
support. It is also hard to decide how generic the new DSL should be. If it is too generic,
it may not be as useful as it could be. If it is too specific, we restrict ourselves to a very
small set of problems which we can solve with the DSL. Also, if the DSL will be used
only once, it is most likely not worth the effort to develop.
2.5 Stack programming
Standard programming languages abstract away the details of the underlying
infrastructure and execution; however, many systems use a stack-based execution model
under the hood. They operate upon one or more stacks.
Consider a recursive function call. When the function starts to execute, it sets up
an environment for itself, for example it allocates variables. This environment is
considered the context of the function. When the first recursive call is made, the function
executes again and sets up its own context. Under the hood, the context of the outer
function is pushed onto a stack; otherwise, after the recursive call returned, the variables
would be overwritten. If recursion happens at multiple levels, we just push more and
more contexts onto the stack. When a recursive call returns, we simply pop the latest
element from the stack and get back the previous execution context. This example is
specialized to recursive calls, but it is easy to see that every function call works the same
way.
The data structure supports two basic operations. The first one is the push
operator, which puts an element onto the top of the stack. The second one is the pop
operator, which takes the top-most element off the stack.
In stack programming, these operators can be considered the assembly level
building blocks. As we saw earlier, it can be worth the effort to add higher level constructs
to the language, which help developers to be more productive. Stack effect diagrams help
to define these extensions, as they describe how an operator changes the state of the
structure.
Stack effect diagrams (SED)[28] are usually given in the following standardized
form: opName ( before -- after ). Let us examine the pop method. Before executing the
operation, the stack contains 𝑁 elements; after the operation, it only has 𝑁 − 1 elements.
The SED of the pop operation is: pop ( a -- ).
Some widely used and well-known stack operators are introduced in the
following table.
Op. name    Stack Effect Diagram
dup         ( a -- a a )
drop        ( a -- )
swap        ( a b -- b a )
over        ( a b -- a b a )
rot         ( a b c -- b c a )
Stack operators
Other stack operators can easily be created. Let us create a multiply operator,
which takes the top two elements, multiplies them and pushes the result back onto the
stack. The SED for this operation is: mul ( a b -- c ).
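The operators above can be sketched on top of a std::vector used as a stack; the code is only illustrative, and the comments repeat the corresponding stack effect diagrams.

```cpp
#include <vector>

// Minimal stack with the classic operators; illustrative sketch only.
struct Stack {
    std::vector<double> s;
    void push(double v) { s.push_back(v); }
    double pop() { double v = s.back(); s.pop_back(); return v; }
    // dup ( a -- a a )
    void dup() { push(s.back()); }
    // drop ( a -- )
    void drop() { pop(); }
    // swap ( a b -- b a )
    void swap() { double b = pop(), a = pop(); push(b); push(a); }
    // over ( a b -- a b a )
    void over() { push(s[s.size() - 2]); }
    // rot ( a b c -- b c a )
    void rot() {
        double c = pop(), b = pop(), a = pop();
        push(b); push(c); push(a);
    }
    // mul ( a b -- c )
    void mul() { double b = pop(), a = pop(); push(a * b); }
};
```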
2.5.1 Reverse Polish Notation
Reverse Polish Notation (RPN)[29] switches the order of the arguments and the
operators in an expression. As it will soon turn out, RPN is essential to leverage stack
programming when solving mathematical problems.
Let us assume we want to calculate the following expression: (2 + 3) ∗ (4 + 5).
Humans usually solve the two additions first and then multiply the results. For a
computer, the given expression is a little difficult to solve: when the interpreter sees the
+ sign, the second operand is not yet available. In this simple case it would be possible
to implement the addition operator so that it works around this issue, but in general the
solution is RPN.
Rewriting the expression in RPN gives: 2 3 + 4 5 + ∗ . This way, when the
execution gets to an addition, the required data for the operator is already in place.
2.5.3.1 – Stack based execution of (2+3)*(4+5)
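A minimal interpreter sketch makes the stack based execution concrete (illustrative code; the tokens are assumed to be space-separated).

```cpp
#include <sstream>
#include <stack>
#include <string>

// Evaluates a space-separated RPN expression with a stack. Numbers are
// pushed; an operator pops its two operands and pushes the result, so the
// data is always in place by the time the operator executes.
double evalRPN(const std::string& expr) {
    std::stack<double> st;
    std::istringstream in(expr);
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            double b = st.top(); st.pop();   // second operand
            double a = st.top(); st.pop();   // first operand
            if (tok == "+") st.push(a + b);
            else if (tok == "-") st.push(a - b);
            else if (tok == "*") st.push(a * b);
            else st.push(a / b);
        } else {
            st.push(std::stod(tok));         // operand: push onto the stack
        }
    }
    return st.top();
}
```

Evaluating "2 3 + 4 5 + *" this way yields 45.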
RPN has a direct sibling in graph theory. Traversing a graph, if it is a binary
tree, has three simple variations. Starting from the root of the tree and applying one of the
following rules recursively, it is guaranteed that all the nodes are visited.
1. First visit the node itself, then visit the left child and then the right child (pre-order)
2. First visit the left child, then the right child, then the node itself (post-order)
3. First visit the left child, then visit the node itself, then visit the right child (in-order)
Consider the following graph, which describes the expression above.
2.5.3.2 – Graph representation of (2+3)*(4+5)
Applying the third rule and logging the content of the nodes when visiting them
results in the original expression. The second rule produces the RPN form of the
expression, while the first one produces the Polish Notation form. This mapping will be
very useful when modelling mathematical problems in this paper.
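The mapping can be sketched with a small expression tree; post-order traversal (the second rule) emits the RPN form and pre-order traversal (the first rule) the Polish Notation form. The code and its names are illustrative.

```cpp
#include <string>

// Expression tree node for the example (2+3)*(4+5).
struct TreeNode {
    std::string label;
    const TreeNode* left = nullptr;
    const TreeNode* right = nullptr;
};

void preOrder(const TreeNode* n, std::string& out) {
    if (!n) return;
    out += n->label + " ";      // rule 1: the node first, then the children
    preOrder(n->left, out);
    preOrder(n->right, out);
}

void postOrder(const TreeNode* n, std::string& out) {
    if (!n) return;
    postOrder(n->left, out);    // rule 2: the children first, then the node
    postOrder(n->right, out);
    out += n->label + " ";
}
```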
2.5.2 Forth
Forth[30] is an imperative stack-based computer programming language and
programming environment. It supports structured programming, reflection, concatenative
programming and extensibility. The environment can be used to execute stored programs
as well as an interactive shell. Forth is not as popular as other programming languages or
environments, but it is still used in some operating systems and space applications.
Forth operates with words (subroutines). Implementations usually have two
stacks. The first stack, named the data stack, is used to store local variables and
parameters. The second one is the function-call stack, which is called the linkage or
return stack.
Let us take a look at a self-defined word in Forth:
: FLOOR5 ( n -- n' ) DUP 6 < IF DROP 5 ELSE 1 - THEN ;
Here we define a new word (FLOOR5). As its stack effect diagram shows, it
manipulates the top element of the stack by replacing it with something else. The
remaining part of the expression is the body of the word. The DUP word duplicates the
top element, then 6 pushes a new element (6) onto the stack. < compares the top two
elements of the stack and replaces them with a true or false value. IF and THEN form the
usual conditional branch statement. In the true branch, the word replaces the original
element with the value 5, while in the false branch it pushes the value 1 onto the stack
and then subtracts it from the original value.
Execution of a Forth program is as simple as possible. The interpreter reads a line
from the user input and tries to parse it. When the interpreter finds a word, it looks up the
associated code in the dictionary and executes it. If the word cannot be found in the
dictionary, the interpreter assumes it is a number and tries to push it onto the stack. If
both attempts fail, execution of the code is aborted. When defining a new word, Forth
compiles the word and makes its name findable in the dictionary.
Stack based programming languages, especially Forth, inspired me when
designing the execution model used to run an executable graph.
3 Design and implementation
In this chapter I set up and analyze the domain problem of my thesis, and I design
and implement a solution for it. I explain the solution step by step, starting with the
problem definition, and provide usage and implementation details for each step of the
process.
3.1 Business problem
The financial sector lives at a high pace. Markets and regulations are changing
rapidly, and the financial crisis has only boosted this effect. Regulatory presence is
common, and supervisions happen every other day, especially since the financial
crisis.[31] In such an environment, performance, reliability and maintainability are
essential parts of every system developed.
3.1.1 Morgan Stanley
Morgan Stanley[1] is one of the biggest financial services corporations in the
world. It operates in more than 40 countries, with more than 1300 offices and 60000
employees. The history of Morgan Stanley[2] dates back to 1935, when some J.P. Morgan
employees, namely Henry S. Morgan and Harold Stanley, decided to start a new firm.
Morgan Stanley splits its business into the following categories:
Wealth Management
Investment Banking & Capital Markets
Sales & Trading
Research
Investment Management
Although it is a financial company, it has an outstanding IT department which
supports the everyday operation. Even if a malfunctioning system inside the company
cannot harm humans the way faulty airplane software can, the IT department has to
consider other important factors. For example, a bug in a piece of software on the trading
floor which prevents a trader from doing business causes a profit loss to the firm. The
actual value of the loss depends on various factors, but in general we speak about millions
of dollars.
3.1.1 – Morgan Stanley
3.1.1.1 Team work
It is no secret that Morgan Stanley seeks the best talents in every area where it
operates. The firm actively participates in education, announcing internship programs
and fresh graduate programs. For example, in Hungary the office mainly acts as a back
office – they do IT development and have some accounting related tasks there – but in
the last couple of years they have extended the number of their employees from 100 to
over 1000.
Knowing the size of the firm, the complexity of its systems and the fact that it is
committed to team-working, Morgan Stanley also tries to prepare future candidates to
work in a team. As this paper is supported and inspired by the firm, during my studies
and work I also participated in team-work.
In the beginning, our main method was to discuss the basic ideas, problems and
principles on site with our consultants and to implement the solutions separately. During
the first semester, we identified a lot of interesting topics and distributed them among
ourselves. After this point, the consultations were similar to scrum meetings, where we
presented our progress, ideas and problems, and got directions if we were stuck.
3.1.2 Pricing
In the financial world, pricing is one of the most important notions.[32] As the
proverb says: “Buy cheap, sell high”. The process of defining the value of cheap and high
is the pricing process. We can price anything – stocks, bonds, options, gold, silver, cars,
flats, vacations. Markets even price the possible default of countries or changes in laws.
It also makes sense to create prices for different scenarios – if something is going to
happen, the price is 𝑋, otherwise 𝑌.
The accuracy of pricing is essential. The yield of a trade is usually the bid-ask
spread, which is the difference between the bid (buy) price and the ask (sell) price. In
trading, another important aspect is the speed of the pricing. As a seller’s or buyer’s
intention is available to every market participant, a quicker response to a request improves
the odds of making a good deal. The speed of pricing becomes even more important in
automatic trading, where computers trade against each other.
Pricing is also a technical challenge. There is a huge number of assets available
to sell or buy on the market, and most of them are correlated with other assets. The data
is live, which means it ticks every second or even more frequently.
The problem I try to solve in this paper is to model the connections between the
assets and to develop an efficient and effective pricing implementation.
3.2 Modelling the problem
The problem to solve is defined in plain English. To be able to analyze and solve
it, I have to model the problem using computer science terms.
Pricing algorithms are usually considered business secrets, so it is almost
impossible to find any information about how big companies have implemented their
own versions. Even Morgan Stanley could not provide me anything about their
implementation. Probably every big company uses the same algorithm tailored to its own
directives. In such a competitive environment this conduct is absolutely understandable,
but it also makes it harder to position my results.
3.2.1 Graph representation
When representing networks, dependencies and connections, a graph is the
obvious choice. As we want to model the pricing of assets and the connections between
them, it is easy to identify a mapping: the nodes of the graph will be the assets we want
to price, and the edges will represent the connections. The direction of an edge carries
information on its own, therefore the edges are directed.
3.2.1.1 Challenges and constraints
Connections between assets can be really complex. Rather than allowing the
expressions themselves to be complex, I have decided to force users to build up complex
expressions from simple bricks like addition, multiplication and some other basic
mathematical operations. On the one hand, this makes the input graph bigger and harder
to create; on the other hand, the user can reuse partial results, preventing the recalculation
of some common values.
As mentioned before, pricing can get extremely complex. For example, it is quite
usual that one asset influences another asset, but that asset also influences the first one,
directly or indirectly. In this case we have a cycle in the graph, which rules out a number
of otherwise usable methods. To sort this problem out, in the beginning we are going to
allow only DAGs as input. Later on, the solution can be extended to support cycles as
well, since solutions exist to slice up a graph containing cycles into subgraphs which are
DAGs, and to connect them again later without messing up the results.
3.2.1.2 Implementation details
To represent a graph on a computer, we can store the list of the edges, store the
adjacency matrix of the nodes, or use other storage mechanisms. In the first case, we store
a list for each node, in which we store every edge that contains the given node. In the
second case, we store an 𝑁 ∗ 𝑁 matrix, where 𝑁 is the number of the nodes. Every
element of the matrix represents one possible edge between two nodes. Both approaches
have their own advantages, but in our case storing the adjacency matrix is not efficient.
Our domain problem implies that the graph is a DAG: if we use an adjacency matrix, we
have to store an entry for each possible edge even if we never create it. Due to the nature
of the domain problem, we do not have all the edges, which means unnecessary memory
consumption.
Thinking ahead, a node itself has to know which edges are directed to it. From the
implementation point of view, we get some benefits if the node also knows which edges
are directed out from it. Keeping this in mind, I have decided to implement a hybrid
solution which tries to unify the advantages of these concepts. My implementation creates
one list for the nodes and one for the edges. These lists help to find any element of the
graph quickly. A node itself contains two lists – one for the incoming and one for the
outgoing edges. Finally, an edge contains the two nodes it connects, distinguishing source
and target.
3.2.1.2 – Graph structure
Using this structure, navigating the graph is very quick, easy and efficient. On
the other hand, the structure complicates operations like adding or deleting nodes or
edges. For example, when we delete a node, we have to maintain multiple vectors, which
is clearly an overhead. Weighing the advantages against the drawbacks, the structure is
quite sufficient, as most of the time the graph will be executed rather than changed.
The architecture is designed to be as lightweight as possible. The implementation
uses pointers everywhere it is possible. More specifically, we use the smart pointer types
of the C++11 standard library – shared, weak and unique pointers.
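The described hybrid structure can be sketched as follows; the class and member names are illustrative assumptions, not the exact thesis classes. The graph owns its elements through shared pointers, while the cross references between nodes and edges are weak pointers, avoiding ownership cycles.

```cpp
#include <memory>
#include <vector>

struct Edge;

struct Node {
    std::vector<std::weak_ptr<Edge>> incoming;  // edges directed to the node
    std::vector<std::weak_ptr<Edge>> outgoing;  // edges directed out of it
};

struct Edge {
    std::weak_ptr<Node> from;  // source node
    std::weak_ptr<Node> to;    // target node
};

struct Graph {
    // The graph owns every element; the lists allow quick lookup.
    std::vector<std::shared_ptr<Node>> nodes;
    std::vector<std::shared_ptr<Edge>> edges;

    std::shared_ptr<Node> addNode() {
        nodes.push_back(std::make_shared<Node>());
        return nodes.back();
    }

    std::shared_ptr<Edge> addEdge(const std::shared_ptr<Node>& from,
                                  const std::shared_ptr<Node>& to) {
        auto e = std::make_shared<Edge>();
        e->from = from;
        e->to = to;
        from->outgoing.push_back(e);   // register the edge on both endpoints
        to->incoming.push_back(e);
        edges.push_back(e);
        return e;
    }
};
```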
3.3 Building up a graph
Using a simple example, all the details of the model and the graph building process
are illustrated.
3.3.1 Quadratic equation
For simplicity, we aim to solve the quadratic equation in our example:
𝑦1,2 = (−𝑏 ± √(𝑏² − 4𝑎𝑐)) / 2𝑎
We can consider 𝑎, 𝑏, 𝑐 as inputs and 𝑦1, 𝑦2 as outputs. One execution of the graph
is going to solve the equation for us.
3.3.2 IProcessable interface
In our model every element is capable of running; this is true for the nodes and
for the edges as well. This observation leads us to leverage the object oriented paradigm
and create a base class for all the runnable elements of the graph.
In C++ we create interfaces as classes, since the language does not contain an
interface keyword. The IProcessable base class has two main functions: to store the value
of a node and to store the function which we fire during the execution. In this model, the
function we execute is the process( ) function. To ensure that every derived class
implements its own function to run, I marked the process function as abstract.
3.2.2 – Basic hierarchy of the graph elements
The signature of the function is empty, because the arguments of the function will
be the predecessors of the element. This means flexibility: instead of restricting the
signature, we leave argument handling to the function itself. Although variable length
parameter lists are available in C++, it would be much harder to correctly design the
signature of the process function and to ensure every possible combination than to check
the predecessors at the beginning of each function and decide whether they can be
considered valid input.
The Value property can be used to store information which is required by the
element. Let us say we want to create edges which not only transport the value from one
node to another, but also multiply the carried value. If we want an edge which doubles
the value and an edge which triples it, we could create two separate classes, but then we
would duplicate the code. If we create just one class and store the multiplier as a variable,
we can easily modify the multiplication factor for each edge without creating unnecessary
complexity in the code base.
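A sketch of the hierarchy illustrates the idea; the real classes in the thesis differ, and the MultiplierEdge name is my own. The abstract process( ) function forces every element to define how it runs, and the multiplier variable lets one class cover doubling, tripling, and so on.

```cpp
// Illustrative base class: every graph element can run via process().
class IProcessable {
public:
    double Value = 0.0;           // information required by the element
    virtual void process() = 0;   // abstract: derived classes must implement it
    virtual ~IProcessable() = default;
};

// A constant element: Value is predefined, nothing to compute.
class ConstantNode : public IProcessable {
public:
    explicit ConstantNode(double v) { Value = v; }
    void process() override {}
};

// One class covers all multiplying edges; the factor is stored as a
// variable instead of duplicating a class per factor.
class MultiplierEdge : public IProcessable {
public:
    IProcessable* from;
    double factor;                // 2.0 doubles, 3.0 triples, ...
    MultiplierEdge(IProcessable* f, double m) : from(f), factor(m) {}
    void process() override { Value = from->Value * factor; }
};
```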
The Node and Edge classes also have default implementations and should be
specialized further. Their properties provide the functionalities described earlier –
namely, a node stores its predecessor and successor edges in two separate lists, and an
edge stores the two nodes it connects. This structure stores the direction of the edge: the
From property is the source and the To property is the destination node.
3.3.3 Node types
Node is a very generic class: it only knows its predecessors and successors. I have
defined some node types with which I can solve mathematical problems. The type system
is easily extensible, so further types can be created quickly and easily.
Addition node
This type performs basic mathematical addition. It takes the values of the
predecessors and sums them up, then notifies the successors about completion. The Value
property is not used.
Constant node
This type acts as a constant input. It never has any predecessors and it is always
ready to process. It uses the Value property to store the predefined value. In its process
function it does nothing but notify the successors about its value.
Division node
This type realizes the division operation. It waits for two predecessors: the first
one is the dividend and the second one is the divisor. The Value property is not used.
Multiplication node
This type performs basic mathematical multiplication; the implementation is very
similar to the addition node.
Square root node
This type calculates the square root of a number. It takes one predecessor. The
Value property is not used.
3.3.4 Edge types
The Edge class is also a generic base class. It stores the source and the destination
nodes. The implemented function is very simple: it takes the result value from the source
node, stores it and notifies the destination node.
3.3.5 Graph object
Nodes and edges are self-descriptive, but it makes sense to encapsulate the
coherent parts into a graph object.
3.3.5 – Graph structure
The class itself tries to be as simple as possible. To support the previously
described functionalities, the nodes and the edges are stored in separate lists. At this point
we only need two public methods, namely one to add a node and one to add an edge to
the graph. Each function takes a node or an edge as its argument. I have decided to
decouple the creation of nodes and edges from the graph. This separation helps not to
pollute the graph object with the details of how to create a node or an edge. The creation
of elements can be done via a factory class, which can take care of creating the different
types of nodes and edges.
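Such a factory can be sketched as follows; the class names and the string keys are assumptions for illustration, not the thesis implementation.

```cpp
#include <memory>
#include <string>

// Minimal node hierarchy for the sketch.
struct Node { virtual ~Node() = default; };
struct AdditionNode : Node {};
struct ConstantNode : Node {
    double value;
    explicit ConstantNode(double v) : value(v) {}
};

// The factory keeps the creation details out of the graph object: the
// graph only receives ready-made elements through its add methods.
struct NodeFactory {
    static std::unique_ptr<Node> create(const std::string& type,
                                        double value = 0.0) {
        if (type == "addition") return std::unique_ptr<Node>(new AdditionNode());
        if (type == "constant") return std::unique_ptr<Node>(new ConstantNode(value));
        return nullptr;   // unknown type
    }
};
```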
3.3.6 Example graph
With the acquired knowledge, we can now set up our example graph. The first
step is to rearrange the starting equation to express the two results separately. Doing that,
we get the following two expressions for the results:
𝑦1 = (−𝑏 + √(𝑏² − 4𝑎𝑐)) / 2𝑎
𝑦2 = (−𝑏 − √(𝑏² − 4𝑎𝑐)) / 2𝑎
To be able to express the RPN form of the expressions, rewrite the formulas and
then create the RPN forms:
𝑦1 = (−𝑏 + (𝑏² − 4𝑎𝑐)^0.5) / 2𝑎
𝑦2 = (−𝑏 − (𝑏² − 4𝑎𝑐)^0.5) / 2𝑎
𝑦1 = (−𝑏) 𝑏² 4𝑎𝑐 − √ + 2 𝑎 ∗ /
𝑦2 = (−𝑏) 𝑏² 4𝑎𝑐 − √ − 2 𝑎 ∗ /
Using the RPN forms of the expressions, it is easy to create the graph of the
expression. During the build process, if we encounter a subexpression which we need,
which has already been calculated and which has not changed since then, we can reuse it
to prevent the unnecessary recalculation of values. Using these observations and
expressions, the following graph can be constructed.
3.3.6 – Graph of the quadratic equation
In the picture, each color has a different meaning. Blue nodes are input nodes.
Pink nodes are constant nodes with predefined values (the nodes with the value of −1 are
identical, displayed twice only in order to simplify the picture). Yellow nodes are
operators of type addition, multiplication, division and square root. Output nodes are
marked with green.
The C++ code for creating this graph can be found in the Appendix.
3.4 Simple execution models
Once we have the model set up, the next step is to run it and acquire the results.
Before designing an advanced model, let us take a look at how easily the notion of
“running” can be expressed in our current model.
3.4.1 Recursion based execution
As we discussed earlier, there is a simple mapping between the RPN form of an
expression and its graph representation. We also saw an example of how to solve an
expression using its RPN form. In this section, we implement a solution using the graph
representation.
3.4.1.1 Recursion
Using recursion[33], we can easily iterate through all the elements of the graph
and calculate the final result. To start the execution, we have to call the process function
of an output node, as the recursion works backwards.
To process a node, we simply take the results of all the predecessor nodes, execute
the node’s own calculation and return. Getting the result of a predecessor element causes
the recursion.
Every recursive algorithm needs a stop condition. In our case it is hit on a constant
node – when we reach one, we do not need to call any further process functions of
predecessors; we can return with the value of the node.
3.4.1.2 Implementation details
Each element of the graph is derived from the IProcessable base class, hence it
has a process( ) function. To support recursion based execution, I had to ensure that the
process function of any IProcessable causes calls to the process functions of each of its
predecessors.
The underlying implementation of the process function calls the GetResultValue( )
function of its predecessors. It basically means that when calling the GetResultValue
function, the element itself does not know whether the call will cause a recursion or will
simply return the result value. This fact comes in handy when dealing with multiple
outputs. The GetResultValue function is implemented in a way that it only causes
recursive calls when it is called for the first time. When we have multiple outputs, they
most likely use common subgraphs, so rather than recalculating the whole subgraph
again, we can reuse the intermediate results.
To get the results, we have to call the GetResultValue( ) functions of the output
nodes. I consider every node without a successor an output.
3.4.1.3 Experiences
Implementing this version pointed out that using higher level languages makes
development much easier, as notions like "process" can be expressed naturally via well
designed objects.
Since this version is the first one I implemented, it is hard to judge its
performance, but I can make some predictions in theory.
The first barrier can be memory limitation. This type of recursion, where the
original calling context of each recursive call must be kept, can quickly lead to out of
memory exceptions, especially for big graphs. Another issue could be the speed of the
execution. Knowing that we have a complex data structure for storing a graph as well as
a complex class hierarchy, I assumed that wiser execution algorithms could bring an
outstanding performance improvement compared to the recursion based execution
model.
3.4.2 BFS based execution
From a certain point of view, BFS-based execution is the opposite of the recursion
based model. While recursion means we automatically know which element is
ready to be processed, if we go the other way around and start from the input nodes, we
have to monitor which elements are ready to be processed.
3.4.2.1 Implementation details
Remember that we are working on DAGs, which means a BFS traversal yields
a valid order of execution if we apply a small modification to the algorithm.
In BFS, after visiting a node we add all of its unvisited neighbors to the ToProcess list.
In our case, we need to ensure that an element is added to the list only if all of its
predecessors are ready. To support this validation, I have implemented two functions.
The isReadyForProcess( ) function checks that every prerequisite of the element is ready.
The purpose of the registerParameter( ) function is to help determine the actual state of an
element. When an element is processed, it notifies its successors about this fact by calling
the registerParameter function of each successor. Internally, the function maintains a
counter and increments it on every call. The checking function compares this counter with
the number of predecessors; if they are equal, the element is ready to be processed.