University of Trieste
DEPARTMENT OF ENGINEERING AND ARCHITECTURE
Master's degree in Computer and Electronic Engineering
Workflow management solutions:
the ESA Euclid case study
Candidate
Marco POTOK
Matricola IN2000004
Thesis advisor
Prof. Francesco FABRIS
Thesis co-advisor
Dott. Erik ROMELLI
Academic Year 2017-2018
Contents

Introduction

1 Workflow Management
  1.1 The Big Data Era
      1.1.1 The Rise of e-Science
  1.2 Workflow Manager
      1.2.1 Data Pipeline
  1.3 Towards Scientific Workflow Managers

2 Euclid Mission
  2.1 Mission objective
      2.1.1 Spacecraft and Instrumentation
  2.2 Ground Segment Organization
  2.3 SDC-IT

3 Software Development
  3.1 Development Environment
  3.2 Elements
  3.3 Image Processing
  3.4 Multi-Software Catalog Extractor
      3.4.1 Developing and Versioning Methods
      3.4.2 Detection
      3.4.3 Deblending
      3.4.4 Photometry
  3.5 Utils Module
      3.5.1 The Masking Task
      3.5.2 FITS Images
      3.5.3 SkySim
  3.6 Pipeline Project

4 External Workflow Managers
  4.1 Luigi
      4.1.1 Pipeline Definition
      4.1.2 Integration with Elements
  4.2 Airflow
      4.2.1 DAG Definition
      4.2.2 Pipelining Elements Projects

5 Comparison
  5.1 Metrics
  5.2 Execution Time
  5.3 Memory Usage
      5.3.1 Top Based Profiling Tool
  5.4 CPU Usage
  5.5 Error Handling
  5.6 Usability
      5.6.1 EPR Installation and Configuration
      5.6.2 Luigi Installation and Configuration
      5.6.3 Airflow Installation and Configuration
  5.7 Distributed Computing
  5.8 Workflow Visualization
  5.9 Framework Integration
  5.10 Triggering System
  5.11 Logging

Conclusions

Acronyms
Introduction
Space missions, and scientific research in general, need to handle an ever-
increasing amount of data, recently stepping de facto into the big data terri-
tory. An infrastructure capable of managing this volume of data is required,
including a proper workflow manager. Adopting an external tool, and thus
avoiding the need to develop in-house software, can represent a time-saving
choice that allows more resources to be diverted towards science-related activities.
Luigi and Airflow are two workflow managers developed in an industrial
context, respectively by Spotify and Airbnb. They are software tools made
available to the public by means of an open-source license which, as a re-
sult, has allowed them to be constantly maintained and improved by a large
community.
The aim of this thesis is to test the feasibility of using Luigi or Air-
flow as workflow manager within a large scientific project such as the Euclid
space mission, which is the case study of this work. The test consists in a
comparison between the two workflow managers and the one currently used
by the Euclid Consortium developers: the Euclid Pipeline Runner (EPR).
The comparison is of interest to the scientific community because it provides
an overview of the tools already available, of whether they fit the project's
requirements and of whether they offer the necessary features. It is also a
valuable opportunity to estimate the overall performance and characteristics
a workflow manager can provide, supporting the decision to add some
components to the in-house design or even to adopt the external tool entirely.
In order to capture the behavior of these tools during their operation, it is
necessary to define a pipeline. For this reason, the first step of this work was
to become familiar with the Euclid development environment, within which
four software projects were later created following the constraints of the
mission framework. Afterwards, an Euclid-like scientific pipeline was built by
means of the four projects, obtaining a first pipeline running on EPR. Two
other pipelines were then built, one for Luigi and one for Airflow, in order to obtain
all the necessary elements for the comparison.
The comparison between Luigi, Airflow and EPR was performed by means
of ten metrics, chosen based on the main needs expressed by the developers
of the Euclid environment:
• Execution time
• RAM memory usage
• CPU usage
• Error handling
• Usability and configuration complexity
• Distributed computing
• Workflow visualization
• Integration in a framework
• Triggering system
• Logging quality
The results obtained seem encouraging and suggest that the external work-
flow managers can actually be used within a space mission environment,
bringing some performance improvement and offering several additional fea-
tures compared to EPR. This work is structured as follows. Chapter 1
introduces workflow managers, explaining the need to use them to manage
large amounts of data and the role they play in the scientific field. Chapter 2
gives an overview of the Euclid mission, covering its scientific objectives and
the organizational structure behind the part of the mission that handles the
scientific data. Chapter 3 is dedicated to describing the development
environment of the mission and the development phases of the software
produced for this work. Chapter 4 introduces Luigi and Airflow, the external
workflow managers, along with their main features and characteristics.
Finally, Chapter 5 focuses on the description of the comparison metrics and
the results obtained.
Chapter 1
Workflow Management
Data is an extremely valuable new resource and is collected at an ever-increasing
pace, but its true value lies in the information it contains. For this reason, a
wide range of new tools for big data processing has been developed in recent
years. A subset of these tools, called workflow managers, is in charge of
coordinating the data processing steps. Automation and fault tolerance are the
main required features, and both have to be implemented in a distributed
system. The scientific community, too, is increasingly adopting large-volume
data acquisition [1]. This also applies to astronomical research, the field in
which this work has been carried out.

The rapid evolution of computer technology and processing power has boosted
the design of more and more complex surveys and simulations. For instance,
it gradually became possible to add extra dimensions to the collected data,
such as time or a third spatial dimension [2]. This extra dimension can be
obtained by repeating observations of the same object in order to spot transient
phenomena, or it can be a 3D scan, mapping the sky along the depth axis, as
Euclid will be able to do.

As has often happened in the history of modern astronomy, in this decade we
are witnessing an explosion in the volume of the datasets used for astronomy:
the amount of data has increased by an order of magnitude compared to just a
few years ago. Statistical analysis, applied to the correlation of large data
volumes, is now essential to make new discoveries that would be impossible to
obtain with legacy methods, which often did not even rely on an information
system.
1.1 The Big Data Era
With the increasing demand for more and more information to improve
the accuracy of scientific research, the world of astronomy has had to face data
management problems that the industrial world has already begun to solve.
This new type of data is part of the phenomenon called big data, although
a precise definition of this entity has not yet been established. Big data are
identified through their characteristics, among which five are widely accepted:
Volume, Velocity, Variety, Veracity and Value, the 5 Vs of big data.
• Volume: refers to the amount of data collected, which can no longer be
stored in a single node: a complete system must be set up for the
correct and efficient management of the data. Furthermore, a structure
in the data is no longer guaranteed, and relational databases (which store
data in tables made of columns and rows, where each column holds one
type of data and rows are related through a key, i.e. one or more columns
that uniquely identify a row within the table) are no longer the best choice,
resulting in the rise of database management systems that are no longer
relational but imitate a hash table structure.
• Variety: represents the lack of homogeneity of the collected data,
which come from different sources and are unstructured or semi-structured.
These types of data need a more intelligent processing chain that can adapt
to each case.
• Velocity: refers to the volume generated per unit of time and also the
rate at which data must then be processed and made available. With-
out an appropriate distributed infrastructure it would be impossible to
carry out such a difficult task.
• Veracity: represents the guarantee that the data are consistent, reli-
able and authentic.
• Value: refers to the added value that could not be obtained without
data having the previous characteristics. More data implies more accurate
analyses and more reliable results.
1.1.1 The Rise of e-Science
Whatever the final purpose of the research, from exploring the extremely
small to mapping the vastness of the visible Universe, the aspects of data
management, their analysis and their distribution, are increasingly predom-
inant within scientific experimentation. The science that produces large
amounts of data effectively possessing the characteristics of big data is
called e-Science, a term coined in the UK in 1999 by John Taylor, then
director general of the Office of Science and Technology, who faithfully
anticipated the direction of technological development that the scientific
field would take from that moment on. e-Science is therefore the technological
face of modern science, which produces and consumes large amounts of
data and, for this reason, must be supported by an adequate infrastructure
for storing, distributing and accessing the collected data. This infrastructure
is often called Scientific Data e-Infrastructure (SDI). Meanwhile, the
term Cyberinfrastructure was coined in the United States to describe
the same information and infrastructural needs that e-Science implies [3, 4].
This shows how the phenomenon that developed at the end of the 1990s and
early 2000s was in fact involving a large part of the scientific community. From
that moment on, the demand for systems with ever-improving performance,
capable of handling an ever-increasing data volume, has become progressively
more important. Adopting the big data paradigm for science was made possible
by the change in mentality started with e-Science, which matured into better
scientific instruments capable of collecting huge volumes of data and into an
SDI infrastructure capable of distributing them appropriately [3].
1.2 Workflow Manager
New software tools, new architectures and new programming paradigms
have been developed for management and processing of large amounts of
data produced in response to new scientific and industrial needs. A subset of
the tools developed in this ecosystem are the workflow managers, employed
to build systems capable of working in a distributed environment and robust
enough to carry out their task without causing a complete stop of the system
in case of partial failure. A workflow manager is a software tool that helps
to define the dependencies among a set of software modules or tasks. We can
identify two main jobs the workflow manager has to accomplish: dependency
resolution and task scheduling. The dependency resolution is essential to
schedule the tasks in the right order and make sure every module is run if
and only if all its dependencies are completed successfully. The scheduler
has to decide when each task should be executed in order to optimize the
use of the available resources.
1.2.1 Data Pipeline
A data pipeline is a concatenation of tasks where, generally speaking, the
output of one module becomes the input of the next. In this
way, the modules can be developed independently and in a modular fashion,
where the only requirement to meet is the interface defined between the two.
This interface can be as simple as the information about the file type and
its location in the file system. Two distinct approaches can be identified in
defining the pipelines and their workflows: the business and the scientific
one.
• A typical business workflow has features such as efficiency in execu-
tion, independence between different workflows and human involvement
in the process. Another characteristic of this approach is that a pipeline
is defined through a control-flow, i.e. the dependencies between tasks
are based on their status. For example, if a task X is dependent on task
Y, X is not executed until Y is in the completed state. Finally, the data
are typically not streams: pipeline execution is not continuous but
triggered on demand, when there is a need.
• A scientific workflow has the task of producing outputs that are
the result of experimentation; the instances of different workflows are
to some degree correlated with each other, and automation is exploited
as much as possible. Although automation is important, so is the
possibility of accessing intermediate results that an expert can validate
on the fly. The pipeline is focused on the data-flow rather than on the
control-flow, i.e. a task is not executed until its input is available. This
approach is therefore called data-driven. The data flow is described by a
Directed Acyclic Graph (DAG), where each node represents a task and
the graph's topological ordering defines the dependencies (a minimal
sketch of this model follows the list). Data is often a continuous flow,
and all the tasks in a pipeline usually work on different data at the same time.
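The data-driven model can be made concrete with a short Python sketch (purely illustrative; the task names are hypothetical and not taken from any real pipeline): a pipeline is a DAG expressed as a dependency mapping, and a topological ordering of that DAG is a valid execution order, which is essentially what a workflow manager's dependency resolution computes.

    from collections import deque

    # Each task maps to the set of tasks whose outputs it consumes.
    dag = {
        "detection":  set(),
        "masking":    {"detection"},
        "deblending": {"masking"},
        "photometry": {"masking", "deblending"},
    }

    def topological_order(dependencies):
        """Kahn's algorithm: emit a task only when all its dependencies are done."""
        pending = {task: set(deps) for task, deps in dependencies.items()}
        ready = deque(task for task, deps in pending.items() if not deps)
        order = []
        while ready:
            task = ready.popleft()
            order.append(task)
            for other, deps in pending.items():
                if task in deps:
                    deps.discard(task)
                    if not deps:
                        ready.append(other)
        if len(order) != len(pending):
            raise ValueError("cycle detected: the workflow is not a DAG")
        return order

    print(topological_order(dag))
    # ['detection', 'masking', 'deblending', 'photometry']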
1.3 Towards Scientific Workflow Managers
For the reasons mentioned previously in this chapter, which include the
production of big data in the scientific field, there has been an increase in
the use of workflow managers in science, which is no longer conducted by the
individual but has become a joint effort of many organizations and national
institutes. A scientific workflow manager must be the means that allows a
research team to obtain its results, and it must therefore be as transparent as
possible to the user. Among its objectives we can find [4, 5]:
• Description of complex scientific procedures; hence workflow reuse,
along with modularity in task construction, becomes important.
• Automatic data processing according to the desired algorithms, and
possibility to inspect intermediate results.
• Provide high performance computation capabilities with the help of an
appropriate infrastructure.
• Reduce the amount of time researchers spend working on the tools,
allowing them to spend more time conducting research.
• Decrease machine time, optimizing software execution instead of in-
creasing physical resources.
To move towards these objectives, a huge number of new tools has been
developed within the scientific field, including programming languages and
whole systems. For example, more than a hundred different custom pipeline
managers have appeared in a short time, making it difficult to port systems
and code, leading to a lack of result reproducibility [6]. This situation does
not help scientific discoveries, which sometimes lose the ability to be tested
by independent parties. A solution could be to identify a standard to
adopt, or to make the tools developed free and easy to use. A step in the
right direction, perhaps, could be to use more generic systems able to satisfy
specific needs. One sector that needs generic tools is the industrial one, which
has in fact produced highly efficient and easy to use systems. These tools
are often not adopted by the scientific community that relies on products
developed in-house. The result is that technologies with a similar scope are
created independently by science and industry, thus missing an opportunity
to share their capabilities and resources.
In this work we want to take the European Space Agency (ESA) Eu-
clid Space Mission [7] and its workflow manager as a case study and verify,
through an analysis of ten metrics, whether two popular tools for pipeline
management developed in the industrial world can be adopted within a space
mission. Contrary to what happens with scientific workflow managers, where
the purpose is to solve a specific instance of a certain problem, in the field
of IT companies and startups, the tendency is to take a path towards the
design of software that is as generic as possible, flexible and adaptable to
different scenarios.
Chapter 2
Euclid Mission
ESA's Euclid mission is a medium-class (M-class) mission, part of
the Cosmic Vision program. Its main objective is to gather new insights
about the true nature of dark matter and dark energy. In this chapter we
will explore the main mission characteristics and the basic science behind the
measurement of the universe.
2.1 Mission objective
Thanks to the ESA Planck mission (http://sci.esa.int/planck), researchers
confirmed that many questions about the nature of our Universe are currently
open. As shown in fig. 2.1, only 4.9% of the matter we are surrounded by is
baryonic matter, i.e. what is commonly addressed as ordinary matter, such as
all the atoms made of protons, neutrons and electrons (although the electron
is a lepton, in astronomy it is included in the baryonic matter because its
mass is negligible with respect to that of the proton and the neutron).

Figure 2.1: Estimated composition of our Universe. Source:
http://sci.esa.int/planck.

Another 26.8% is Dark Matter (DM), a component with high mass density
that interacts with itself and other matter only gravitationally. Moreover, it
does not interact with the
electromagnetic force, making it transparent to the electromagnetic spectrum
and really hard to spot. The remaining 68.3% is what is called Dark Energy
(DE), owing to its unknown nature [8]. This component is linked with the
accelerated expansion of the Universe, but currently there is no direct evidence
of its actual existence. Different models have been developed to explain the
nature of the effects observed and currently attributed to dark energy. In
order to gather more data that could bring new insights into the nature
of dark matter and dark energy, the chosen approach is to observe and
analyze two cosmological probes: Weak Lensing (WL) and Baryonic
Acoustic Oscillations (BAO). The WL effect is caused by a mass concentration
that deflects the path of light traveling towards the observer. This effect
is detectable only by measuring statistical and morphological properties
of a large number of light sources. Euclid is expected to image about 1.5
billion galaxies, capturing useful data to study the correlation between their
shapes and mapping with high precision the expansion and growth history of
the Universe [7]. A BAO is a density variation in the baryonic matter caused
by a pressure wave formed in the primordial plasma of the universe.
Measuring the BAO of the same sources at different redshifts makes it possible
to estimate the expansion rate of the universe.
2.1.1 Spacecraft and Instrumentation
The Euclid space telescope has a 1.2 m Korsch architecture with a 24.5 m
focal length. The spacecraft carries two main instruments that will generate all
the data for the mission. They are both electromagnetic spectrum sensors,
one specialized for photometry in the visible wavelengths and the other for
infrared spectroscopy.
• VIS: the VISible instrument will be used to acquire images in the visible
range of the electromagnetic spectrum (550-900 nm). It is made of
36 CCDs, each counting 4069x4132 pixels (see fig. 2.3a). The weak
lensing effect will be measured through the data obtained with
this instrument.
• NISP: the Near Infrared Spectrometer and Photometer has two components:
the near infrared spectrometer, operating in the 1100-2000 nm range, and
the near infrared imaging photometer, working in the Y (920-1146 nm),
J (1146-1372 nm) and H (1372-2000 nm) bands. It is composed of 16
detectors, each counting 2040x2040 pixels (see fig. 2.3b). The main purpose
of this instrument is to measure the BAO at different redshifts.
The two instruments will have about the same field of view, 0.54 and 0.53
deg² respectively, but VIS will offer a much greater resolution. Euclid has
the requirement to perform a wide survey of at least 15,000 deg² of sky,
possibly reaching 20,000 deg². In combination with this, two further deep
surveys of 20 deg² each are planned [7].
Figure 2.2: Schematic figure of the Thales Alenia Space’s concept of the
Euclid spacecraft. Source: http://sci.esa.int/euclid.
(a) One of the 36 CCDs that will compose the VIS instrument. Source:
http://sci.esa.int/euclid. Copyright: e2v.
(b) One of the 16 CCDs that will compose the NISP instrument. Source:
http://sci.esa.int/euclid. Copyright: CPPM.
Figure 2.3: Actual flight hardware for the Euclid spacecraft.
2.2 Ground Segment Organization
As in almost every space mission, the Euclid mission has its own
space system made of three segments:
• Space Segment: which includes the spacecraft along with the com-
munication system.
• Launch Segment: which is used to transport space segment elements
to space.
• Ground Segment: which is in charge of spacecraft operations man-
agement and payload data distribution and analysis.
Inside the Euclid ground segment we can distinguish two parts: the Opera-
tions Ground Segment (OGS) and the Science Ground Segment (SGS), the
latter managed in collaboration with ESA and the Euclid Mission Consortium
(EMC, https://www.euclid-ec.org). Within the SGS, we can further identify
three components:
• Science Operation Center (SOC): which is in charge of the space-
craft management and the execution of planned surveys.
• Instrument Operation Teams (IOTs): which are responsible for
instrument calibration and quality control on calibrated data.
• Science Data Centers (SDCs): which are in charge of performing
the data processing and delivering science-ready data products.
Moreover, they are responsible for data validation and quality control.
Figure 2.4: Data processing organization and responsibilities. All science-
driven data processing is performed by the nine SDCs. Source: Romelli et
al. [8], ADASS XXVIII conference proceedings (in press).

The amount of data expected from the mission, considering only the
scientific data, is roughly 100 GB per day, with a total of 30 PB for the entire
mission. After all the processing steps needed to obtain the science
products, the EMC expects about 100 PB of data to handle [9]. For this
reason, this is one of the first times a space mission has begun a path of modernization
towards an IT infrastructure capable of handling big data. Euclid adopts a
distributed architecture for storing and processing data. Referring to Figure
2.4, we see how data processing and management takes place at nine sites in
different countries. Each site is an SDC that manages a part of the data pro-
cessing steps. The Organization Units (OUs) are working groups specialized
in different aspects of the scientific data reduction and analysis. Each SDC
supports one or more OUs, a relation represented
in the figure with a continuous line. A dashed line instead represents a
deputy support. The data products generated throughout the mission are
categorized in five processing levels:
• Level 1: data, as well as the associated telemetry, are unpacked and
decompressed.
• Level 2: data are submitted to a first processing phase that includes
a calibration step and a restoration one, where artifacts due to the in-
strumentation are removed.
• Level 3: data are in a form suitable for their scientific analysis and,
because of this, they are called science-ready data.
• Level E: external data coming from other missions or other projects.
Before being included in the processing cycle, they must be euclidised
to be consistent with the rest of the data.
• Level S: simulated data, useful in the period before the mission for
testing, validating and calibrating the systems developed for processing
the data.
Figure 2.5 shows a diagram of how the data will be processed. After being
downloaded, the data are routed to the SOC, which becomes the point
from which they are distributed. In the transformation of data from level 1 to
level 2, in addition to the data of VIS and NISP (distinguished into NIR for the
photometric part and SIR for the spectroscopic one), external data coming
from the so-called level E are added. These data come from instruments
of other missions, and this reflects the typical e-Science workflow. Equally
emblematic is the insertion of level S data, artificially generated
through simulations and meant to test the system before the
data coming from the satellite are available.
Figure 2.5: Simplified data analysis flow. After downloading the data from
the satellite, they are distributed to the SDCs via the SOC. The level S and
E data, characteristic traits of e-Science, can be spotted. Source: Dubath
et al. [9].
2.3 SDC-IT
The Italian Scientific Data Center (SDC-IT) is one of the nine SDCs
involved in the mission. It is located at the Astronomical Observatory of
Trieste (OATs, http://www.oats.inaf.it) and plays the role of both primary
and auxiliary reference for some OUs [8]. The author carried out this work at
the SDC-IT, which made available its infrastructure and its expertise, along
with the software tools used within the space mission. Euclid has its own
development environment, called the Euclid Development ENvironment (EDEN),
which was used at every stage of this thesis work. Chapter 3 describes it in
detail, together with the software produced by the author.
Chapter 3
Software Development
In the first part of this thesis work, three main activities were carried out. The
first was to become familiar with the development environment used in the
context of the mission. The second was to implement three software
modules compliant with the Euclid rules that could be the foundation for
an elementary scientific processing of astronomical images. Finally,
the third activity involved the creation of a pipeline wrapping the modules
developed in the second phase. The result was an Euclid-like scientific
pipeline running on the Euclid Pipeline Runner (EPR), the workflow manager
designed by the EMC. The development environment used was the
default one set up for the mission, and it is briefly illustrated in
this chapter. Subsequently, the programs used and the code produced are
described extensively. All the software was written in Python 3, as required
by the official Euclid coding rules.
3.1 Development Environment
The success of the Euclid mission depends on the collaboration of dozens
of public and private entities, involving hundreds of people. To ensure that all
the software modules developed can in fact run together, the Consortium
defined a common environment and a set of rules. The environment, called
the Euclid Development ENvironment (EDEN), is a collection of frameworks and
software packages that encloses all the tools available to the developers. As
a distributed file system, the CernVM File System (CVMFS,
https://cernvm.cern.ch/portal/filesystem), developed at CERN as part of its
infrastructure, is used. EDEN is locally available to the developers through
LODEEN (LOcal DEvelopment ENvironment), a virtual machine based on
Scientific Linux CentOS 7. The latest stable version, 2.0, is used in this work.
Table 3.1 lists the main features of EDEN 2.0.

    Feature                  Name       Version
    Operating System         CentOS     7.3
    C++ 11 compiler          gcc-c++    4.8.5
    Python 3 interpreter     Python     3.6.2
    Framework                Elements   5.2.2
    Version Control System   Git        2.7.4

Table 3.1: List of main features in EDEN 2.0. All tools and respective
versions are set for each environment. Source: Romelli et al. [8], ADASS
XXVIII conference proceedings (in press).
LODEEN is a local replica of the EDEN environment for Euclid devel-
opers embedded in a virtual machine. It runs on Scientific Linux CentOS 7
operating system with Mate desktop environment. The version 2.0 is used
in this work.
CODEEN It is a Jenkins-based (https://jenkins.io) system in charge of
performing the continuous integration of all the source code developed inside
the Euclid mission. Jenkins is an open-source automation tool that helps to
perform building, testing, delivery and deployment of software [10].
3.2 Elements
One of the fundamental components of EDEN is Elements,
a framework that provides both CMake facilities and Python/C++ utilities.
CMake (https://cmake.org) is an open-source tool for building, testing and
packaging software. Elements is derived from the CERN Gaudi Project
(http://gaudi.web.cern.ch/gaudi) [9], another open-source project that helps
to build frameworks in the domain of event data processing applications.
Every Elements project must follow a well-defined structure in order to be
used inside the mission environment. Every project has to be placed inside a
default folder of the Linux file system: /home/user/Work/Projects. This
provides a shared common location across the whole environment;
consequently, the name of each developed project has to be unique. The
projects created by the author implement this required structure.
Environment variables Elements needs a few environment variables in
order to build and install the projects. They come predefined in LODEEN,
but they deserve a quick overview. The first variable is BINARY_TAG, which
contains the information for the build. It is composed of four parts separated
by a dash:
1. Architecture's instruction set
2. Name and version number of the distribution
3. Name and version of the compiler
4. Type of build configuration
There are six types of build configurations; the default value is o2g and
represents a standard build. For the specific case of this work,
BINARY_TAG=x86_64-co7-gcc48-o2g. The variable CMAKE_PREFIX_PATH points to
the newest version of the Elements CMake library; in this case it equals
/usr/share/EuclidEnv/cmake:/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/usr.
The environment variable CMAKE_PROJECT_PATH contains the location
of the projects. In this case the value is /home/user/Work/Projects:
/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/opt/euclid, where the
first path indicates the user area for development and the second the
system-installed projects. At build time a project resolves its dependencies
through this variable.
Project Structure An Elements project has to be organized in a well-defined
structure:
ProjectName
CMakeLists.txt
Makefile
ModuleName1
ModuleName2
...
Once the project has been built by means of the make command provided by
CMake, a build.${BINARY_TAG} folder is generated by the framework.
Afterwards, an installation phase is required to execute the program and to
let other projects use it as a dependency. The structure then becomes:
ProjectName
CMakeLists.txt
Makefile
ModuleName1
ModuleName2
...
build.${BINARY_TAG}
InstallArea
Module Structure An Elements module, referred to in the project structure
as ModuleName1 or ModuleName2, is a reusable unit of software and can be
made of both C++ and Python code. It is conceived as an independent unit
that can be placed in any project. The structure is:
ModuleName
CMakeLists.txt
ModuleName
src
python
script
auxdir
ModuleName
conf
ModuleName
test
Finally, an important part of Elements is the E-Run command, thanks to
which projects can be executed within the framework.
3.3 Image Processing
After the astronomical images are acquired, properly merged and cleaned,
it is possible to extract an object catalog from them. These catalogs are then
used to perform the analysis. In this context an object is typically a light
source, such as a star, a galaxy or a galaxy cluster. The extraction phase
can be simplified as a process that involves three main steps:
1. Detection: The detection phase is a process generally referred to as
segmentation. From an astronomical image, the light sources are identified
and extracted from the background. Segmentation of nontrivial images
is one of the most difficult tasks in image processing [11] and represents
a critical phase of the scientific processing, needed to correctly identify
the sources and obtain accurate results.
2. Deblending: Often light sources are not clearly separated from each
other, and at first it is not possible to distinguish the single objects.
It is then necessary to run an additional deblending step, a
procedure for splitting highly overlapping sources.
3. Photometry: The final step for obtaining a source catalog is the
measurement of the luminous flux of the sources. This is done by integrating
the gray-level values of the pixels labeled as one object. Furthermore, it is
possible to apply several masks in order to obtain more accurate measurements.
After the execution of these steps, the output generated is a catalog, some
optional check images and an image highlighting the objects detected, as
shown, for instance, in fig. 3.5c.
3.4 Multi-Software Catalog Extractor
The author developed three Elements projects able to perform some basic
image processing. The whole system is intended to represent a prototype
of an Euclid-like project, and it was essential for gaining confidence with the
environment. Furthermore, these projects were the building blocks of the
Euclid-like pipeline, used to test the workflow managers against the chosen
metrics.
The software was designed to follow the Object Oriented Programming
(OOP) paradigm and to be easily adaptable to new packages or tools for
performing detection, deblending and photometry. As a whole, the software
consists of four Elements projects, three main ones plus one supporting
project with utilities. They are:
• PT_MuSoDetection,
• PT_MuSoDeblending,
• PT_MuSoPhotometry,
• PT_MuSoUtils.
Indeed, in order to use another image processing tool, it is sufficient to
extend the Python class Program located in the PT_MuSoUtils project, at the
path PT_MuSoCore/python/PT_MuSoCore/Program.py, and implement the
abstract method run with the proper logic. Each software module developed
for the catalog extraction uses SExtractor
(https://github.com/astromatic/sextractor) as its image processing tool. This
software, created by Emmanuel Bertin, has the main purpose of extracting
an object catalog from astronomical images [12] through the execution of
detection, deblending and photometry. It is part of the set of legacy software,
which, in this context, indicates external software officially integrated
within EDEN. Since SExtractor is not available as a Python package, the
subprocess module was used to create a new process that calls the software
through a shell command. The final code is in the SExtractor class.
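The SExtractor class itself is not reproduced here; the following is only a minimal sketch of the approach just described, under the assumption that the tool is invoked as the command sex (the class layout, method names and option handling are illustrative, not the actual PT_MuSoCore API).

    import subprocess
    from abc import ABC, abstractmethod


    class Program(ABC):
        """Common interface for any external image-processing tool."""

        @abstractmethod
        def run(self, image_path, options=None):
            """Run the tool on the given image and return the process exit code."""


    class SExtractor(Program):
        """Wraps the SExtractor executable through the subprocess module."""

        executable = "sex"  # assumed command name of the SExtractor binary

        def run(self, image_path, options=None):
            # Build a command line such as: sex image.fits -CATALOG_NAME out.cat
            cmd = [self.executable, image_path]
            for key, value in (options or {}).items():
                cmd += [key, str(value)]
            # check=True raises CalledProcessError if SExtractor exits with an error.
            completed = subprocess.run(cmd, check=True,
                                       stdout=subprocess.PIPE,
                                       stderr=subprocess.PIPE)
            return completed.returncode


    # Usage sketch (option keys follow SExtractor's "-PARAMETER value" convention):
    # SExtractor().run("image.fits", {"-CATALOG_NAME": "objects.cat"})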
3.4.1 Developing and Versioning Methods
Git (https://git-scm.com) was used as the version control system, following
as much as possible a clean and easy-to-understand development workflow.
It was chosen to follow the Gitflow Workflow, defined for the first time by
Vincent Driessen (see fig. 3.1) and later adopted as part of the Atlassian guide
(https://it.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow).
This type of workflow involves a main branch, by convention called master,
which includes all the commits that represent an incremental version of the
software. In fact, no developer can work directly on the master, which can
only be merged with other support branches, in particular a release (for last
tweaks and the version number change) or a hotfix (used to fix critical issues
in the production software). When changes are merged into the master, by
definition that becomes a new product release [13].

Figure 3.1: A successful Git branching model. This workflow was followed
during the development of the Elements projects. It favors a clear development
path that facilitates the creation of a software product, especially within large
teams; however, it remains a good pattern to follow even during autonomous
development. Source: Git Branching - Branching Workflows [14].

As regards the versioning system, semantic versioning was applied to give a
standard enumeration to each project version. A version number is made of
three numbers which in turn represent a major release, a minor release and a
patch. Each number has to be incremented according to the following
scheme [15]:
• Major version: when you make incompatible API changes.
• Minor version: when you add functionality in a backwards-compatible
manner.
• Patch version: when you make backwards-compatible bug fixes.
The second branch that is always present in a Gitflow repository is
develop, to which the changes believed to be stable and ready for a future
release are added. In addition to the release and hotfix branches, a branch
of type feature can be used to develop new features that, once completed and
tested, are merged into the develop branch.
3.4.2 Detection
The detection phase occurs by thresholding. Before proceeding, how-
ever, a filtering is necessary to smooth the noise that would otherwise cause
false positives: a Gaussian filter with standard deviation calibrated for the
particular case is used. Moreover, in astronomical images there is often a
non-uniform background that could generate ambiguities when the thresh-
olding is applied, especially in the most crowded areas. For this reason, as
a first step towards segmentation, a background map is generated that es-
timates the light outside the objects to be detected [16]. This map will be
subtracted from the image. As output of the detection we obtain an image
partitioned into N+1 regions: the first one, marked by pixels with value 0,
is the background, while the remaining N regions are the extracted objects
and are labeled with values from 1 to N.
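The following is a deliberately simplified NumPy/SciPy illustration of the detection-by-thresholding idea just described; it is not the SExtractor algorithm (the background is estimated with a single median value rather than a background map, and the threshold constant is arbitrary).

    import numpy as np
    from scipy import ndimage

    def detect_sources(image, sigma_smooth=2.0, k=3.0):
        """Toy segmentation: smooth, subtract a crude background, threshold, label."""
        smoothed = ndimage.gaussian_filter(image, sigma=sigma_smooth)
        background = np.median(smoothed)     # crude scalar background estimate
        noise = smoothed.std()               # crude noise estimate
        mask = (smoothed - background) > k * noise
        segmentation, n_objects = ndimage.label(mask)
        return segmentation, n_objects       # 0 is background, objects are 1..N

    # Quick check on a synthetic frame with a single bright source:
    image = np.random.normal(0.0, 1.0, size=(64, 64))
    image[30:34, 30:34] += 50.0
    segmentation, n_objects = detect_sources(image)
    print(n_objects)  # expected: 1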
Figure 3.2: Multithreshold deblending. This technique aims to deblend over-
lapping sources that have been detected as a single object. Source: Bertin
[16].
3.4.3 Deblending
After detection, the segmented image is subjected to a filtering phase that
aims to identify distinct overlapping objects that the first thresholding step
has recognized erroneously as a single object. During this phase, composite
objects are deblended using a multithreshold hierarchical method. A hint
of how the algorithm works can be seen in fig. 3.2.
3.4.4 Photometry
The purpose of this last phase is to measure the luminous flux belonging
to each object identified after the deblending phase. For each set of pixels
labeled with the same number, the sum of the pixel values is calculated to
estimate the luminous flux. Some masks can be applied around objects,
called apertures, to obtain different type of measurements (see fig. 3.3).
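Continuing the toy example above, an isophotal-style flux for each detected object can be estimated by summing the pixel values under each label of the segmentation map (again an illustration, not SExtractor's aperture photometry):

    import numpy as np
    from scipy import ndimage

    def measure_fluxes(image, segmentation):
        """Sum the pixel values belonging to each labeled object (label 0 is background)."""
        labels = np.arange(1, segmentation.max() + 1)
        fluxes = ndimage.sum(image, labels=segmentation, index=labels)
        return dict(zip(labels.tolist(), np.atleast_1d(fluxes).tolist()))

    # Returns a {label: flux} dictionary, one entry per detected source.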
Figure 3.3: Four different types of apertures available in SExtractor for the
photometric measurement. Source: Holwerda [17].
3.5 Utils Module
The PT_MuSoUtils module contains five important packages:
• PT_MuSoCore,
• IOUtils,
• SkySim,
• FITSUtils,
• DataModel.
Since each pipeline task can use a different program to perform its
processing step, a system for passing the parameters independently of the rest
of the code has been created. The configuration parameters can be specified
in a JSON (https://www.json.org) file that a parser developed for the occasion
can read. The language was chosen for its human readable, and writable,
structure; it is usually very easy to handle inside the code, with great support
in Python (as in many other programming languages). Any program
can have its own JSON file named <program>.json located in the path
PT_MuSoUtils/PT_MuSoCore/auxdir/config. The structure of the file
follows this base scheme:
{
    "program": {
        "full_name": <program_name>,
        "command_name": <cmd_key>
    },
    "configurations": {
        <configuration_type>: {
            <configuration_unit>: {
                <parameter_name>: <parameter_value>
            }
        }
    }
}
The configuration file is a JSON object with two keys: program and
configurations.
The value of program is a nested JSON object that expects two keys with
string values:
• full_name: indicates the name of the program and is useful for logging
purposes.
• command_name: indicates the terminal command to call if the program
is executed through subprocess.
The value of configurations is a nested JSON object that accepts any number
of items representing a configuration type. Each configuration type is a nested
JSON object that accepts any number of configuration units. Each configu-
ration unit in turn accepts any number of key/string items representing an
input parameter for the application.
• configuration type: a collection of configuration units, intended to
group all the parameters needed for a certain task, e.g. the
detection or the deblending.
• configuration unit: a collection of parameters that can be used
as a modular unit or a preset for some particular instance of the
configuration type, e.g. the detection of light sources in a crowded region
of the sky.
A configuration unit can be called modular because other configuration
units within the same file can inherit its parameters, without the need to
rewrite them, in an OOP fashion. In order to do so, it is sufficient to add
the key inherit, which expects a JSON array as its value. The items of the
array are the configuration unit keys from which to inherit the parameters. If
the configuration unit conf_unit_1 in the configuration type conf_type_1
wants to inherit from the configuration unit conf_unit_2 contained in the
configuration type conf_type_2, then the value of the key inherit has to be
"conf_type_2.conf_unit_2". For example:

"configurations": {
    "base": {
        "catalog_ascii": {
            "-CATALOG_TYPE": "ASCII_HEAD"
        }
    },
    "detection": {
        "background_checkimages": {
            "inherit": [
                "base.catalog_ascii"
            ],
            ...
        }
    }
}
If two parameters have the same key, only the last value encountered by the
parser is preserved. Each configuration type can be called by a different
project, and some of them require loading other external files from their own
auxiliary directory. For this reason, a constant auxdir can be defined inside
a configuration type, and the parser will replace at run time any sub-string
matching "{auxdir}" with the value of the constant.
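The parser code is not reproduced in this chapter; the sketch below only illustrates the inheritance and {auxdir} substitution rules just described (the function name and the flattening strategy are assumptions for illustration, not the actual PT_MuSoCore implementation).

    import json

    def resolve_unit(config, type_name, unit_name, auxdir=""):
        """Return the flat parameter dict of one configuration unit, applying the
        'inherit' references first so that locally defined values win."""
        unit = config["configurations"][type_name][unit_name]
        params = {}
        for ref in unit.get("inherit", []):
            parent_type, parent_unit = ref.split(".")
            params.update(resolve_unit(config, parent_type, parent_unit, auxdir))
        for key, value in unit.items():
            if key == "inherit":
                continue
            if isinstance(value, str):
                value = value.replace("{auxdir}", auxdir)
            params[key] = value
        return params

    # Usage sketch with a file following the scheme above:
    # with open("sextractor.json") as handle:
    #     config = json.load(handle)
    # print(resolve_unit(config, "detection", "background_checkimages",
    #                    auxdir="/path/to/auxdir"))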
3.5.1 The Masking Task
Since the deblending task is a refinement phase with respect to the
detection, it needs to know the segmented image; however, it cannot attempt to
deblend an object if that object is made up of pixels that all have the same
value and is therefore not representative of the original source. For this reason,
the input of the deblending must be the original image masked with the result
of the detection. Pixels that have been labeled as background are set to a value
of -1000 in order not to interfere with the thresholding, while the remaining
pixels are left unchanged. For this purpose an additional Python module,
called FITSMask, was developed.
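A minimal sketch of the masking operation performed by FITSMask (assuming the image and the segmentation map are available as NumPy arrays; the -1000 sentinel is the value quoted above):

    import numpy as np

    def mask_background(image, segmentation, mask_value=-1000.0):
        """Return a copy of the image where background pixels (label 0 in the
        segmentation map) are replaced by a sentinel value; detected sources
        keep their original pixel values."""
        masked = image.copy()
        masked[segmentation == 0] = mask_value
        return masked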
3.5.2 FITS Images
The Flexible Image Transport System (FITS) format was created for
sharing astronomical images among observatories. The control authority for
the format is the International Astronomical Union FITS Working Group
(IAU-FWG, https://fits.gsfc.nasa.gov/iaufwg). The need for this standard
stems from the difficulty of standardizing the format among observatories with
different characteristics and from the impossibility of creating adapters among
all the different formats. Consequently, a standard was created such that
every observatory is able to transform data from FITS to its own internal
format, and vice versa.

A FITS file is composed of blocks of 2880 bytes, organized in a sequence
of Header and Data Units (HDUs), eventually followed by special records. The
header consists of one or more blocks of 2880 bytes and contains stand-alone
information, or metadata, that describes the subordinate unit of data
[18]. Several Python modules are available to manipulate this format, such
as astropy.io.fits, used in this work within the FITSUtils module. This was a
necessary step for the Euclid-like pipeline because the mission uses this file
format for storing and distributing the data collected by the space telescope.
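As an illustration of the HDU structure described above, a short astropy.io.fits sketch (the file name is a placeholder):

    from astropy.io import fits

    # Open a FITS file and inspect its Header and Data Units (HDUs).
    with fits.open("example_image.fits") as hdu_list:
        hdu_list.info()                 # summary of every HDU in the file
        header = hdu_list[0].header     # metadata of the primary HDU
        data = hdu_list[0].data         # pixel array (NumPy) of the primary HDU
        print(header.get("NAXIS"), None if data is None else data.shape)

    # Writing an array (plus header) back to a new FITS file:
    # fits.writeto("output.fits", data, header, overwrite=True)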
3.5.3 SkySim
As the final part of the Euclid-like pipeline, SkySim, a basic sky simulator,
was developed, which has the capability of generating synthetic astronomical
images starting from a source catalog. The only purpose of this module
was to briefly validate the pipeline and check whether the extracted catalog was
consistent with the one given as input. First, an image with no sources was
generated (fig. 3.4), followed by an image with one source (fig. 3.5), with two
identical sources (fig. 3.6), with two sources with different magnitude (fig.
3.7), with two overlapping sources (fig. 3.8) and finally with two overlapping
sources with different magnitude (fig. 3.9). Each of these tests gave a positive
result and the pipeline performed as expected, extracting the right number
of sources with the correct photometric estimation.
Figure 3.4: No sources. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.5: One source. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.6: Two identical sources. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.7: Two sources with different magnitude. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.8: Two identical overlapping sources. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.9: Two overlapping sources with different magnitude. (a) Original, (b) Deblended, (c) Photometry.

Real images are typically degraded by noise. SkySim has the ability to
add additive noise with a Gaussian distribution to an image according to a
predetermined Signal-to-Noise Ratio (SNR) value in dB. Gaussian noise is
defined by its mean µN and its standard deviation σN. Given the desired SNR
of the synthetic image I, it is possible to calculate the σN of the Gaussian
noise to add:

σN = sqrt( var(I) / 10^(SNR/10) )

By adding pixel by pixel the values of the matrices that represent the
noise and the original image, SkySim outputs an image with the chosen
SNR.
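A minimal sketch of this noise-addition step (assuming the image is a NumPy array and the noise has zero mean; it follows the σN relation given above):

    import numpy as np

    def add_gaussian_noise(image, snr_db, rng=None):
        """Add zero-mean Gaussian noise so that the output image has the
        requested signal-to-noise ratio, expressed in dB."""
        rng = rng or np.random.default_rng()
        sigma_noise = np.sqrt(image.var() / 10.0 ** (snr_db / 10.0))
        return image + rng.normal(0.0, sigma_noise, size=image.shape)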
3.6 Pipeline Project
The pipeline project is the part of the developed software that aims to
define and build the Euclid-like pipeline. Two files are required in order
to define a pipeline: the Package Definition and the Pipeline Script. In
the package definition file, as required by the Euclid Pipeline Runner, four
Executable Python objects were implemented, one for each task. Listing 3.1
shows the first task of the pipeline, which performs the detection
phase; its inputs and outputs are specified as paths relative to workdir. The
Executables of masking (listing 3.2), deblending (listing 3.3) and photometry
(listing 3.4) have been defined in a completely similar manner.
pt_sextractor_detection = Executable(
    command=' '.join([
        'E-Run PT_MuSoDetection 0.2',
        'SourceDetectorPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)

Listing 3.1: Definition of the detection task
mask_image = Executable(
    command=' '.join([
        'E-Run PT_MuSoUtils 0.2',
        'FITSMask',
        '--mask_value -1000'
    ]),
    inputs=[Input('fits_image'),
            Input('mask')],
    outputs=[Output('masked')]
)

Listing 3.2: Definition of the masking task
pt_sextractor_deblending = Executable(
    command=' '.join([
        'E-Run PT_MuSoDeblending 0.2',
        'SourceDeblenderPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)

Listing 3.3: Definition of the deblending task
pt_sextractor_photometry = Executable(
    command=' '.join([
        'E-Run PT_MuSoPhotometry 0.2',
        'SourcePhotometerPipeline',
        'sextractor',
        '--log-level debug',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image'),
            Input('fits_image2'),
            Input('assoc_catalog')],
    outputs=[Output('apertures_map'),
             Output('catalog', mime_type='txt')]
)

Listing 3.4: Definition of the photometry task
In addition, the pipeline script file was created, in which the
sources_extractor function, decorated with the @pipeline decorator provided
by the Euclid framework, is defined (see listing 3.5). The logic inside it
specifies how to build the pipeline that will be executed by the Euclid
Pipeline Runner.
@pipeline(outputs=('segmentation_map',
                   'apertures_map',
                   'catalog'))
def sources_extractor(image):
    seg_map, _ = pt_sextractor_detection(fits_image=image)
    masked = mask_image(fits_image=image, mask=seg_map)
    seg_map_deb, deb_catalog = pt_sextractor_deblending(fits_image=masked)
    apertures, catalog = pt_sextractor_photometry(fits_image=masked,
                                                  fits_image2=image,
                                                  assoc_catalog=deb_catalog)
    return seg_map_deb, apertures, catalog

Listing 3.5: Definition of the Euclid-like pipeline that will be executed by
the EPR.
Chapter 4
External Workflow Managers
This thesis proposes a comparison of workflow managers. As a
starting point, it was decided to use the Euclid-like pipeline already developed
and to create two more pipelines implemented by means of two external
tools: Spotify's Luigi and Airbnb's Airflow. They have been chosen because
they are written in Python, and thus EDEN compliant, are open-source, and
are very popular in the data workflow domain.
4.1 Luigi
Luigi (https://github.com/spotify/luigi) is a Python package developed by
the Spotify team and released in 2012 under the Apache License 2.0, the second
major release of the permissive free software license written by the Apache
Software Foundation (https://www.apache.org/licenses/LICENSE-2.0). This
tool helps to build pipelines of batch jobs [19]. Its features include workflow
management, task scheduling and dependency resolution. One of Luigi's strengths
is that it manages failures in a smart way, providing a built-in system for checking
the status of tasks. If a task fails and has to be rescheduled and rerun, Luigi walks
backwards across the dependency graph until it encounters a successfully completed
task. It then reschedules only the tasks downstream of that point, thus not
scrapping the work done without failures. This can save a lot of time and computing
resources if failures are not infrequent or if a pipeline shares some tasks already
executed by another pipeline.
4.1.1 Pipeline Definition
A Luigi pipeline is made of one or more tasks. For each task we can
define an input, an output and the business logic to execute. The input
is the set of parameters passed and the dependency list. The output is
generally a file, called target, that will be written to the local file-system or
to a distributed one, such as Hadoop Distributed File System (HDFS).
Target The target is the product the task has to yield, and it is defined
through a file-system path. Luigi provides the class Target to represent this
concept. By default, the existence of the file is taken as proof that the task
executed successfully, and it determines whether the task status is completed
or not. It is possible, however, to override the default logic that decides when
the task has to be considered done.
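As a minimal sketch of how that default logic can be overridden (the class name and the non-empty-file criterion are illustrative assumptions, not taken from the Euclid pipeline), a task could redefine its complete() method as follows:

import os
import luigi

class CalibrationTask(luigi.Task):
    def output(self):
        return luigi.LocalTarget('/tmp/calibration.txt')

    def complete(self):
        # Consider the task done only if the target exists and is non-empty,
        # instead of relying on mere file existence.
        path = self.output().path
        return os.path.exists(path) and os.path.getsize(path) > 0

    def run(self):
        with self.output().open('w') as out:
            out.write('calibration table\n')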
Task A task consists of four fundamental parts:
• Parameters: are the input arguments for the Task class and, with
the class name, uniquely identify the task. They are defined inside the
class using a Parameter object or a subclass of it.
• Dependencies: are defined through a task collection and set which
other tasks have to be executed successfully before the current one can
start. Such a collection is the object returned by the overridden method
requires.
• Business logic: is defined within the overridden method run. This
part of the code is in charge of producing and storing the output of the task.
• Outputs: are defined through a collection of Target objects. Each
target must point to the exact location of the file created by the business logic.
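A minimal sketch combining these four parts is shown below; the class, parameter and file names are illustrative assumptions and not the actual tasks of the Euclid-like pipeline:

import luigi

class DetectionTask(luigi.Task):
    # Parameters: together with the class name they uniquely identify the task
    workdir = luigi.Parameter()
    fits_image = luigi.Parameter()

    def requires(self):
        # Dependencies: tasks that must complete successfully before this one
        return []

    def output(self):
        # Outputs: targets pointing to the files produced by run()
        return luigi.LocalTarget('%s/detection_catalog.txt' % self.workdir)

    def run(self):
        # Business logic: produce and store the output of the task
        with self.output().open('w') as out:
            out.write('catalog placeholder\n')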
Tasks Execution All the tasks defined in a Luigi pipeline are executed
in the same process, which makes debugging clear but also sets a limit on the
number of tasks a pipeline can be made of. Generally speaking, however, this
does not represent a real problem until thousands of tasks are executed in the
same pipeline [19]. The execution of a task follows these steps:
1. Check if the predicate that defines the completed status is satisfied. If
it is, then check the next task in the graph. If the current task is the
last one, the pipeline is completed.
2. Resolve all the dependencies. If one task from the dependencies is not
completed, then execute that task.
3. Execute the run method.
In order to start the entire pipeline, it is sufficient to call the last task defined
in it and, thanks to the recursive algorithm, all tasks will be executed. Luigi
does not come with an embedded triggering system, but one can easily be
implemented.
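A sketch of such a manual trigger is shown below; it assumes that PhotometryTask is the last task of the pipeline (defined along the lines of the previous sketch, in a hypothetical module catalog_pipeline) and that the central luigid scheduler is running. Calling luigi.build on the last task is enough, since Luigi resolves the dependency graph backwards and runs whatever is not yet complete:

import luigi
# PhotometryTask is assumed to be importable from the module defining the pipeline tasks
from catalog_pipeline import PhotometryTask

luigi.build(
    [PhotometryTask(workdir='/data/run001', fits_image='input.fits')],
    workers=1,
    local_scheduler=False  # use the central luigid scheduler instead of a local one
)

A simple triggering system can then be obtained, for instance, by invoking such a script from a cron job or from a small watcher that monitors the arrival of new input files.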
4.1.2 Integration with Elements
It’s not obvious that two frameworks can operate together and be easily
integrated. In this section we will see how the author proposed to accomplish
the task of developing a Luigi pipeline that executes the Elements projects
previously written and described in Chapter 3.
Task Implementation In the case of this work, the logic to execute is
essentially a call to the Elements project through the bash command E-Run.
The command is preset in EDEN and is associated with the execution of
the script that starts the Elements execution. The ExternalProgramTask, a
subclass of Task, was therefore used. This class is part of Luigi's contrib
module; it manages the logic of the run method and exposes another method,
program_args, whose return value is the list of strings that will be the
argument for the subprocess.Popen class. It was chosen to implement the
pipeline following the behavior of the EPR, so as to obtain a system with a
similar usage. As input, the path of the working directory and the data model
file are required. Optionally, an id parameter can be specified to differentiate
otherwise identical tasks (mainly for test and debug purposes). Each task has
its own target file defined inside it; the path must be relative to the working
directory. The return value of the program_args method is simply the command
seen in listings 3.1 - 3.4, provided as a list of arguments.
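A minimal sketch of such a task is shown below; it reuses the E-Run arguments of the detection Executable in listing 3.1, while the class name, the parameter names and the target file name are illustrative assumptions:

import os
import luigi
from luigi.contrib.external_program import ExternalProgramTask

class DetectionTask(ExternalProgramTask):
    workdir = luigi.Parameter()
    fits_image = luigi.Parameter()
    run_id = luigi.Parameter(default='')  # differentiates otherwise identical tasks

    def output(self):
        # Target file produced by the Elements executable, relative to the working directory
        return luigi.LocalTarget(os.path.join(self.workdir, 'detection_catalog.txt'))

    def program_args(self):
        # List of strings handed by ExternalProgramTask to subprocess.Popen
        return ['E-Run', 'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline', 'sextractor',
                '--preset', 'pipeline',
                '--workdir', self.workdir,
                '--fits_image', self.fits_image]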
4.2 Airflow
Airflow (https://airflow.apache.org) is a Python package developed by Airbnb
and released in 2015 under the Apache License 2.0. In March 2016 the project
joined the Apache Software Foundation's incubation program
(https://incubator.apache.org) and has been growing very quickly since then.
The Incubator project is the path towards becoming part of the Apache Software
Foundation (ASF) for projects that want to contribute to the Apache foundation:
all the code that will become part of the ASF must first pass through an
incubation period within this project [20]. Airflow is a tool for describing,
executing, and monitoring workflows. One of Airflow's strengths is the simplicity
with which a pipeline can be defined, although it does not offer a native way to
specify how intermediate results are managed. Moreover, it provides a rich
interactive graphical user interface that makes it easy to monitor the execution
progress and state.
4.2.1 DAG Definition
In Airflow every pipeline is defined as a Directed Acyclic Graph (DAG).
As a matter of fact, the tool offers a Python class DAG that contains all
the information needed for the tasks execution, such as the dag id, the
dependency graph, the start time, the scheduling period, the number of retries
allowed and many other options. A task is a node of the dependency graph and
is coded as an object of the class BaseOperator. That class is abstract and is
designed to be inherited to define one of the three main operator types:
• Action operators, which perform an action or trigger one
• Transfer operators, in charge of moving data from one system to another
• Sensor operators, which run until a certain criterion is satisfied, such as the
existence of a file or a given time of the day being reached.
4.2.2 Pipelining Elements Projects
As done with Luigi and described in Section 4.1, the author developed an
Airflow pipeline inside the Elements framework for executing the projects
described in Chapter 3. First, a DAG object has to be initialized with a few
mandatory arguments as input: dag_id, default_args and schedule_interval.
Listing 4.1 shows how the DAG can be created.
default_args = {
    'owner': 'airflow',
    'start_date': datetime.utcnow(),
    'email': ['airflow@example.com'],
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(dag_id='catalog_extractor', default_args=default_args,
          schedule_interval='@once')
Listing 4.1: Definition of the main DAG with Airflow for the catalog extractor pipeline.
To define a task performed by a bash command, there is the BashOperator
class, which extends BaseOperator. The string representing the bash command
to be executed can be passed as input to the operator. An example is shown in
listing 4.2.
cmd = ' '.join(['E-Run',
                'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline',
                'sextractor',
                '--preset', 'pipeline',
                '--workdir', work_dir,
                '--logdir', logdir,
                '--fits_image', input_file,
                '--segmentation_map', detect_fits,
                '--catalog', detect_cat
                ])

detection_task = BashOperator(
    task_id='detection',
    bash_command=cmd,
    dag=dag)
Listing 4.2: Definition of the detection task with Airflow.
After defining the tasks, there are several ways to define the dependency
graph. In this case it was chosen to use the shift operator, which is overloaded
in BaseOperator to express the concatenation of tasks.
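Assuming the remaining three tasks are defined with BashOperator in the same way as detection_task in listing 4.2 (the names masking_task, deblending_task and photometry_task are illustrative), the linear dependency graph of the catalog extractor can then be expressed as:

# Each >> marks the left operand as upstream of the right one,
# producing the chain detection -> masking -> deblending -> photometry.
detection_task >> masking_task >> deblending_task >> photometry_task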
Chapter 5
Comparison
The purpose of this work is to compare workflow managers coming from
different contexts and to verify, through a test on the Euclid case study,
whether such external tools can be adopted as an integral part of the system.
Four Elements projects have been developed that act as building blocks for the
pipelines, and the EPR, Luigi and Airflow workflow managers have been studied.
In this chapter we will propose ten metrics to evaluate
whether Luigi or Airflow could be used within the mission. Subsequently, the
executions necessary for the comparison will be performed and the results
will be analyzed.
5.1 Metrics
At this point of the work everything is set to execute the pipelines and
gather the data needed for the comparison. As comparison metrics we pro-
posed:
• Execution time: is an indicator of the overhead of the tool and shows
how the scheduler handles the executions.
• RAM memory usage: will give a measure of the memory usage
needed by the tool. It’s important because the resources are limited.
• CPU usage: will give a measure of the CPU usage needed by the tool.
It’s important because the resources are limited.
• Error handling: will show how the tool can handle failure inside the
system. It’s critical for automatic recovery and minimization of human
intervention.
• Usability and configuration complexity: will show how complex
the tool is to use and configure.
• Distributed computing management: will show the capability of
the tool to execute tasks in a distributed environment.
• Workflow visualization: is important for monitoring the execution
progress and spot any problem inside the system.
• Integration in a framework: will show if the tool is easy to use
inside an existing framework, critical for the current case study.
• Triggering system: will show if the tool is capable of automating the
execution triggering.
• Logging quality: will show whether the logs produced are in fact useful
for the developers to debug the system in case of failure.
At the end of the data collection we will draw the conclusions by comparing the
two types of workflow manager. The machine used for the tests has an Intel(R)
Core(TM) i7-4770 CPU @ 3.40 GHz, with 4 cores and a total of 10.46 GB of RAM
dedicated to the virtual machine.
5.2 Execution Time
In order to profile the three workflow managers with respect to time, it
was sufficient to use the Linux built-in command time and some generated
logs. Each tool was set up to run with no delay and minimum waiting time,
and was tested with three different images given as input to the pipeline: a
small one of 252x252 pixels, a large one of 5000x5000 pixels, and another small
image of 256x256 pixels generated by SkySim, referred to as simulated. All the
results are averaged over 10 runs. The time command reports three numbers:
real time, user time and sys time. Their meanings are [21]:
• real: or wall clock time, indicates the total time elapsed from start to
finish.
• user: CPU time spent within the process outside the Linux kernel, or
in user mode. In user mode, the process can’t directly access hardware
or reference memory. Code running in this mode can perform lower
level accesses only through system APIs.
• sys: CPU time spent within the process inside the Linux kernel, or
in kernel mode. In kernel mode, the process can execute any CPU
instruction and access any memory address, without any restriction.
This mode is generally reserved for low-level functions of the operating
system and must therefore consist of trusted code. Some privileged
instructions, such as interrupt handling and input/output management, can
only be executed in kernel mode; if they are issued in user mode, a trap is
generated.
From tables 5.1, 5.2 and 5.3 we can notice three main differences. First of
all, EPR executes all three pipeline instances in roughly the same time,
due to internal scheduling settings that are not meant to be changed by the
user. Secondly, although EPR and Airflow perform about the same with
the large image as input, in the case of the two smaller images Airflow
         EPR      Luigi    Airflow
real     40.419   16.665   38.325
user     21.258   14.570   21.194
sys       8.346    0.922    1.688
Table 5.1: Time needed for executing the scientific pipeline on the large image. Values are in seconds, divided by real, user and sys time.
         EPR      Luigi    Airflow
real     40.413    2.652   30.171
user     21.481    1.607    8.701
sys       8.649    0.298    1.022
Table 5.2: Time needed for executing the scientific pipeline on the small image. Values are in seconds, divided by real, user and sys time.
         EPR      Luigi    Airflow
real     40.409    2.634   30.008
user     21.871    1.781    8.526
sys       8.816    0.347    1.181
Table 5.3: Time needed for executing the scientific pipeline on the simulated image. Values are in seconds, divided by real, user and sys time.
can complete the pipeline faster, reducing the real time in accordance with the
lower user and sys times. This means that Airflow keeps its overhead steady
in each execution. EPR, on the other hand, always needs the same real, user
and sys times in order to complete all the tasks of the pipeline. This leads to
the conclusion that EPR has fixed time slots in which it performs the tasks, and
this behavior cannot be modified. It must also be noted that its sys time is
consistently much higher than that of the other tools. Finally, Luigi confirms
itself as the most lightweight workflow manager among the three from a time
perspective. Its scheduler executes each task right away, as soon as enough
system resources are available.
5.3 Memory Usage
As memory profiling tool, the Python memory profiler
(https://pypi.org/project/memory-profiler) was chosen: a module for recording
the memory consumption of a process and its children, based on psutil
(https://pypi.org/project/psutil). It was used in the time-based configuration,
where the memory usage is plotted as a function of the execution time. Indeed,
it is possible to directly plot the data after recording them by means of the
same module. In order to start the monitoring, the command mprof run <script>
can be used; after the script execution is done, the command mprof plot shows
the plot of the last recorded run. Because all pipelines are built with Elements
projects as tasks, all of them use at least one subprocess in order to run. For
this reason the flag --include-children (or -C for short) was set first, and then
--multiprocess (or -M for short), to tell memory profiler to consider all the
children created by the main process. The include-children flag adds together
the memory used by the main process and all its children, obtaining a single
comprehensive value at each instant. The multiprocess flag considers all
children independently, keeping the memory usage data separated for each of
them. In this section MB will be used as a synonym of MiB, equivalent to
2^20 bytes, and GB as a synonym of GiB, or 2^30 bytes.
This method was used to profile all three workflow managers, although it
was not possible to obtain meaningful results from the EPR execution due to
its implementation; the workaround will be described later in this chapter.
After profiling the three workflow managers, each one executing the scientific
pipeline with the three images as inputs, the plots in figs. 5.1 - 5.9 were
obtained. Figures 5.4 - 5.9 show, as expected, an increase in memory usage
due to the image processing in correspondence with the execution of the four
tasks. On the other hand, figures 5.1 - 5.3, the plots of EPR RAM usage, seem
to show no detectable execution. Comparing fig. 5.1b with figs. 5.4b and 5.7b,
it is evident that the peak memory usage is not compatible with the amount
required by the second task execution (about 350 MB). Furthermore,
the three executions, which have to process images with significant differences
in size, seem to require the same amount of memory, with an average of
roughly 45 MB. Analyzing the data gathered per process, we can see that
every execution presents a main process of 38 MB and a main child process
of 13 MB. One possible explanation for this behaviour could be the resource
limitation imposed by the EPR, but this limit was set to 1 GB, far above both
the values observed and the amount needed by the tasks. Consequently, a
deeper study of the Python source code related to scheduling and pipeline
execution in the Euclid software was conducted. It was found out that the
tool is in part built by means of the package Twisted (https://twistedmatrix.com)
and its reactor component, to which the actual task execution is delegated.
The reactor works in the background in a separate thread and communicates
with the main thread through a callback system. This behavior is not detected
by the profiler, which records only the memory used by EPR and Twisted,
explaining why all plots shared a common trend. This finding led to trying
other profilers, but none seemed to work properly in this situation. It was then
decided to develop a custom tool for memory profiling, using the features of
the top bash command, which can report both memory and CPU usage; this
turned out to be very useful also in the CPU load analysis.
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. The memory usage is similar throughout the duration of the profiling.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Also in this case, no memory allocations compatible with the image processing under examination are observed.
Figure 5.1: EPR RAM memory usage versus time (large image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. The memory usage is similar throughout the duration of the profiling, as it happens in fig. 5.1.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Also in this case, no memory allocations compatible with the image processing under examination are observed.
Figure 5.2: EPR RAM memory usage versus time (small image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. The memory usage is similar throughout the duration of the profiling, as it happens in fig. 5.1.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Also in this case, no memory allocations compatible with the image processing under examination are observed.
Figure 5.3: EPR RAM memory usage versus time (simulated image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline. Three of them are SExtractor processes (in green), while the highest peak, in red, represents the memory usage of the Python masking project.
Figure 5.4: Luigi RAM memory usage versus time (large image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured.
Figure 5.5: Luigi RAM memory usage versus time (small image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured.
Figure 5.6: Luigi RAM memory usage versus time (simulated image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Airflow needs seven processes in order to execute the pipeline.
Figure 5.7: Airflow RAM memory usage versus time (large image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Airflow needs seven processes in order to execute the pipeline.
Figure 5.8: Airflow RAM memory usage versus time (small image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Airflow needs seven processes in order to execute the pipeline.
Figure 5.9: Airflow RAM memory usage versus time (simulated image).
5.3.1 Top Based Profiling Tool
The Euclid Pipeline Runner has a structure that does not allow profiling its
execution directly; in fact, it runs the tasks asynchronously through a Twisted
job submission. The top command provides an overview of the resource usage
of the whole system at a particular instant, both as overall statistics and
divided by process. The visualization is interactive and it is possible to display
processes sorted by memory or CPU usage. Top also allows specifying the time
interval between samples, in this case 0.1 s. The -c flag tells top to show the
command line associated with each process instead of just the program's name;
this was necessary in order to include the right processes in the profiling, since
top shows a system snapshot and it is crucial to manually filter all and only the
wanted tasks. Finally, the -b flag was set to run the command in batch mode,
which is needed when the output has to be redirected into a file. Each sample
was indeed written to a file, which was subsequently pre-processed until a
Python array ready for the actual analysis was obtained. In the end, the bash
command was top -d 0.1 -c -b > out.top. The pre-processing and processing
phases were written in a Python script. The steps used for the pre-processing
were:
1. Load the content of the top file as string;
2. Extraction of lines containing the keywords that identify the processes
to profile;
3. Additional processes filtering to remove unwanted items;
4. Replacement of multiple blank lines with a single one;
5. Text strip to remove void parts at the beginning and end of the file
content;
6. Columns extraction, keeping only the memory and CPU load values;
7. Split string with blank line as separator, obtaining a sample array, where a sample contains the values of one or more simultaneous processes;

[Plot: memory used (in MiB) versus time (in seconds).]
Figure 5.10: EPR memory usage versus time obtained with the custom profiler (large image). Unlike what we saw in fig. 5.1a, in this case the four peaks corresponding to the tasks execution are clearly visible.
8. Split sample lines with space as separator, obtaining a values array for
each process;
9. For each sample, sum corresponding values of all relative processes,
obtaining one value per parameter per sample.
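As a minimal sketch of how this pre-processing can be implemented (the file name out.top, the keyword list and the column indices are illustrative assumptions, not the exact script used in this work), the steps above could look roughly as follows:

import re

def parse_top_output(path, keywords, mem_col=5, cpu_col=8):
    # Assumes the default top batch layout, where RES (in KiB) is column 5
    # and %CPU is column 8; both indices depend on the local configuration.
    with open(path) as f:                          # step 1: load the file as a string
        text = f.read()

    kept = []
    for line in text.splitlines():                 # steps 2-3: keep only wanted processes
        if any(key in line for key in keywords):
            kept.append(line)
        elif not line.strip():
            kept.append('')                        # blank lines mark sample boundaries

    text = re.sub(r'\n\s*\n+', '\n\n', '\n'.join(kept)).strip()   # steps 4-5

    samples = []
    for block in text.split('\n\n'):               # step 7: one block per top snapshot
        mem, cpu = 0.0, 0.0
        for line in block.splitlines():            # step 8: one line per process
            cols = line.split()                    # step 6: column extraction
            mem += float(cols[mem_col]) / 1024.0   # KiB -> MiB
            cpu += float(cols[cpu_col].replace(',', '.'))
        samples.append((mem, cpu))                 # step 9: sum over simultaneous processes
    return samples

# usage (hypothetical keywords): one (memory, CPU) pair per 0.1 s sample
# samples = parse_top_output('out.top', ['sextractor', 'FITSMask'])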
[Plot: memory used (in MiB) versus time (in seconds).]
Figure 5.11: EPR memory usage versus time obtained with the custom profiler (small image). Unlike what is shown in fig. 5.2a, in this case the four peaks corresponding to the tasks execution are clearly visible.
[Plot: memory used (in MiB) versus time (in seconds).]
Figure 5.12: EPR memory usage versus time obtained with the custom profiler (simulated image). Unlike what is shown in fig. 5.3a, in this case the four peaks corresponding to the tasks execution are clearly visible.
The profiling was repeated also for Luigi and Airflow in order to have a
check on the reliability of the method. The results obtained with memory
profiler and with this tool were compared, verifying the consistency of the two.
The plots obtained by means of the custom profiler are shown in figs. 5.13 and
5.14; the results proved to be consistent with those previously obtained.
[Plots of memory used (in MiB) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.13: Luigi RAM memory usage with the custom profiler. The results are consistent with what is shown in figs. 5.4 - 5.6.
[Plots of memory used (in MiB) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.14: Airflow RAM memory usage with the custom profiler. The results are consistent with what is shown in figs. 5.7 - 5.9.
5.4 CPU Usage
For CPU profiling, the approach developed during the memory analysis was
used. As already mentioned, the use of top was a big help also for gathering
data about CPU usage; therefore a Python array of CPU usage was obtained
in the same manner described in subsection 5.3.1. Within the developed
pipeline, each task has to run sequentially. The percentage of CPU load refers
to the 4 cores dedicated to the virtual machine and, because of that, we expect
to see an upper bound of 25% on the load. On the other hand, EPR runs
multiple processes, so loads over 25% may be observed. As expected, the plots
in fig. 5.15 show that the CPU usage of EPR can be as high as 43%, while in
figs. 5.16 and 5.17 the task execution does not exceed the full usage of one
core, except for some isolated peaks. Of course, both Luigi and Airflow can
work in parallel on a multi-core machine if the pipeline and its tasks allow it.
The interesting fact to note is how the resources are used during the idle time,
with a noticeable difference between Airflow and EPR: the former uses
negligible to no CPU resources when no task is executed, while the latter seems
to require a constant CPU usage. The cause of this behavior is the cyclic
polling that EPR needs in order to check the execution status of the tasks.
Luigi, as shown in fig. 5.16, uses as many resources as possible with minimum
idle time. This characteristic makes it the quickest workflow manager among
the three, but it is an aspect to consider carefully when designing and
launching the pipelines, in order not to saturate the resources available on the
system. In all cases we can notice a peak in the CPU usage due to the
initialization of the scheduler and the web server.
[Plots of CPU usage (%) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.15: EPR CPU usage. This workflow manager constantly occupies a certain amount of CPU to perform periodic polling of the Reactor, the Twisted component to which the tasks execution is delegated.
[Plots of CPU usage (%) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.16: Luigi CPU usage. Luigi exploits all the resources available to execute the tasks and has no waiting time.
[Plots of CPU usage (%) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.17: Airflow CPU usage. Airflow exploits all the resources available to execute the tasks and presents some waiting time.
5.5 Error Handling
Error handling describes how the workflow manager behaves in case of failure
during the pipeline execution. An exception can be raised at any point of the
pipeline execution, and it is the workflow manager's duty to handle the situation
and prevent or minimize repercussions on the system. Typically, two kinds of
action can be taken:
1. Abort and cancel the job
2. Abort and reschedule the job
Within the scientific context we are considering, it is rare that a failed
pipeline is no longer needed and is thus cancelled. In fact, the scientific value
of the output is preserved even if the result becomes available some time later,
there being no deadline or real-time purpose. Ideally, then, we want every
triggered pipeline to be completed successfully. For these reasons, we will
analyze the workflow managers' behavior in the case of failure in the specific
instance where it is desirable that the pipeline is rescheduled until its
completion.
In order to simulate a failure, we introduced an error in the second task of
the Euclid-like pipeline developed previously, making it impossible to complete
successfully. The behavior of each workflow manager in this situation was then
observed, noticing, as expected, a stop in the pipeline execution. EPR does not
implement any recovery strategy in case of pipeline failure. This feature is
foreseen in future versions of EDEN; for the moment the continuous deployment
tool guarantees that a stable version of the pipeline is available. In the
scientific context it is often necessary for an expert to validate the output, thus
requiring some sort of human intervention in any case. However, an error
management component brings an important automation that reduces the time
needed to reset the environment and restart the pipeline. EPR notifies the user
by highlighting in red the task in the dataflow graph where the error has
occurred and gives a log message with the traceback of the execution (fig. 5.18).
From this state it is not possible to recover the pipeline execution. The user
then has to fix the issue and re-run the entire pipeline, performing all tasks
again, including those already completed successfully.

Figure 5.18: EPR error notification. The node in red highlights the task where the failure occurred. At the same time the traceback of the execution is shown.
Luigi handles a task crash in a smarter way, keeping the work done
successfully before the failure. As we can see in listing 5.1, Luigi notifies the
user through the terminal and, at the same time, through the Graphical User
Interface (GUI) (figs. 5.19 and 5.20).
===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 ran successfully:
    - 1 DetectionTask(<params>)
* 1 failed:
    - 1 MaskingTask(<params>)
* 2 were left pending, among these:
    * 2 had failed dependencies:
        - 1 DeblendingTask(<params>)
        - 1 PhotometryTask(<params>)

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====
Listing 5.1: Luigi terminal output in case of task failure. In this case a scenario where the second task fails is simulated.

Figure 5.19: Luigi error notification on the task list page. The task where the error occurred is highlighted and a warning on the downstream task is shown.

Figure 5.20: Luigi error notification on the dependency graph page. The task where the error occurred is highlighted (red dot).
In order to recover from this state, it is sufficient to fix the issue and re-run
the pipeline without any other intervention. Luigi automatically checks each
task, from the last to the first, and if the required target already exists for a
specific task, it recovers the pipeline execution from that point. This simple yet
very efficient way of handling errors makes Luigi an interesting choice if the
failure rate is high, and it is very handy in the development phase. During the
test, we simply fixed the error in the second task and triggered the pipeline
again. The execution was a success and the final output was generated as
expected.
For the outputs generated after the recovery, see listing 5.2 and figs. 5.21 and 5.22.

Figure 5.21: Luigi recovered execution on the task list page. All tasks are green, which indicates a successful execution.
===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 complete ones were encountered:
    - 1 DetectionTask(<params>)
* 3 ran successfully:
    - 1 DeblendingTask(<params>)
    - 1 MaskingTask(<params>)
    - 1 PhotometryTask(<params>)

This progress looks :) because there were no
failed tasks or missing dependencies

===== Luigi Execution Summary =====
Listing 5.2: Luigi terminal output after pipeline recovery.

Figure 5.22: Luigi recovered execution on the dependency graph page. All tasks are green, which indicates a successful execution.
Another interesting feature that comes with this type of error handling is
that if a pipeline is incorrectly triggered after a successful execution, Luigi
won’t run any task and instead notifies the user that the wanted output is
already available (see listing 5.3).
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 complete ones were encountered:
    - 1 PhotometryTask(<params>)

Did not run any tasks
This progress looks :) because there were no
failed tasks or missing dependencies

===== Luigi Execution Summary =====
Listing 5.3: Luigi terminal output in case a completed pipeline is executed again.

Figure 5.23: Airflow error notification. Each column represents a pipeline run and each block is a task. The failed task is the red square, while the red circle represents the pipeline that has not been successfully completed.
Airflow takes on an intermediate behavior, requiring some manual
intervention to specify from where to recover the pipeline. All the notifications
go through the web server, and one of the several screens showing the pipeline
execution status can be seen in fig. 5.23.
5.6 Usability
Usability takes into consideration the installation process and configuration,
the pipeline definition and the execution. The setup taken into account is: web
server and scheduler running in the background with the help of a systemd
service, and manual pipeline startup.
5.6.1 EPR Installation and Configuration
The Euclid Pipeline Runner needs these steps in order to be installed and
configured:
1. Deploy to CVMFS or clone the libraries from the repository.
2. Define the configuration file for the systemd service into
/etc/systemd/system/euclid-ial-wfm.service/local.conf.
3. Define the configuration file for the Interface Abstraction Layer (IAL), the
abstraction layer that allows the execution of data processing software in each
SDC independently of the IT infrastructure, into
/etc/euclid-ial/euclid_prs_app.cfg.
4. Define the server configuration file into /etc/euclid-ial/euclid_prs.cfg.
5. Load into the environment the two files /etc/profile.d/euclid.sh
and /etc/euclid-ial/euclid_prs.cfg.
6. Load into the environment the variable
EUCLID_PRS_CFG=/etc/euclid-ial/euclid_prs_app.cfg.
7. Start the service daemon through the bash commands systemctl
enable euclid-ial-wfm and systemctl start euclid-ial-wfm.
For pipeline definition, the EPR needs the tasks to be defined as bash
commands that in turn call a Euclid project module. Two files have to be
created to build the pipeline:
• Package definition: where the tasks are created by wrapping the bash
commands in Executable objects, also defining the inputs, the outputs and the
maximum resources allowed to be allocated for the specific task.
• Pipeline script: where the pipeline and its dependency graph are
constructed by combining the Executable objects outlined in the package
definition.
In this manner every task is standalone and reusable as it is in any pipeline.
Therefore the EPR offers a good modular structure for tasks and pipelines
and a built-in capability for input/output definition and parameter passing.
5.6.2 Luigi Installation and Configuration
Luigi needs only three steps to be ready for use:
1. Install the Python package: pip install luigi.
2. Add luigid.service to /usr/lib/systemd/system defining a stan-
dard systemd service configuration.
3. Start the webserver service daemon through the bash commands systemctl
enable luigid and systemctl start luigid.
Luigi does not distinguish between task definition and pipeline construction:
the dependencies are specified directly inside each task, tying them to a
particular data flow. It is however possible to write the task's logic within
an external function and reuse it in a modular way. Since the dependencies
are explicit inside a task, Luigi offers a built-in system to define inputs
and outputs and to pass parameters throughout the pipeline. It does not
guarantee clean code, because every parameter has to be passed as an input
argument to each task downstream of the one that needs it. However, there are
implementations that compensate for this shortcoming, e.g. SciLuigi
(https://github.com/pharmbio/sciluigi).
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study

More Related Content

Similar to Workflow management solutions: the ESA Euclid case study

Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...
Lorenzo D'Eri
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
Adel Belasker
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
BizuayehuDesalegn
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
Master_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuMaster_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuJiaqi Liu
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
jeevanbasnyat1
 
Stale pointers are the new black - white paper
Stale pointers are the new black - white paperStale pointers are the new black - white paper
Stale pointers are the new black - white paperVincenzo Iozzo
 
Project final report
Project final reportProject final report
Project final report
ALIN BABU
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Nóra Szepes
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbruslettoLars Brusletto
 
Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Pieter Van Zyl
 
phd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_enphd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_enPierre CHATEL
 

Similar to Workflow management solutions: the ESA Euclid case study (20)

Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
 
Report-V1.5_with_comments
Report-V1.5_with_commentsReport-V1.5_with_comments
Report-V1.5_with_comments
 
KHAN_FAHAD_FL14
KHAN_FAHAD_FL14KHAN_FAHAD_FL14
KHAN_FAHAD_FL14
 
Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
thesis
thesisthesis
thesis
 
thesis
thesisthesis
thesis
 
SW605F15_DeployManageGiraf
SW605F15_DeployManageGirafSW605F15_DeployManageGiraf
SW605F15_DeployManageGiraf
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
Master_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuMaster_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_Liu
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
 
Stale pointers are the new black - white paper
Stale pointers are the new black - white paperStale pointers are the new black - white paper
Stale pointers are the new black - white paper
 
Project final report
Project final reportProject final report
Project final report
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbrusletto
 
Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010
 
phd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_enphd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_en
 

Recently uploaded

Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 

Recently uploaded (20)

Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 

  • 4. Introduction Space missions, and scientific research in general, need to handle an ever-increasing amount of data, having recently stepped de facto into big data territory. An infrastructure capable of managing this volume of data is required, including a proper workflow manager. Adopting an external tool, and thus avoiding the need to develop in-house software, can be a time-saving choice that allows more resources to be diverted towards science-related activities. Luigi and Airflow are two workflow managers developed in an industrial context, respectively by Spotify and Airbnb. They are software tools made available to the public by means of an open-source license which, as a result, has allowed them to be constantly maintained and improved by a large community. The aim of this thesis is to test the feasibility of using Luigi or Airflow as workflow manager within a large scientific project such as the Euclid space mission, which is the case study of this work. The test consists in a comparison between the two workflow managers and the one currently used by the Euclid Consortium developers: the Euclid Pipeline Runner (EPR). The comparison is of interest to the scientific community since it provides an overview of the tools already available, showing whether they fit its requirements and offer the necessary features. It is also a valuable opportunity to estimate the overall performance and characteristics a workflow manager can provide, supporting the decision to add some components to the in-house design or even to adopt the external tool completely. In order to capture the behavior of these tools during their operation, it is necessary to
  • 5. define a pipeline. For this reason, the first step of this work was to become familiar with the Euclid development environment, within which, later, four software projects were created following the constraints of the mission framework. Afterwards, a Euclid-like scientific pipeline was built by means of the four projects, obtaining a first pipeline running thanks to EPR. Two other pipelines were then built, one for Luigi and one for Airflow, in order to obtain all the necessary elements for the comparison. The comparison between Luigi, Airflow and EPR was performed by means of ten metrics, chosen based on the main needs expressed by the developers of the Euclid environment: • Execution time • RAM memory usage • CPU usage • Error handling • Usability and configuration complexity • Distributed computing • Workflow visualization • Integration in a framework • Triggering system • Logging quality The results obtained seem encouraging and suggest that the external workflow managers can actually be used within a space mission environment, bringing some performance improvement and offering several additional features compared to EPR. This work is structured as follows: in Chapter 1 an introduction to workflow managers will be given, explaining the need to use them to manage large amounts of data and the role they play in the scientific field. In Chapter 2 an overview of the Euclid mission will be given, touching on its scientific objectives and the organizational structure behind the part of the mission that handles the scientific data. Chapter 3 will be dedicated to describing the development environment of the mission and
  • 6. the development phases of the software realized for this work. In Chapter 4 Luigi and Airflow, the external workflow managers, will be introduced, along with their main features and characteristics. Finally, Chapter 5 will focus on the description of the comparison metrics and the results obtained.
  • 7. Introduzione (Italian) Le missioni spaziali, e in generale la ricerca scientifica, devono gestire una quantità sempre maggiore di dati, entrando di recente nel territorio dei big data. La gestione di un tale volume di dati presenta delle sfide che devono essere affrontate con le giuste infrastrutture e disponendo degli strumenti software appropriati, come, ad esempio, un workflow manager. Adottare uno strumento esterno, evitando quindi di dover sviluppare in casa il software, può rappresentare una scelta in grado di far risparmiare tempo che può essere dedicato di conseguenza alle attività prettamente scientifiche. Luigi e Airflow sono due workflow manager sviluppati in un contesto industriale, rispettivamente da Spotify e Airbnb. Essi sono degli strumenti software resi disponibili al pubblico per mezzo di una licenza open-source che, di conseguenza, ha permesso loro di essere mantenuti e migliorati da un'ampia comunità. Lo scopo di questa tesi è quello di valutare la possibilità di utilizzare Luigi o Airflow come workflow manager all'interno di un grande progetto scientifico come la missione spaziale Euclid, caso di studio per questo lavoro. L'analisi consiste nel confrontare i due workflow manager con quello attualmente utilizzato dagli sviluppatori dell'Euclid Consortium: l'Euclid Pipeline Runner (EPR). Questo confronto risulta interessante per la comunità scientifica poiché permette di ottenere una panoramica degli strumenti già disponibili, stabilire se essi rientrano nei requisiti del sistema ed offrono le caratteristiche necessarie. Inoltre, risulta una preziosa opportunità per valutare le caratteristiche e le prestazioni complessive che un workflow manager può offrire,
  • 8. fornendo dei dati a supporto della decisione di adottare uno strumento esterno o aggiungere alcune componenti software al proprio progetto. Per ottenere dei dati riguardo il comportamento di questi strumenti durante la loro esecuzione, è necessario definire una pipeline. Quindi, il primo passo di questo lavoro è stato quello di acquisire confidenza con l'ambiente di sviluppo di Euclid, all'interno del quale sono stati successivamente creati quattro progetti software rispettando i vincoli del framework della missione. Tali progetti sono stati poi combinati in cascata, ottenendo una prima pipeline Euclid-like eseguibile con EPR. In seguito sono state costruite altre due pipeline, una per Luigi e una per Airflow, così da disporre di tutti gli elementi necessari per effettuare il confronto. Il confronto tra Luigi, Airflow e EPR è stato condotto per mezzo di dieci metriche, scelte in base alle principali necessità espresse dagli sviluppatori dell'ambiente euclideo: • Tempo di esecuzione • Utilizzo della memoria RAM • Utilizzo della CPU • Gestione degli errori • Complessità di utilizzo e configurazione • Capacità di operare in modalità distribuita • Visualizzazione del flusso di lavoro • Integrazione in un framework • Capacità di avvio automatico delle esecuzioni • Qualità dei log generati I risultati ottenuti sembrano incoraggianti e suggeriscono che i workflow manager esterni possano essere di fatto utilizzati all'interno dell'ambiente di sviluppo di una missione spaziale, portando qualche miglioramento nelle prestazioni e offrendo alcune caratteristiche aggiuntive rispetto ad EPR. Questo lavoro è strutturato nel modo seguente: nel Capitolo 1 verranno introdotti i workflow manager, spiegando le necessità che spingono al loro
  • 9. utilizzo e il ruolo che essi possiedono nel campo scientifico. Nel Capitolo 2 sarà proposta una panoramica della missione Euclid, descrivendo i suoi obiettivi scientifici e la struttura organizzativa che si occupa della gestione dei dati scientifici. Il Capitolo 3 sarà dedicato all'ambiente di sviluppo della missione e al software sviluppato durante questo lavoro. Nel Capitolo 4 verranno introdotti i workflow manager esterni, Luigi e Airflow, descrivendo le loro principali caratteristiche. Infine, il Capitolo 5 sarà dedicato alla definizione delle metriche di confronto e all'esposizione dei risultati ottenuti.
  • 10. Chapter 1 Workflow Management Data is a new, extremely valuable resource and is collected at an increasing pace, but the true value is represented by the information enclosed inside it. For this reason, a wide range of new tools for big data processing has been developed in recent years. A subset of these tools are called workflow managers and they are in charge of coordinating the data processing steps. Automation capability and fault tolerance are the main required features, characteristics to be implemented in a distributed system. In the scientific community as well we can observe an increasing adoption of large-volume data acquisition [1]. This also applies to the field of astronomical research, the area in which this work has been carried out. In fact, the rapid evolution in computer technology and processing power boosted the design of more and more complex surveys and simulations. For instance, it was gradually possible to add extra dimensions to the data collected, such as time or a third spatial dimension [2]. This extra dimension can be obtained by repeating views of the same object in order to spot transient phenomena, or can be a 3D scan, mapping the sky along the depth axis, as Euclid will be able to do. As has often happened in the history of modern astronomy, also in this
  • 11. decade we are witnessing an explosion in the volume of the datasets used for astronomy, and the amount of data has increased by an order of magnitude compared to just a few years ago. Statistical analysis is now essential to make new discoveries obtained thanks to the correlation of a large volume of data, impossible to process with legacy methods, where often no information system was used at all. 1.1 The Big Data Era With the increasing demand for more and more information to improve the accuracy of scientific research, the world of astronomy has faced data management problems that the industrial world has already begun to solve. This new type of data is part of the phenomenon called big data, although a precise definition of this entity has not yet been established. Big data are identified through their characteristics, among which five are widely accepted: Volume, Velocity, Variety, Veracity and Value, the 5 Vs of big data. • Volume: refers to the amount of data collected, which can no longer be stored in a single node; a complete system must be set up for the correct and efficient management of the data. Furthermore, a structure in the data is no longer guaranteed and relational databases1 are no longer the best choice, resulting in an increase in database management systems that are no longer relational but which imitate a hash table structure. • Variety: represents the lack of homogeneity of the collected data, coming from different sources, unstructured or semi-structured. These types of data need a more intelligent processing chain that can adapt to the case. 1 A relational database stores data in tables consisting of columns and rows. Each column stores a type of data. Data in a table is related with a key, one or more columns that uniquely identify a row within the table.
  • 12. • Velocity: refers to the volume generated per unit of time and also the rate at which data must then be processed and made available. Without an appropriate distributed infrastructure it would be impossible to carry out such a difficult task. • Veracity: represents the guarantee that the data are consistent, reliable and authentic. • Value: refers to an added value that would not be possible to obtain without the use of data with the previous characteristics. More data implies more accurate analysis and more reliable results. 1.1.1 The Arise of e-Science Whatever the final purpose of the research, from exploring the extremely small to mapping the vastness of the visible Universe, the aspects of data management, analysis and distribution are increasingly predominant within scientific experimentation. The science that produces large amounts of data that effectively possess the characteristics of big data is called e-Science, a term coined in the UK in 1999 by John Taylor, then general director of the Office of Science and Technology, who faithfully anticipated the direction of technological development that the scientific field would undertake from that moment on. e-Science is therefore the technological face of modern science, which produces and consumes large amounts of data and, for this reason, must be supported by an adequate infrastructure for storing, distributing and accessing the collected data. This infrastructure is often called Scientific Data e-Infrastructure (SDI). Meanwhile, the term Cyberinfrastructure was created in the United States, which describes the same information and infrastructural needs that e-Science implies [3, 4]. This shows how the phenomenon that developed at the end of the 1990s and early 2000s was in fact involving a large part of the scientific community. From that moment on, the demand for systems with improving performance and capable of handling an ever-increasing data volume has become progressively
  • 13. more important. Adopting the Big Data paradigm for science was possible thanks to the change in mentality started with e-Science, maturing in an improvement of the scientific instruments capable of collecting huge volumes of data and of the SDI infrastructure capable of distributing them appropriately [3]. 1.2 Workflow Manager New software tools, new architectures and new programming paradigms have been developed for the management and processing of the large amounts of data produced in response to new scientific and industrial needs. A subset of the tools developed in this ecosystem are the workflow managers, employed to build systems capable of working in a distributed environment and robust enough to carry out their task without causing a complete stop of the system in case of partial failure. A workflow manager is a software tool that helps to define the dependencies among a set of software modules or tasks. We can identify two main jobs the workflow manager has to accomplish: dependency resolution and task scheduling. The dependency resolution is essential to schedule the tasks in the right order and make sure every module is run if and only if all its dependencies are completed successfully. The scheduler has to decide when each task should be executed in order to optimize the usage of the available resources. 1.2.1 Data Pipeline A data pipeline is a concatenation of tasks, where, generally speaking, the execution result of a module becomes the input for the next one. In this way, the modules can be developed independently and in a modular fashion, where the only requirement to meet is the interface defined between the two. This interface can be as simple as the information about the type of file and
  • 14. its location in the file-system. Two distinct approaches can be identified in defining the pipelines and their workflows: the business one and the scientific one. • A typical business workflow has features such as efficiency in execution, independence between different workflows and human involvement in the process. Another characteristic of this approach is that a pipeline is defined through a control-flow, i.e. the dependencies between tasks are based on their status. For example, if a task X depends on task Y, X is not executed until Y is in the completed state. Finally, the data are typically not streams, and pipeline execution is not continuous but on demand, when there is a need. • A scientific workflow has the task of producing outputs that are the result of experimentation, the instances of different workflows are to some degree correlated with each other, and automation is exploited as much as possible. Although automation is important, so is the possibility to have access to intermediate results that an expert can validate on the fly. The pipeline is focused on the data-flow and no longer on the control-flow, i.e. a task is not executed until its input is available. This approach is therefore called data-driven. The data flow is described by a Directed Acyclic Graph (DAG) where each node represents a task and the graph's topological ordering defines the dependencies. Data is often a continuous flow and all the tasks in a pipeline are usually working on different data at the same time. 1.3 Towards Scientific Workflow Managers For the reasons mentioned previously in this chapter, which include the production of big data in the scientific field, there has been an increase in the use of workflow managers in science, which is no longer conducted by the individual but has become a joint effort of many organizations and national
  • 15. institutes. A scientific workflow manager must be the means that allows a research team to obtain its results and therefore must be as transparent as possible to the user. Among its objectives we can find [4, 5]: • Description of complex scientific procedures, hence reuse of workflows, along with modularity in task construction, becomes important. • Automatic data processing according to the desired algorithms, and the possibility to inspect intermediate results. • Provide high performance computation capabilities with the help of an appropriate infrastructure. • Reduce the amount of time researchers spend working on the tools, allowing them to spend more time conducting research. • Decrease machine time, optimizing software execution instead of increasing physical resources. To move towards these objectives, a huge number of new tools has been developed within the scientific field, including programming languages and whole systems. For example, more than a hundred different custom pipeline managers have appeared in a short time, making it difficult to port systems and code, leading to a lack of result reproducibility [6]. This situation does not help scientific discoveries, which sometimes lose the ability to be tested by independent parties. A solution could be to identify a standard to adopt, or to make the tools developed free and easy to use. A step in the right direction, perhaps, could be to use more generic systems able to satisfy specific needs. One sector that needs generic tools is the industrial one, which has in fact produced highly efficient and easy to use systems. These tools are often not adopted by the scientific community, which relies on products developed in-house. The result is that technologies with a similar scope are created independently by science and industry, thus missing an opportunity to share their capabilities and resources.
  • 16. In this work we want to take the European Space Agency (ESA) Euclid Space Mission [7] and its workflow manager as a case study and verify, through an analysis of ten metrics, whether two popular tools for pipeline management developed in the industrial world can be adopted within a space mission. Contrary to what happens with scientific workflow managers, where the purpose is to solve a specific instance of a certain problem, in the field of IT companies and startups the tendency is to move towards the design of software that is as generic as possible, flexible and adaptable to different scenarios.
  • 17. Chapter 2 Euclid Mission The ESA Euclid mission is a medium class (M-class) mission, part of the Cosmic Vision program. Its main objective is to gather new insights about the true nature of dark matter and dark energy. In this chapter we will explore the main mission characteristics and the basic science behind the measurement of the Universe. 2.1 Mission objective Thanks to the ESA Planck mission1 , researchers confirmed that many questions about the nature of our Universe are currently open. As we see in fig. 2.1, only 4.9% of the matter we are surrounded by is Baryonic Matter, i.e. what is commonly addressed as ordinary matter, such as all the atoms made of protons, neutrons and electrons2 . Another 26.8% is Dark Matter (DM), a component with high mass density that interacts with itself and other matter only gravitationally. Moreover, it doesn't interact with the 1 http://sci.esa.int/planck 2 Although the electron is a lepton, in astronomy it is included as part of the baryonic matter because its mass, with respect to the mass of the proton and the neutron, is negligible.
  • 18. Figure 2.1: Estimated composition of our Universe. Source: http://sci.esa.int/planck. electromagnetic force, making it transparent to the electromagnetic spectrum and really hard to spot. The remaining 68.3% is what is called Dark Energy (DE), for its unknown nature [8]. This component is linked with the accelerated expansion of the Universe, but currently there is no direct evidence for its actual existence. Different models have been developed to explain the nature of the effects observed and currently attributable to dark energy. In order to gather more data that could bring more insights about the nature of dark matter and dark energy, the approach chosen is to observe and analyze two cosmological probes: the Weak Lensing (WL) and the Baryonic Acoustic Oscillation (BAO). The WL effect is caused by a mass concentration that deflects the path of light traveling towards the observer. This effect is detectable only by measuring some statistical and morphological properties of a large number of light sources. Euclid is expected to image about 1.5 billion galaxies, capturing useful data to study the correlation between their shapes, mapping with high precision the expansion and growth history of the Universe [7]. A BAO is a density variation in the baryonic matter caused by a pressure wave formed in the primordial plasma of the universe. Measuring the BAO of the same source at different redshifts allows one to estimate
  • 19. the expansion rate of the universe. 2.1.1 Spacecraft and Instrumentation The Euclid space telescope is a 1.2 m Korsch architecture with a 24.5 m focal length. The spacecraft carries two main instruments that will generate all the data for the mission. They are both electromagnetic spectrum sensors, one specialized for photometry in the visible wavelengths and the other for infrared spectroscopy. • VIS: or VISible instrument, will be used to acquire images in the visible range of the electromagnetic spectrum (550-900 nm). It is made of 36 CCDs, each counting 4069x4132 pixels (see fig. 2.3a). The weak lensing effect will be measured through the data obtained thanks to this instrument. • NISP: or Near Infrared Spectrometer and Photometer, has two components: the near infrared spectrometer, operating in the 1100-2000 nm range, and the near infrared imaging photometer, working in the Y (920-1146 nm), J (1146-1372 nm) and H (1372-2000 nm) bands. It is composed of 16 detectors, each counting 2040x2040 pixels (see fig. 2.3b). The main purpose of this instrument is to measure the BAO at different redshifts. The two instruments will have about the same field of view, 0.54 and 0.53 deg2 respectively, but VIS will offer a much greater resolution. Euclid has as a requirement to perform a wide survey of at least 15,000 deg2 of sky, possibly reaching 20,000 deg2 . In combination with this, another two deep surveys of 20 deg2 each are planned [7].
  • 20. Figure 2.2: Schematic figure of the Thales Alenia Space's concept of the Euclid spacecraft. Source: http://sci.esa.int/euclid.
  • 21. (a) One of the 36 CCDs that will compose the VIS instrument. Source: http://sci.esa.int/euclid. Copyright: e2v. (b) One of the 16 CCDs that will compose the NISP instrument. Source: http://sci.esa.int/euclid. Copyright: CPPM. Figure 2.3: Actual flight hardware for the Euclid spacecraft.
  • 22. 2.2 Ground Segment Organization As in almost every space mission, the Euclid mission has its own space system made of three segments: • Space Segment: which includes the spacecraft along with the communication system. • Launch Segment: which is used to transport space segment elements to space. • Ground Segment: which is in charge of spacecraft operations management and payload data distribution and analysis. Inside the Euclid ground segment we can distinguish two parts: the Operations Ground Segment (OGS) and the Science Ground Segment (SGS), the latter managed in collaboration between ESA and the Euclid Mission Consortium3 (EMC). Within the SGS, we can further identify three components: • Science Operation Center (SOC): which is in charge of the spacecraft management and the execution of planned surveys. • Instrument Operation Teams (IOTs): which are responsible for instrument calibration and quality control on calibrated data. • Science Data Centers (SDCs): which are in charge of performing the data processing and the delivery of science-ready data products. Moreover, they have the job of data validation and quality control. The amount of data expected from the mission, considering only the scientific data, is roughly 100 GB per day, with a total of 30 PB for the entire mission. After all processing steps needed in order to obtain the science 3 https://www.euclid-ec.org
  • 23. Figure 2.4: Data processing organization and responsibilities. All science-driven data processing is performed by the nine SDCs. Source: Romelli et al. [8], ADASS XXVIII conference proceedings (in press). product, the EMC predicts about 100 PB of data to handle [9]. For this reason, it is one of the first times a space mission has begun a path of modernization towards an IT infrastructure capable of handling big data. Euclid adopts a distributed architecture for storing and processing data. Referring to Figure 2.4, we see how data processing and management take place at nine sites in different countries. Each site is an SDC that manages a part of the data processing steps. The Organization Units (OUs) are working groups specialized in different aspects of the scientific data reduction and analysis. Each SDC supports one or more OUs, a relation represented in the figure with a continuous line. A dashed line represents instead a deputy support. The data products generated throughout the mission are categorized in five processing levels:
  • 24. • Level 1: data, as well as the associated telemetry, are unpacked and decompressed. • Level 2: data are submitted to a first processing phase that includes a calibration step and a restoration one, where artifacts due to the instrumentation are removed. • Level 3: data are in a form suitable for their scientific analysis and, because of this, they are called science-ready data. • Level E: external data coming from other missions or other projects. Before being included in the processing cycle, they must be euclidised to be consistent with the rest of the data. • Level S: simulated data, useful in the period before the mission for testing, validating and calibrating the systems developed for processing the data. In Figure 2.5, a diagram of how the data will be processed is shown. After downloading, the data are routed to the SOC, which becomes the point from which they are distributed. In the transformation of data from level 1 to level 2, in addition to the data of VIS and NISP (distinguished in NIR for the photometric part and SIR for the spectroscopic one), external data are added, coming from the so-called level E. These data come from instruments of other missions and this reflects the typical e-Science workflow. Equally emblematic is the insertion of level S data, which are artificially generated through simulations and have the purpose of testing the system before the data coming from the satellite are available.
  • 25. Figure 2.5: Simplified data analysis flow. After downloading the data from the satellite, they are distributed to the SDCs via the SOC. The level S and E data, characteristic traits of e-Science, can be spotted. Source: Dubath et al. [9]. 2.3 SDC-IT The Italian Science Data Center (SDC-IT) is one of the nine SDCs involved in the mission. It is located at the Astronomical Observatory of Trieste4 (OATs) and plays the role of both primary and auxiliary reference for some OUs [8]. The author has carried out this work at the SDC-IT, which has made available its structure and its expertise, along with the software tools used within the space mission. Euclid has its own development environment called the Euclid Development ENvironment (EDEN), which will be used at each stage of this thesis. In Chapter 3 it will be described in detail, as well as the software produced by the author. 4 http://www.oats.inaf.it
  • 26. Chapter 3 Software Development As the first part of this thesis work, three main activities were carried out: the first was to become familiar with the development environment used in the context of the mission. The second activity was to implement three software modules compliant with the Euclid rules that could be the foundation for an elementary scientific elaboration of astronomical images. Finally, the third activity involved the creation of a pipeline wrapping the modules developed in the second phase. The result was a Euclid-like scientific pipeline running thanks to the workflow manager Euclid Pipeline Runner (EPR), designed by the EMC. The development environment used was the default one provided for the mission and it will be briefly illustrated in this chapter. Subsequently, the programs used and the code produced will be extensively described. All the software was written in Python 3, as prescribed by the official Euclid coding rules. 3.1 Development Environment The success of the Euclid mission depends on the collaboration of dozens of public and private entities, involving hundreds of people. To ensure all
  • 27. Feature | Name | Version
Operating System | CentOS | 7.3
C++ 11 compiler | gcc-c++ | 4.8.5
Python 3 interpreter | Python | 3.6.2
Framework | Elements | 5.2.2
Version Control System | Git | 2.7.4
Table 3.1: List of the main features in EDEN 2.0. All tools and respective versions are set for each environment. Source: Romelli et al. [8], ADASS XXVIII conference proceedings (in press). the software modules developed can in fact run together, the Consortium defined a common environment and set of rules. The environment, called Euclid Development ENvironment (EDEN), is a collection of frameworks and software packages that encloses all the tools available to the developers. As a distributed file system, the CernVM File System1 (CVMFS), developed at CERN as part of its infrastructure, is used. EDEN is locally available to the developers through LODEEN (LOcal DEvelopment ENvironment), a virtual machine based on Scientific Linux CentOS 7. The latest stable version, 2.0, is used in this work. In Table 3.1 the main EDEN 2.0 features are listed. LODEEN is a local replica of the EDEN environment for Euclid developers embedded in a virtual machine. It runs on the Scientific Linux CentOS 7 operating system with the Mate desktop environment. Version 2.0 is used in this work. CODEEN is a Jenkins-based2 system in charge of performing the continuous integration for all the source code developed inside the Euclid mission. Jenkins is an open source automation tool that helps to perform building, testing, delivery and deployment of software [10]. 1 https://cernvm.cern.ch/portal/filesystem 2 https://jenkins.io
  • 28. 3.2 Elements As one of the fundamental components of EDEN we can find Elements, a framework that provides both CMake facilities and Python/C++ utilities. CMake3 is an open-source tool for building, testing and packaging software. Elements is derived from the CERN Gaudi Project4 [9], another open source project that helps to build frameworks in the domain of event data processing applications. Every Elements project must follow a well defined structure in order to be used inside the mission environment. Every project has to be placed inside a default folder in the Linux file-system: /home/user/Work/Projects. This provides a shared common location in the whole environment. Consequently, the name of each developed project has to be unique. The projects created by the author implement this required structure. Environment variables Elements needs a few environment variables in order to build and install the projects. These come predefined in LODEEN but they deserve a quick overview. The first variable we will see is BINARY_TAG, which contains the information for the build. It is composed of four parts separated by a dash: 1. Architecture's instruction set 2. Name and version number of the distribution 3. Name and version of the compiler 4. Type of build configuration There are six types of build configurations; the default value is o2g and represents a standard build. For the specific case of this work, BINARY_TAG=
  • 29. x86_64-co7-gcc48-o2g. The variable $CMAKE_PREFIX_PATH points to the newest version of the Elements CMake library. In this case it equals /usr/share/EuclidEnv/cmake:/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/usr. The environment variable CMAKE_PROJECT_PATH contains the location of the projects. In this case the value is /home/user/Work/Projects:/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/opt/euclid, where the first path indicates the user area for development and the second the system-installed projects. At build time a project resolves its dependencies through this variable. Project Structure An Elements project has to be organized in a well defined structure:
ProjectName
    CMakeLists.txt
    Makefile
    ModuleName1
    ModuleName2
    ...
Once the project is built by means of the CMake command make, a build.${BINARY_TAG} folder is generated by the framework. Afterwards, an installation phase is required to execute the program and let other projects use it as a dependency. The structure then becomes:
ProjectName
    CMakeLists.txt
    Makefile
    ModuleName1
    ModuleName2
    ...
    build.${BINARY_TAG}
    InstallArea
Module Structure An Elements module, referred to in the project structure as ModuleName1 or ModuleName2, is a reusable unit of software and can be made of both C++ and Python code. It is thought of as an independent unit
  • 30. that can be placed in any project. The structure is:
ModuleName
    CMakeLists.txt
    ModuleName
    src
    python
    script
    auxdir
        ModuleName
    conf
        ModuleName
    test
Finally, an important part of Elements is the E-Run command, thanks to which projects can be executed within the framework. 3.3 Image Processing After the astronomical images are acquired, properly merged and cleaned, it is possible to extract from them an object catalog. These catalogs are then used to perform the analysis. In this context an object is typically a light source, such as a star, a galaxy or a galaxy cluster. The extraction phase can be simplified as a process that involves three main steps: 1. Detection: The detection phase is a process generally referred to as segmentation. From an astronomical image, the light sources are identified and extracted from the background. Segmentation of nontrivial images is one of the most difficult tasks in image processing [11] and represents a critical phase of the scientific processing in order to correctly identify the sources and obtain accurate results. 2. Deblending: Often light sources are not clearly separated from each other and at first it is not possible to distinguish the single
  • 31. object. It is then necessary to run an additional step of deblending, a procedure for splitting highly overlapped sources. 3. Photometry: The final step for obtaining a source catalog is the measurement of the luminous flux of the sources. This is done by integrating the gray level value of the pixels labeled as one object. Furthermore, it is possible to apply several masks in order to obtain more accurate measures. After the execution of these steps, the output generated is a catalog, some optional check images and an image highlighting the objects detected, as shown, for instance, in fig. 3.5c. 3.4 Multi-Software Catalog Extractor The author developed three Elements projects able to perform some basic image processing. The whole system is meant to represent a prototype of a Euclid-like project and it was essential to gain confidence with the environment. Furthermore, these projects were the building blocks of the Euclid-like pipeline, used to test the workflow managers against the chosen metrics. The software was designed to follow the Object Oriented Programming (OOP) paradigm and to be easily adaptable to new packages or tools to perform detection, deblending and photometry. As a whole, the software consists of three main Elements projects and one supporting project with utilities. They are: • PT_MuSoDetection, • PT_MuSoDeblending, • PT_MuSoPhotometry, • PT_MuSoUtils.
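To give a concrete idea of this design, the following minimal sketch shows how an abstract Program class with an abstract run method can be specialized to wrap a legacy tool through subprocess. The Program and run names follow the description in the text, while SExtractorProgram, its constructor arguments and the command-line layout are simplified, hypothetical stand-ins and not the actual code of the PT_MuSo projects.

import subprocess
from abc import ABC, abstractmethod


class Program(ABC):
    """Common interface for any external image processing tool."""

    @abstractmethod
    def run(self, *args, **kwargs):
        """Execute the tool and produce its outputs."""


class SExtractorProgram(Program):
    """Wraps a tool with no Python bindings by spawning a new process."""

    def __init__(self, command):
        # 'command' is the terminal command of the legacy tool,
        # e.g. the one read from the JSON configuration described later.
        self.command = command

    def run(self, image_path, config_args=None):
        # Build the command line: tool name, input image and any
        # extra configuration parameters, then run it as a subprocess.
        cmd = [self.command, image_path] + (config_args or [])
        return subprocess.run(cmd, check=True)

With this layout, supporting a different extraction tool only requires a new subclass implementing run, which is exactly the extension point described below.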
  • 32. Indeed, in order to use another image processing tool, it is sufficient to extend the Python class Program located in the PT_MuSoUtils project, at the path PT_MuSoCore/python/PT_MuSoCore/Program.py, and implement the abstract method run with the proper logic. Each software module developed for the catalog extraction uses SExtractor5 as an image processing tool. This software, created by Emmanuel Bertin, has the main purpose of extracting an object catalog from astronomical images [12] through the execution of detection, deblending and photometry. It is part of the set of legacy software, which, in this context, indicates external software that is officially integrated within EDEN. Since SExtractor isn't available in Python, the subprocess module was used in order to create a new process that calls the software through a bash command. The final code is in the SExtractor class. 3.4.1 Developing and Versioning Methods Git6 was used as the version control system, following as much as possible a clean and easy to understand development workflow. It was chosen to follow the Gitflow Workflow, defined for the first time by Vincent Driessen (see fig. 3.1) and later adopted as part of the Atlassian guide7 . This type of workflow involves a main branch, by convention called master, which includes all the commits that represent an incremental version of the software. In fact, no developer can work directly on the master, which can only be merged with other support branches, in particular a release (for last tweaks and version number changes) or a hotfix (used to fix critical issues in the production software). When changes are merged into the master, by definition that becomes a new product release [13]. As regards the versioning system, semantic versioning was applied to give a standard enumeration for each project version. A version number is made of three numbers which in turn represent a major release, a minor release and a patch. Each number has to
  • 33. Figure 3.1: A successful git branching model. This workflow was followed during the Elements projects development. It favors a clear development path that facilitates the creation of a software product, especially within large teams. However, it remains a good pattern to follow even during autonomous development. Source: Git Branching - Branching Workflows [14].
  • 34. be incremented according to the following scheme [15]: • Major version: when you make incompatible API changes. • Minor version: when you add functionality in a backwards-compatible manner. • Patch version: when you make backwards-compatible bug fixes. The second branch that is always present in a Gitflow repository is develop. Changes that are believed to be stable and ready for a future release are added to it. In addition to the release and hotfix branches, a branch of type feature can be used to develop new features that, once completed and tested, are added to the develop branch. 3.4.2 Detection The detection phase occurs by thresholding. Before proceeding, however, a filtering step is necessary to smooth the noise that would otherwise cause false positives: a Gaussian filter with a standard deviation calibrated for the particular case is used. Moreover, in astronomical images there is often a non-uniform background that could generate ambiguities when the thresholding is applied, especially in the most crowded areas. For this reason, as a first step towards segmentation, a background map is generated that estimates the light outside the objects to be detected [16]. This map is then subtracted from the image. As output of the detection we obtain an image partitioned into N regions, the first of which, marked by pixels with value 0, is the background, while the remaining N-1 regions are the extracted objects and are labeled with values from 1 to N-1.
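As an illustration of these steps (Gaussian smoothing, background subtraction, thresholding and labeling), here is a minimal, generic sketch using numpy and scipy. It is only meant to clarify the idea: the actual detection in this work is performed by SExtractor, and the function name, the absolute threshold and the fixed sigma are simplifying assumptions.

import numpy as np
from scipy import ndimage


def detect_sources(image, background, sigma=2.0, threshold=5.0):
    """Toy detection: smooth, subtract background, threshold, label."""
    # Subtract the background map and smooth with a Gaussian filter
    # to suppress pixel noise that would produce false positives.
    smoothed = ndimage.gaussian_filter(image - background, sigma=sigma)
    # Keep only pixels significantly above the residual background.
    mask = smoothed > threshold
    # Connected groups of selected pixels become one labeled region each;
    # label 0 marks the background, labels 1..n_objects the sources.
    segmentation_map, n_objects = ndimage.label(mask)
    return segmentation_map, n_objects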
  • 35. Figure 3.2: Multithreshold deblending. This technique aims to deblend overlapping sources that have been detected as a single object. Source: Bertin [16]. 3.4.3 Deblending After detection, the segmented image is subjected to a filtering phase that aims to identify distinct overlapping objects that the first thresholding step has erroneously recognized as a single object. During this phase, composite objects are deblended using a multithreshold hierarchical method. A hint of how the algorithm works can be seen in fig. 3.2. 3.4.4 Photometry The purpose of this last phase is to measure the luminous flux belonging to each object identified after the deblending phase. For each set of pixels labeled with the same number, the sum of the pixel values is calculated to estimate the luminous flux. Some masks, called apertures, can be applied around objects to obtain different types of measurements (see fig. 3.3).
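The per-label flux integration just described can be sketched in a few lines of numpy/scipy. This is only an illustration of the principle: the measurements in this work are produced by SExtractor, which also applies the apertures shown in fig. 3.3, and measure_fluxes is a hypothetical helper name.

import numpy as np
from scipy import ndimage


def measure_fluxes(image, segmentation_map):
    """Toy photometry: sum the pixel values of each labeled object."""
    labels = np.arange(1, segmentation_map.max() + 1)  # skip label 0 (background)
    # ndimage.sum integrates the image values over each labeled region,
    # giving a rough estimate of the luminous flux of every source.
    fluxes = ndimage.sum(image, labels=segmentation_map, index=labels)
    return dict(zip(labels.tolist(), np.atleast_1d(fluxes).tolist()))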
  • 36. Figure 3.3: Four different types of apertures available in SExtractor for the photometric measurement. Source: Holwerda [17]. 3.5 Utils Module The PT_MuSoUtils module contains five important packages: • PT_MuSoCore, • IOUtils, • SkySim, • FITSUtils, • DataModel. Since each pipeline task can use a different program for performing its processing step, a system for passing the parameters independently of the rest of the code has been created. The configuration parameters can be specified in a JSON8 file that a parser developed for the occasion can read. 8 https://www.json.org
  • 37. The language was chosen for its human readable (and writable) structure and it is usually very easy to handle inside the code, with great support in Python (as in many other programming languages). Any program can have its own JSON file named <program>.json located in the path PT_MuSoUtils/PT_MuSoCore/auxdir/config. The structure of the file follows this base scheme:
{
    "program": {
        "full_name": <program_name>,
        "command_name": <cmd_key>
    },
    "configurations": {
        <configuration_type>: {
            <configuration_unit>: {
                <parameter_name>: <parameter_value>
            }
        }
    }
}
The configuration file is a JSON object with 2 names: program and configurations. The value of program is a nested JSON object that expects two keys with string values: • full_name: indicates the name of the program and is useful for logging purposes. • command_name: indicates the terminal command to call if the program is executed through subprocess. The value of configurations is a nested JSON object that accepts any number of items representing a configuration type. Each configuration type is a nested JSON object that accepts any number of configuration units. Each configuration unit in turn accepts any number of key/string items representing an input parameter for the application.
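A file with this scheme can be consumed with the standard json module. The sketch below shows the basic idea of turning one configuration unit into command-line arguments for the wrapped program; it is a hypothetical, simplified reader (load_program_config is not the real parser), and it deliberately skips the inherit mechanism and the {auxdir} constant described next.

import json


def load_program_config(path, conf_type, conf_unit):
    """Illustrative reader for a <program>.json file following the scheme above."""
    with open(path) as fp:
        config = json.load(fp)

    command = config["program"]["command_name"]
    params = config["configurations"][conf_type][conf_unit]

    # Flatten the key/value parameters into command-line arguments,
    # e.g. {"-CATALOG_TYPE": "ASCII_HEAD"} -> ["-CATALOG_TYPE", "ASCII_HEAD"].
    args = [command]
    for key, value in params.items():
        if key == "inherit":
            continue  # inheritance resolution is omitted in this sketch
        args.extend([key, value])
    return args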
  • 38. • configuration type: is a collection of configuration units, and it is intended to group all the parameters needed for a certain task, e.g. the detection or the deblending. • configuration unit: is a collection of parameters that can be used as a modular unit or a preset for some particular instance of the configuration type, e.g. the detection of light sources in a crowded region of the sky. A configuration unit can be called modular because other configuration units, within the same file, can inherit its parameters without the need to rewrite them, in an OOP fashion. In order to do so, it is sufficient to add the key inherit, which expects a JSON array as value. The items of the array are the keys of the configuration units from which to inherit the parameters. If the configuration unit conf_unit_1 in the configuration type conf_type_1 wants to inherit the configuration unit conf_unit_2 contained in the configuration type conf_type_2, then the value of the key inherit has to be "conf_type_2.conf_unit_2". For example:
"configurations": {
    "base": {
        "catalog_ascii": {
            "-CATALOG_TYPE": "ASCII_HEAD"
        }
    },
    "detection": {
        "background_checkimages": {
            "inherit": [
                "base.catalog_ascii"
            ],
            ...
        }
    }
}
If two parameters have the same key, only the last value encountered by the parser is preserved. Each type of configuration can be called by a different
  • 39. project and some of them require loading other external files from their own auxiliary directory. For this reason, a constant auxdir can be defined inside a configuration type and the parser will replace at run-time any sub-string matching "{auxdir}" with the value of the constant. 3.5.1 The Masking Task Since the deblending task is a refinement phase with respect to the detection, it needs to know the segmented image, but it cannot attempt to deblend an object if that object is made up of pixels that all have the same value and is no longer representative of the original source. For this reason, as input of the deblending, we must put the original image masked with the result of the detection. Pixels that have been labeled as background are brought to a value of -1000 in order not to interfere with the thresholding, while the remaining pixels are left unchanged. For this purpose an additional Python module was developed, called FITSMask. 3.5.2 FITS Images The Flexible Image Transport System (FITS) format was created for sharing astronomical images among observatories. The control authority for the format is the International Astronomical Union - FITS Working Group9 (IAU-FWG). The need for this standard stems from the difficulty of standardizing the format among all observatories with different characteristics and the impossibility of creating adapters among all the different formats. Consequently, a standard was created such that every observatory is able to transform data from FITS to its own internal format, and vice versa. A FITS file is composed of blocks of 2880 bytes, organized in a sequence of Header and Data Units (HDUs), possibly followed by special records. The 9 https://fits.gsfc.nasa.gov/iaufwg
  • 40. header consists of one or more blocks of 2880 bytes and contains stand-alone information or metadata that describes the subordinated unit of data [18]. To manipulate this format several Python modules are available, such as astropy.io.fits, used in this work within the FITSUtils module. This was a necessary step for the Euclid-like pipeline because the mission uses this file format for storing and distributing the data collected by the space telescope. 3.5.3 SkySim As the final part of the Euclid-like pipeline, SkySim, a basic sky simulator, was developed, which has the capability of generating synthetic astronomical images starting from a source catalog. The only purpose of this module was to briefly validate the pipeline and check if the extracted catalog was consistent with the one put as input. First, an image with no sources was generated (fig. 3.4), followed by an image with one source (fig. 3.5), with two identical sources (fig. 3.6), with two sources with different magnitude (fig. 3.7), with two overlapping sources (fig. 3.8) and finally with two overlapping sources with different magnitude (fig. 3.9). Each of these tests gave a positive result and the pipeline performed as expected, extracting the right number of sources with the correct photometric estimation. Real images are typically degraded by noise. SkySim has the ability to add additive noise with Gaussian distribution to an image according to a (a) Original (b) Deblended (c) Photometry Figure 3.4: No sources
  • 41. (a) Original (b) Deblended (c) Photometry Figure 3.5: One source (a) Original (b) Deblended (c) Photometry Figure 3.6: Two identical sources (a) Original (b) Deblended (c) Photometry Figure 3.7: Two sources with different magnitude (a) Original (b) Deblended (c) Photometry Figure 3.8: Two identical overlapping sources
  • 42. (a) Original (b) Deblended (c) Photometry Figure 3.9: Two overlapping sources with different magnitude predetermined Signal-to-Noise Ratio (SNR) value in dB. A normal Gaussian noise is defined by its mean µN and its standard deviation σN. Given the desired SNR to obtain in the synthetic image I, it is possible to calculate the σN of the Gaussian noise to add: σN = sqrt( var(I) / 10^(SNR/10) ). By adding pixel-by-pixel the values of the matrices that represent the noise and the original image, SkySim outputs an image with the chosen SNR. 3.6 Pipeline Project The pipeline project is the part of the developed software that aims to define and build the Euclid-like pipeline. Two files are required in order to define a pipeline: the Package Definition and the Pipeline Script. In the package definition file, as required by the Euclid Pipeline Runner, four Executable Python objects were implemented, one for each task. Listing 3.1 shows the first task of the pipeline, which performs the detection phase. Its inputs and outputs are specified as paths relative to workdir. The Executables of masking (listing 3.2), deblending (listing 3.3) and photometry (listing 3.4) have been defined in a completely similar manner.
  • 43.
pt_sextractor_detection = Executable(
    command=' '.join([
        'E-Run PT_MuSoDetection 0.2',
        'SourceDetectorPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)
Listing 3.1: Definition of the detection task

mask_image = Executable(
    command=' '.join([
        'E-Run PT_MuSoUtils 0.2',
        'FITSMask',
        '--mask_value -1000'
    ]),
    inputs=[Input('fits_image'),
            Input('mask')],
    outputs=[Output('masked')]
)
Listing 3.2: Definition of the masking task

pt_sextractor_deblending = Executable(
    command=' '.join([
        'E-Run PT_MuSoDeblending 0.2',
        'SourceDeblenderPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)
Listing 3.3: Definition of the deblending task
  • 44.
pt_sextractor_photometry = Executable(
    command=' '.join([
        'E-Run PT_MuSoPhotometry 0.2',
        'SourcePhotometerPipeline',
        'sextractor',
        '--log-level debug',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image'),
            Input('fits_image2'),
            Input('assoc_catalog')],
    outputs=[Output('apertures_map'),
             Output('catalog', mime_type='txt')]
)
Listing 3.4: Definition of the photometry task

In addition, the pipeline script file was created, in which the sources_extractor method, decorated with the @pipeline decorator provided by the Euclid framework, is defined (see listing 3.5). The logic inside specifies how to build the pipeline that will be executed by the Euclid Pipeline Runner.

@pipeline(outputs=('segmentation_map',
                   'apertures_map',
                   'catalog'))
def sources_extractor(image):
    seg_map, _ = pt_sextractor_detection(fits_image=image)
    masked = mask_image(fits_image=image, mask=seg_map)
    seg_map_deb, deb_catalog = pt_sextractor_deblending(fits_image=masked)
    apertures, catalog = pt_sextractor_photometry(fits_image=masked,
                                                  fits_image2=image,
                                                  assoc_catalog=deb_catalog)
    return seg_map_deb, apertures, catalog
Listing 3.5: Definition of the Euclid-like pipeline that will be executed by the EPR.
  • 45. Chapter 4 External Workflow Managers In this part of the thesis work the workflow managers are compared. As a starting point, it was decided to use the Euclid-like pipeline already developed and to create two more pipelines implemented by means of two external tools: Spotify's Luigi and Airbnb's Airflow. They have been chosen because they are written in Python, and thus EDEN compliant, are open-source and are very popular in the data flow domain. 4.1 Luigi Luigi1 is a Python package developed by the Spotify team and released in 2012 under the Apache License 2.02 . This tool helps to build pipelines of batch jobs [19]. Its features include workflow management, task scheduling and dependency resolution. One of Luigi's strengths is that it manages failures in a smart way, providing a built-in system for task status checking. In fact, if a task fails and has to be rescheduled and rerun, Luigi goes across the 1 https://github.com/spotify/luigi 2 The Apache License 2.0 is the second major release of the permissive free software license written by the Apache Software Foundation. See https://www.apache.org/licenses/LICENSE-2.0.
  • 46. dependency graph backwards until it encounters a successfully completed task. It then reschedules only the tasks downstream of that point in the graph, thus not scrapping the work done without failures. This can save a lot of time and computing resources if failures are not infrequent or if a pipeline shares some tasks already executed by another pipeline. 4.1.1 Pipeline Definition A Luigi pipeline is made of one or more tasks. For each task we can define an input, an output and the business logic to execute. The input is the set of parameters passed and the dependency list. The output is generally a file, called a target, that will be written to the local file-system or to a distributed one, such as the Hadoop Distributed File System (HDFS). Target The target is the product the task has to yield and it is defined through a file-system path. In Luigi we can find the class Target that represents this concept. In the default configuration, the existence of the file is the proof that the task executed successfully and determines whether its status is completed or not. It is possible, however, to override the default logic for deciding whether the task has to be considered done. Task A task consists of four fundamental parts: • Parameters: are the input arguments for the Task class and, together with the class name, uniquely identify the task. They are defined inside the class using a Parameter object or a subclass of it. • Dependencies: are defined through a task collection and set which other tasks have to be executed successfully before the current one can start. Such a collection is the object to return in the overridden method requires.
• 47. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 38 • Business logic: defined within the overridden method run. This part of the code is in charge of producing and storing the output of the task. • Outputs: defined through a collection of Target objects. Each target must point to the exact location of the file created by the business logic. Tasks Execution All the tasks defined in a Luigi pipeline are executed in the same process, which makes debugging straightforward but also sets a limit on the number of tasks a pipeline can be made of. Generally speaking, however, this does not represent a real problem until thousands of tasks are executed in the same pipeline [19]. The execution of a task follows these steps: 1. Check if the predicate that defines the completed status is satisfied. If it is, then check the next task in the graph. If the current task is the last one, the pipeline is completed. 2. Resolve all the dependencies. If one task from the dependencies is not completed, then execute that task. 3. Execute the run method. In order to start the entire pipeline, it is sufficient to call the last task defined in it and, thanks to the recursive algorithm, all tasks will be executed. Luigi does not come with an embedded triggering system, but one can be easily implemented.
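To make the structure just described concrete, the following minimal sketch shows a single Luigi task with the four parts above and how calling it triggers the recursive execution; the task name, parameter and target path are purely illustrative and are not taken from the thesis code.

import luigi

class DetectionTask(luigi.Task):
    # Parameters: together with the class name they uniquely identify the task
    fits_image = luigi.Parameter()

    def requires(self):
        # Dependencies: tasks that must complete successfully before this one
        return []

    def output(self):
        # Target: the file whose existence marks the task as completed
        return luigi.LocalTarget('detection_catalog.txt')

    def run(self):
        # Business logic: produce and store the output
        with self.output().open('w') as out:
            out.write('catalog extracted from %s\n' % self.fits_image)

if __name__ == '__main__':
    # Calling the last task of the pipeline is enough: Luigi resolves the
    # dependency graph recursively and runs only what is not completed yet
    luigi.build([DetectionTask(fits_image='image.fits')], local_scheduler=True)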
• 48. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 39 4.1.2 Integration with Elements It is not obvious that two frameworks can operate together and be easily integrated. In this section we will see how the author proposed to accomplish the task of developing a Luigi pipeline that executes the Elements projects previously written and described in Chapter 3. Task Implementation In the case of this work, the logic to execute is essentially a call to the Elements project through the bash command E-Run. The command is preset in EDEN and is associated with the execution of the script that starts the Elements execution. The ExternalProgramTask class, a subclass of Task, was then used. This class is part of Luigi's contrib module: it manages the logic of the run method and exposes another method, program_args, whose return value is the list of strings that will be passed as arguments to the subprocess.Popen class. It was chosen to implement the pipeline following the behavior of EPR, so as to obtain a system that is similar to use. As input, the path of the working directory and the data model file are required. Optionally, an id can be specified as a parameter to differentiate otherwise identical tasks (mainly for test and debug purposes). Each task has its own target file defined inside it; the path must be relative to the working directory. The return value of the program_args method is simply the command string seen in listings 3.1 - 3.4.
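As a rough illustration of this integration, the sketch below wraps one Elements project in an ExternalProgramTask. The project name, command options and target file name are placeholders modeled on listing 4.2 rather than the exact thesis code, so treat them as assumptions.

import os
import luigi
from luigi.contrib.external_program import ExternalProgramTask

class DetectionTask(ExternalProgramTask):
    # Working directory and input image; an optional id parameter could be
    # added to differentiate otherwise identical tasks.
    workdir = luigi.Parameter()
    fits_image = luigi.Parameter()

    def output(self):
        # Target file, defined relative to the working directory
        return luigi.LocalTarget(os.path.join(self.workdir, 'segmentation_map.fits'))

    def program_args(self):
        # This list is handed to subprocess.Popen by ExternalProgramTask.run
        return ['E-Run', 'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline', 'sextractor',
                '--preset', 'pipeline',
                '--workdir', self.workdir,
                '--fits_image', self.fits_image,
                '--segmentation_map', self.output().path]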
• 49. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 40 4.2 Airflow Airflow3 is a Python package developed by Airbnb and released in 2015 under the Apache License 2.0. In March 2016 the project joined the Apache Software Foundation's incubation program4 and since then it has been growing very quickly. The Incubator project is the path to becoming part of the Apache Software Foundation (ASF) for projects that want to contribute to the Apache foundation: all the code that will become part of the ASF must first pass through an incubation period within this project [20]. Airflow is a tool for describing, executing, and monitoring workflows. One of Airflow's strengths is the simplicity with which we can define the pipeline, though it does not offer a native way to specify how the intermediate results are managed. Moreover, it presents a great interactive graphical user interface that makes it easy to monitor the execution progress and state. 4.2.1 DAG Definition In Airflow every pipeline is defined as a Directed Acyclic Graph (DAG). As a matter of fact, the tool offers a Python class DAG that contains all the information needed for the execution of the tasks, such as the dag_id, the dependency graph, the start time, the scheduling period, the number of retries allowed and many other options. A task is a node of the dependency graph and is coded as an object of the class BaseOperator. That class is abstract and is designed to be inherited in order to define one of the three main operator types: • Action operators, which perform an action or trigger one • Transfer operators, in charge of moving data from one system to another • Sensor operators, which run until a certain criterion is satisfied, such as the existence of a file or a given time of day being reached. 4.2.2 Pipelining Elements Projects As done with Luigi and described in Section 4.1, the author developed an Airflow pipeline inside the Elements framework for executing the projects described in Chapter 3. First a DAG object has to be initialized with a few mandatory arguments as input: id, default arguments and schedule interval. Listing 4.1 shows how the DAG can be created. 3 https://airflow.apache.org 4 https://incubator.apache.org.
• 50. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 41

default_args = {
    'owner': 'airflow',
    'start_date': datetime.utcnow(),
    'email': ['airflow@example.com'],
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(dag_id='catalog_extractor', default_args=default_args,
          schedule_interval='@once')

Listing 4.1: Definition of the main DAG with Airflow for the catalog extractor pipeline.

To define a task performed by a bash command, there is the BashOperator object, which extends the BaseOperator. The string representing the bash command to be executed is passed as input to the operator. An example is shown in listing 4.2.

cmd = ' '.join(['E-Run',
                'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline',
                'sextractor',
                '--preset', 'pipeline',
                '--workdir', work_dir,
                '--logdir', logdir,
                '--fits_image', input_file,
                '--segmentation_map', detect_fits,
                '--catalog', detect_cat
                ])

detection_task = BashOperator(
    task_id='detection',
    bash_command=cmd,
    dag=dag)

Listing 4.2: Definition of the detection task with Airflow.

After defining the tasks, there are several ways to define the dependency graph. In this case it was chosen to use the shift operator, which is overloaded in BaseOperator to define the concatenation.
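For illustration, with four operators named after the tasks of this pipeline (names assumed here, not quoted from the thesis code), a linear dependency chain can be declared with the shift operator as follows:

# Each task starts only after the previous one has completed successfully;
# >> is the shift operator overloaded by BaseOperator to set dependencies.
detection_task >> masking_task >> deblending_task >> photometry_task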
• 51. Chapter 5 Comparison The purpose of this work is to make a comparison between workflow managers coming from different contexts and to verify, through a test on the Euclid case study, whether such external tools can be adopted as an integral part of the system. Four Elements projects have been developed that act as building blocks for the pipelines, and the EPR, Luigi and Airflow workflow managers have been studied. In this chapter we will propose ten metrics to evaluate whether Luigi or Airflow could be used within the mission. Subsequently, the executions necessary for the comparison will be performed and the results will be analyzed. 5.1 Metrics At this point of the work everything is set to execute the pipelines and gather the data needed for the comparison. As comparison metrics we proposed: • Execution time: an indicator of the overhead of the tool; it shows how the scheduler handles the executions. 42
• 52. CHAPTER 5. COMPARISON 43 • RAM memory usage: gives a measure of the memory required by the tool. It is important because the resources are limited. • CPU usage: gives a measure of the CPU required by the tool. It is important because the resources are limited. • Error handling: shows how the tool can handle failures inside the system. It is critical for automatic recovery and for minimizing human intervention. • Usability and configuration complexity: shows how complex the tool is to use and configure. • Distributed computing management: shows the capability of the tool to execute tasks in a distributed environment. • Workflow visualization: important for monitoring the execution progress and spotting any problem inside the system. • Integration in a framework: shows whether the tool is easy to use inside an existing framework, which is critical for the current case study. • Triggering system: shows whether the tool is capable of automating the triggering of executions. • Logging quality: shows whether the logs produced are actually useful for the developers to debug the system in case of failure. At the end of the data collection we will draw the conclusions by comparing the two types of workflow manager. The machine used for the tests has an Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz, with 4 cores dedicated to the virtual machine and a total dedicated RAM of 10.46 GB.
• 53. CHAPTER 5. COMPARISON 44 5.2 Execution Time In order to profile the three workflow managers with respect to time, it was sufficient to use the Linux built-in command time and some generated logs. Each tool was set up to run with no delay and minimum waiting time, and was tested with three different images given as input to the pipeline: a small one of 252x252 pixels, a large one of 5000x5000 pixels and another small image of 256x256 pixels generated by SkySim, referred to as simulated. All the results are averaged over 10 runs. The time command reports three numbers: real time, user time and sys time. Their meanings are [21]: • real: or wall clock time, indicates the total time elapsed from start to finish. • user: CPU time spent by the process outside the Linux kernel, i.e. in user mode. In user mode the process cannot directly access hardware or reference memory; code running in this mode can perform lower level accesses only through system APIs. • sys: CPU time spent by the process inside the Linux kernel, i.e. in kernel mode. In kernel mode the process can execute any CPU instruction and access any memory address without restriction. This mode is generally reserved for low-level functions of the operating system and must therefore consist of trusted code. Some privileged instructions can only be executed in kernel mode, such as interrupt handling and input/output management; if instructions of this type are executed in user mode, a trap is generated.
• 54. CHAPTER 5. COMPARISON 45

        EPR      Luigi    Airflow
real    40.419   16.665   38.325
user    21.258   14.570   21.194
sys      8.346    0.922    1.688

Table 5.1: Time needed for executing the scientific pipeline on the large image. Values are in seconds, divided into real, user and sys time.

        EPR      Luigi    Airflow
real    40.413    2.652   30.171
user    21.481    1.607    8.701
sys      8.649    0.298    1.022

Table 5.2: Time needed for executing the scientific pipeline on the small image. Values are in seconds, divided into real, user and sys time.

        EPR      Luigi    Airflow
real    40.409    2.634   30.008
user    21.871    1.781    8.526
sys      8.816    0.347    1.181

Table 5.3: Time needed for executing the scientific pipeline on the simulated image. Values are in seconds, divided into real, user and sys time.

From tables 5.1, 5.2 and 5.3 we can notice three main differences. First of all, EPR executes all three pipeline instances in roughly the same time, due to internal scheduling settings that are not meant to be changed by the user. Secondly, although EPR and Airflow perform about the same with the large image as input, in the case of the two smaller images Airflow completes the pipeline faster, reducing the real time in accordance with the lower user and sys times. This means that Airflow keeps its overhead steady in each execution. EPR, on the other hand, always needs the same real, user and sys times in order to complete all the tasks of the pipeline. This leads to the conclusion that EPR has fixed time slots in which it performs the tasks, and this behavior cannot be modified. It must also be noted that its sys time is consistently much higher than that of the other tools. Finally, Luigi confirms itself as the most lightweight workflow manager among the three from a time perspective: its scheduler executes each task right away, as soon as enough system resources are available.
• 55. CHAPTER 5. COMPARISON 46 5.3 Memory Usage As the memory profiling tool, the Python memory profiler1 was chosen: a module, based on psutil2, for recording the memory consumption of a process and its children. It was used in the time-based configuration, where the memory usage is plotted as a function of the execution time. Indeed, it is possible to plot the data directly after recording them by means of the same module. In order to start the monitoring, the command mprof run <script> can be used. After the script execution is done, the command mprof plot shows the plot of the last recorded run. Because all pipelines are built with Elements projects as tasks, all of them use at least one subprocess in order to run. For this reason the flag --include-children (or -C for short) was set first, and then --multiprocess (or -M for short), to tell memory profiler to consider all the children created by the main process. The include-children flag adds up the memory used by the main process and all its children, obtaining a single comprehensive value at each instant. The multiprocess flag considers all children independently, keeping the memory usage data separated for each of them. In this section MB will be used as a synonym of MiB, equivalent to 2^20 bytes, and GB as a synonym of GiB, or 2^30 bytes. This method was used to profile all three workflow managers, although it was not possible to obtain meaningful results from the EPR execution, due to its implementation; the workaround will be described later in this chapter. After profiling the three workflow managers, each one executing the scientific pipeline with the three images as inputs, the plots in figs. 5.1 - 5.9 were obtained. Figures 5.4 - 5.9 show, as expected, an increase in memory usage due to the image processing in correspondence with the execution of the four tasks. On the other hand, in figures 5.1 - 5.3, which plot the EPR RAM usage, it seems that no execution is detected. Comparing fig. 5.1b with figs. 5.4b and 5.7b, it is evident that the peak memory usage is not compatible with the amount required by the second task execution (about 350 MB). Furthermore, 1 https://pypi.org/project/memory-profiler 2 https://pypi.org/project/psutil
• 56. CHAPTER 5. COMPARISON 47 the three executions, which have to process images of significantly different sizes, seem to require the same amount of memory, with an average of roughly 45 MB. Analyzing the data gathered per process, we can see that every execution presents a main process of 38 MB and a main child process of 13 MB. One possible explanation for this behaviour could be the resource limitation imposed by the EPR, but this limit was set to 1 GB, well above the values obtained and the amount needed by the tasks. Consequently, a deeper study of the Python source code related to scheduling and pipeline execution in the Euclid software was conducted. It was found out that the tool is partly built by means of the package Twisted3 and its reactor component, to which the actual task execution is delegated. The reactor works in the background in a separate thread and communicates with the main thread through a callback system. This behavior is not detected by the profiler, which records only the memory used by EPR and Twisted, explaining why all the plots share a common trend. This finding led to trying other profilers, but none seemed to work properly in this situation. It was then decided to develop a custom memory profiling tool based on the features of the top bash command, which is able to report both memory and CPU usage; this turned out to be very useful also in the CPU load analysis. 3 https://twistedmatrix.com
• 57. CHAPTER 5. COMPARISON 48 Figure 5.1: EPR RAM memory usage versus time (large image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. The memory usage is similar throughout the duration of the profiling. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Also in this case, memory allocations compatible with the processing of the image under examination are not observed.
• 58. CHAPTER 5. COMPARISON 49 Figure 5.2: EPR RAM memory usage versus time (small image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. The memory usage is similar throughout the duration of the profiling, as happens in fig. 5.1. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Also in this case, memory allocations compatible with the processing of the image under examination are not observed.
• 59. CHAPTER 5. COMPARISON 50 Figure 5.3: EPR RAM memory usage versus time (simulated image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. The memory usage is similar throughout the duration of the profiling, as happens in fig. 5.1. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Also in this case, memory allocations compatible with the processing of the image under examination are not observed.
• 60. CHAPTER 5. COMPARISON 51 Figure 5.4: Luigi RAM memory usage versus time (large image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. Three of them are SExtractor processes (in green), while the highest peak, in red, represents the memory usage of the Python masking project.
• 61. CHAPTER 5. COMPARISON 52 Figure 5.5: Luigi RAM memory usage versus time (small image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored.
• 62. CHAPTER 5. COMPARISON 53 Figure 5.6: Luigi RAM memory usage versus time (simulated image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored.
• 63. CHAPTER 5. COMPARISON 54 Figure 5.7: Airflow RAM memory usage versus time (large image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Airflow needs seven processes in order to execute the pipeline.
• 64. CHAPTER 5. COMPARISON 55 Figure 5.8: Airflow RAM memory usage versus time (small image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Airflow needs seven processes in order to execute the pipeline.
• 65. CHAPTER 5. COMPARISON 56 Figure 5.9: Airflow RAM memory usage versus time (simulated image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Airflow needs seven processes in order to execute the pipeline.
• 66. CHAPTER 5. COMPARISON 57 5.3.1 Top Based Profiling Tool Euclid Pipeline Runner has a structure that does not allow the execution to be profiled directly: it runs the tasks asynchronously through a Twisted job submission. The top command provides an overview of the resource usage of the whole system at a particular instant, both as overall statistics and divided by process. The visualization is interactive and it is possible to display the processes sorted by memory or CPU usage. Top also allows the time interval between samples to be specified, in this case 0.1 s. The flag -c tells top to show the command line associated with each process instead of just the program's name. This was necessary in order to include the right processes in the profiling: top shows a system snapshot, so it is crucial to manually filter all and only the wanted tasks. Finally, the -b flag was set to run the command in batch mode, to be used when the output has to be redirected into a file. Each sample was indeed written to a file, which was subsequently pre-processed until a Python array ready for the actual analysis was obtained. In the end, the bash command was top -d 0.1 -c -b > out.top. The pre-processing and processing phases were written in a Python script; a sketch of the pre-processing is given after figure 5.10. The steps used for the pre-processing were: 1. Load the content of the top file as a string; 2. Extract the lines containing the keywords that identify the processes to profile; 3. Apply an additional process filtering to remove unwanted items; 4. Replace multiple blank lines with a single one; 5. Strip the text to remove empty parts at the beginning and end of the file content; 6. Extract the columns, keeping only the memory and CPU load values; 7. Split the string using the blank line as separator, obtaining a sample array,
• 67. CHAPTER 5. COMPARISON 58 where a sample contains the values of one or more simultaneous processes; 8. Split the sample lines using the space as separator, obtaining an array of values for each process; 9. For each sample, sum the corresponding values of all the related processes, obtaining one value per parameter per sample. Figure 5.10: EPR memory usage versus time obtained with the custom profiler (large image). Unlike what we saw in fig. 5.1a, in this case the four peaks corresponding to the tasks execution are clearly visible.
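As a rough sketch of the pre-processing just listed (not the exact thesis script), the function below turns a top -b capture into one memory and one CPU value per sample. The keyword filter, the use of each snapshot header to separate samples instead of blank lines, and the column positions (RES as the 6th field and %CPU as the 9th of the default top layout) are assumptions that may need adjusting.

def parse_top_capture(path, keywords, mem_col=5, cpu_col=8):
    """Return one summed (memory in MiB, CPU in %) pair per top sample."""
    with open(path) as capture:
        # One chunk per snapshot; here the 'top - ' header is used as the
        # separator instead of the blank lines of the original steps.
        snapshots = capture.read().split('top - ')

    memory, cpu = [], []
    for snap in snapshots[1:]:
        mem_sum, cpu_sum = 0.0, 0.0
        for line in snap.splitlines():
            # Keep all and only the processes under profiling
            if not any(key in line for key in keywords):
                continue
            fields = line.split()
            res = fields[mem_col]
            if res.endswith('g'):          # RES reported in GiB
                mem_sum += float(res[:-1]) * 1024.0
            elif res.endswith('m'):        # RES reported in MiB
                mem_sum += float(res[:-1])
            else:                          # plain RES value in KiB
                mem_sum += float(res) / 1024.0
            cpu_sum += float(fields[cpu_col].replace(',', '.'))
        memory.append(mem_sum)
        cpu.append(cpu_sum)
    return memory, cpu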
• 68. CHAPTER 5. COMPARISON 59 Figure 5.11: EPR memory usage versus time obtained with the custom profiler (small image). Unlike what is shown in fig. 5.2a, in this case the four peaks corresponding to the tasks execution are clearly visible.
• 69. CHAPTER 5. COMPARISON 60 Figure 5.12: EPR memory usage versus time obtained with the custom profiler (simulated image). Unlike what is shown in fig. 5.3a, in this case the four peaks corresponding to the tasks execution are clearly visible.
• 70. CHAPTER 5. COMPARISON 61 The profiling was repeated also for Luigi and Airflow in order to check the reliability of the method: the results obtained with memory profiler and with this tool were compared, verifying their consistency. The plots obtained by means of the custom profiler are shown in figs. 5.13 and 5.14. The results proved to be consistent with those previously obtained.
• 71. CHAPTER 5. COMPARISON 62 Figure 5.13: Luigi RAM memory usage with the custom profiler: (a) large image, (b) small image, (c) simulated image. The results are consistent with what is shown in figs. 5.4 - 5.6.
• 72. CHAPTER 5. COMPARISON 63 Figure 5.14: Airflow RAM memory usage with the custom profiler: (a) large image, (b) small image, (c) simulated image. The results are consistent with what is shown in figs. 5.7 - 5.9.
• 73. CHAPTER 5. COMPARISON 64 5.4 CPU Usage For CPU profiling, the approach developed during the memory analysis was used. As already mentioned, the use of top was a big help also for gathering data about CPU usage; a Python array of CPU usage values was therefore obtained in the same manner described in subsection 5.3.1. Within the developed pipeline, each task has to run sequentially. The percentage of CPU load is referred to the 4 cores dedicated to the virtual machine and, because of that, we expect to see an upper bound on the load at 25%. On the other hand, EPR runs on multiple processes, so there is some chance of seeing loads over 25%. As expected, the plots in fig. 5.15 show that the CPU usage of EPR can be as high as 43%, while in figs. 5.16 and 5.17 the task execution does not exceed the full usage of one core, except for some isolated peaks. Of course, both Luigi and Airflow can work in parallel on a multi-core machine if the pipeline and the tasks allow it. The interesting fact to note is how the resources are used during the idle time, with a noticeable difference between Airflow and EPR: the first uses negligible or no CPU resources when no task is executed, while the other seems to require a constant amount of CPU. The cause of this behavior is the cyclic polling needed by EPR to check the execution status of the tasks. Luigi, as shown in fig. 5.16, uses as many resources as possible with minimum idle time. This characteristic makes it the quickest workflow manager among the three, but it is an aspect to consider carefully when designing and launching the pipelines, in order not to saturate the resources available on the system. In all cases we can notice a peak in CPU usage due to the initialization of the scheduler and the web server.
• 74. CHAPTER 5. COMPARISON 65 Figure 5.15: EPR CPU usage: (a) large image, (b) small image, (c) simulated image. This workflow manager constantly occupies a certain amount of CPU to perform the periodic polling of the Reactor, the Twisted component to which the execution of the tasks is delegated.
• 75. CHAPTER 5. COMPARISON 66 Figure 5.16: Luigi CPU usage: (a) large image, (b) small image, (c) simulated image. Luigi exploits all the resources available to execute the tasks and has no waiting time.
• 76. CHAPTER 5. COMPARISON 67 Figure 5.17: Airflow CPU usage: (a) large image, (b) small image, (c) simulated image. Airflow exploits all the resources available to execute the tasks and presents some waiting time.
• 77. CHAPTER 5. COMPARISON 68 5.5 Error Handling Error handling concerns how the workflow manager behaves in case of failure during the pipeline execution. An exception can be raised at any point of the pipeline execution and it is the workflow manager's duty to handle the situation and prevent or minimize repercussions on the system. Typically, two kinds of action can be taken: 1. Abort and cancel the job 2. Abort and reschedule the job Within the scientific context we are considering, it is rare for a failed pipeline to be no longer needed and thus cancelled. In fact, the scientific value of the output is preserved even if the result is available some time later, there being no deadline or real-time purpose. Ideally, then, we want every triggered pipeline to be completed successfully. For these reasons, we will analyze the behavior of the workflow managers in case of failure in the specific scenario where it is desirable that the pipeline is rescheduled until its completion. In order to simulate a failure, we introduced an error in the second task of the Euclid-like pipeline developed previously, making it impossible to complete successfully. Then, the behavior of each workflow manager in front of this situation was observed, noticing, as expected, a stop in the pipeline execution. EPR does not implement any recovery strategy in case of pipeline failure. This feature is foreseen in future versions of EDEN; for the moment the continuous deployment tool guarantees that a stable version of the pipeline is available. In the scientific context it is often necessary for an expert to validate the output, thus requiring some sort of human intervention in any case. However, an error management component brings an important automation that reduces the time needed to reset the environment and restart the pipeline. EPR notifies the user by highlighting in red the task of the dataflow graph where the error has
• 78. CHAPTER 5. COMPARISON 69 occurred and gives a log message with the traceback of the execution (fig. 5.18). From this state it is not possible to recover the pipeline execution: the user has to fix the issue and re-run the entire pipeline, performing all tasks again, including those already completed successfully.

Figure 5.18: EPR error notification. The node in red highlights the task where the failure occurred; at the same time the traceback of the execution is shown.

Luigi handles a task crash in a smarter way, keeping the work done successfully before the failure. As we can see in listing 5.1, Luigi notifies the user through the terminal and, at the same time, through the Graphical User Interface (GUI) (figs. 5.19 and 5.20).

===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 ran successfully:
    - 1 DetectionTask(<params>)
* 1 failed:
    - 1 MaskingTask(<params>)
* 2 were left pending, among these:
    * 2 had failed dependencies:
• 79. CHAPTER 5. COMPARISON 70 Figure 5.19: Luigi error notification on the task list page. The task where the error occurred is highlighted and a warning on the downstream tasks is shown.
• 80. CHAPTER 5. COMPARISON 71
    - 1 DeblendingTask(<params>)
    - 1 PhotometryTask(<params>)

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====

Listing 5.1: Luigi terminal output in case of task failure. In this case a scenario where the second task fails is simulated.

Figure 5.20: Luigi error notification on the dependency graph page. The task where the error occurred is highlighted (red dot).

In order to recover from this state, it is sufficient to fix the issue and re-run the pipeline without any other intervention. Luigi automatically checks each task, from the last to the first, and if the required target exists for that specific task, it recovers the pipeline execution from that point. This simple yet very efficient way of handling errors makes Luigi an interesting choice if the failure rate is high, and it is very handy in the development phase. During the test, we simply fixed the error in the second task and triggered the pipeline again. The execution was a success and the final output was generated as expected.
• 81. CHAPTER 5. COMPARISON 72 For the outputs generated after the recovery, see listing 5.2 and figs. 5.21 and 5.22.

Figure 5.21: Luigi recovered execution on the task list page. All tasks are green, which indicates a successful execution.

===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 complete ones were encountered:
    - 1 DetectionTask(<params>)
* 3 ran successfully:
    - 1 DeblendingTask(<params>)
    - 1 MaskingTask(<params>)
    - 1 PhotometryTask(<params>)

This progress looks :) because there were no
failed tasks or missing dependencies
• 82. CHAPTER 5. COMPARISON 73
===== Luigi Execution Summary =====

Listing 5.2: Luigi terminal output after the pipeline recovery.

Figure 5.22: Luigi recovered execution on the dependency graph page. All tasks are green, which indicates a successful execution.

Another interesting feature that comes with this type of error handling is that, if a pipeline is incorrectly triggered after a successful execution, Luigi won't run any task and instead notifies the user that the wanted output is already available (see listing 5.3).

===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 complete ones were encountered:
    - 1 PhotometryTask(<params>)

Did not run any tasks
This progress looks :) because there were no
failed tasks or missing dependencies
• 83. CHAPTER 5. COMPARISON 74
===== Luigi Execution Summary =====

Listing 5.3: Luigi terminal output in case a completed pipeline is executed again.

Figure 5.23: Airflow error notification. Each column represents a pipeline run and each block is a task. The failed task is the red square, while the red circle represents the pipeline run that has not been successfully completed.

Airflow takes on an intermediate behavior, requiring some manual intervention to specify from where to recover the pipeline. All the notifications go through the web server, and one of the several screens showing the pipeline execution status can be seen in fig. 5.23. 5.6 Usability Usability takes into consideration the installation process and configuration, the pipeline definition and the execution. The setup taken into account is: web server and scheduler running in the background with the help of a systemd service, and manual pipeline startup.
• 84. CHAPTER 5. COMPARISON 75 5.6.1 EPR Installation and Configuration The Euclid Pipeline Runner needs these steps in order to be installed and configured: 1. Deploy to CVMFS or clone the libraries from the repository. 2. Define the configuration file for the systemd service in /etc/systemd/system/euclid-ial-wfm.service/local.conf. 3. Define the configuration file for the Interface Abstraction Layer4 (IAL) in /etc/euclid-ial/euclid_prs_app.cfg. 4. Define the server configuration file in /etc/euclid-ial/euclid_prs.cfg. 5. Load into the environment the two files /etc/profile.d/euclid.sh and /etc/euclid-ial/euclid_prs.cfg. 6. Load into the environment the variable EUCLID_PRS_CFG=/etc/euclid-ial/euclid_prs_app.cfg. 7. Start the service daemon through the bash commands systemctl enable euclid-ial-wfm and systemctl start euclid-ial-wfm. For the pipeline definition, the EPR needs the tasks to be defined with a bash command that in turn calls a Euclid project module. Two files have to be created to build the pipeline: • Package definition: where the tasks are built by wrapping the bash commands in an Executor object, defining also the inputs, the outputs and the maximum resources allowed to be allocated for the specific task. 4 An abstraction layer that allows the execution of data processing software in each SDC independently of the IT infrastructure
• 85. CHAPTER 5. COMPARISON 76 • Pipeline script: where the pipeline and its dependency graph are constructed by combining the executor objects outlined in the package definition. In this manner every task is standalone and reusable as it is in any pipeline. Therefore the EPR offers a good modular structure for tasks and pipelines and a built-in capability for input/output definition and parameter passing. 5.6.2 Luigi Installation and Configuration Luigi needs only three steps to be ready for use: 1. Install the Python package: pip install luigi. 2. Add luigid.service to /usr/lib/systemd/system, defining a standard systemd service configuration. 3. Start the web server service daemon through the bash commands systemctl enable luigid and systemctl start luigid. Luigi does not distinguish between task definition and pipeline construction: the dependencies are specified directly inside each task, tying them to a particular data flow. It is however possible to write the task's logic within an external function and reuse it in a modular way. Since the dependencies are explicit inside a task, Luigi offers a built-in system for defining inputs and outputs and for passing parameters throughout the pipeline. It does not guarantee clean code, because every parameter has to be passed as an input argument to each task downstream of the one that needs it, as sketched below. However, there are implementations that compensate for this lack, e.g. SciLuigi5. 5 https://github.com/pharmbio/sciluigi
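To illustrate the parameter-threading issue just mentioned, the toy sketch below (task names and target paths are invented for the example, not taken from the thesis code) shows a downstream task that must re-declare a parameter only in order to hand it back to its upstream dependency:

import luigi

class Detection(luigi.Task):
    # The parameter actually consumed by this task
    fits_image = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(self.fits_image + '.catalog')

    def run(self):
        with self.output().open('w') as out:
            out.write('catalog for %s\n' % self.fits_image)

class Photometry(luigi.Task):
    # The same parameter has to be re-declared here only so that it can be
    # passed back to the upstream task inside requires()
    fits_image = luigi.Parameter()

    def requires(self):
        return Detection(fits_image=self.fits_image)

    def output(self):
        return luigi.LocalTarget(self.fits_image + '.photometry')

    def run(self):
        with self.input().open() as catalog, self.output().open('w') as out:
            out.write('photometry from %d catalog lines\n' % len(catalog.readlines()))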