University of Trieste
DEPARTMENT OF ENGINEERING AND ARCHITECTURE
Master's degree in Computer and Electronic Engineering
Workflow management solutions:
the ESA Euclid case study
Candidate
Marco POTOK
Matricola IN2000004
Thesis advisor
Prof. Francesco FABRIS
Thesis co-advisor
Dott. Erik ROMELLI
Academic Year 2017-2018
Contents

Introduction

1 Workflow Management
  1.1 The Big Data Era
      1.1.1 The Rise of e-Science
  1.2 Workflow Manager
      1.2.1 Data Pipeline
  1.3 Towards Scientific Workflow Managers

2 Euclid Mission
  2.1 Mission objective
      2.1.1 Spacecraft and Instrumentation
  2.2 Ground Segment Organization
  2.3 SDC-IT

3 Software Development
  3.1 Development Environment
  3.2 Elements
  3.3 Image Processing
  3.4 Multi-Software Catalog Extractor
      3.4.1 Developing and Versioning Methods
      3.4.2 Detection
      3.4.3 Deblending
      3.4.4 Photometry
  3.5 Utils Module
      3.5.1 The Masking Task
      3.5.2 FITS Images
      3.5.3 SkySim
  3.6 Pipeline Project

4 External Workflow Managers
  4.1 Luigi
      4.1.1 Pipeline Definition
      4.1.2 Integration with Elements
  4.2 Airflow
      4.2.1 DAG Definition
      4.2.2 Pipelining Elements Projects

5 Comparison
  5.1 Metrics
  5.2 Execution Time
  5.3 Memory Usage
      5.3.1 Top Based Profiling Tool
  5.4 CPU Usage
  5.5 Error Handling
  5.6 Usability
      5.6.1 EPR Installation and Configuration
      5.6.2 Luigi Installation and Configuration
      5.6.3 Airflow Installation and Configuration
  5.7 Distributed Computing
  5.8 Workflow Visualization
  5.9 Framework Integration
  5.10 Triggering System
  5.11 Logging

Conclusions

Acronyms
Introduction
Space missions, and scientific research in general, need to handle an ever-
increasing amount of data, recently stepping de facto into the big data terri-
tory. An infrastructure capable of managing this volume of data is required,
including a proper workflow manager. Adopting an external tool, and thus
avoiding the need to develop in-house software, can represent a time-saving
choice that allows more resources to be diverted towards science-related activities.
Luigi and Airflow are two workflow managers developed in an industrial
context, respectively by Spotify and Airbnb. They are software tools made
available to the public by means of an open-source license which, as a re-
sult, has allowed them to be constantly maintained and improved by a large
community.
The aim of this thesis is to test the feasibility of using Luigi or Air-
flow as workflow manager within a large scientific project such as the Euclid
space mission, which is the case study of this work. The test consists in a
comparison between the two workflow managers and the one currently used
by the Euclid Consortium developers: the Euclid Pipeline Runner (EPR).
The comparison is of interest to the scientific community because it provides
an overview of the tools already available, of whether they fit the project's
requirements and of whether they offer the necessary features. It is also a
valuable opportunity to estimate the overall performance and characteristics
a workflow manager can provide, supporting the decision to add some
components to the in-house design or even to adopt the external tool entirely.
In order to capture the behavior of these tools during their operation, it is
necessary to define a pipeline. For this reason, the first step of this work was
to become familiar with the Euclid development environment, within which
four software projects were later created following the constraints of the
mission framework. Afterwards, an Euclid-like scientific pipeline was built by
means of the four projects, obtaining a first pipeline running on EPR. Two
other pipelines were then built, one for Luigi and one for Airflow, in order to obtain
all the necessary elements for the comparison.
The comparison between Luigi, Airflow and EPR was performed by means
of ten metrics, chosen based on the main needs expressed by the developers
of the Euclid environment:
• Execution time
• RAM memory usage
• CPU usage
• Error handling
• Usability and configuration complexity
• Distributed computing
• Workflow visualization
• Integration in a framework
• Triggering system
• Logging quality
The results obtained seem encouraging and suggest that the external work-
flow managers can actually be used within a space mission environment,
bringing some performance improvement and offering several additional fea-
tures compared to EPR. This work is structured as follows. Chapter 1
introduces workflow managers, explaining the need to use them to manage
large amounts of data and the role they play in the scientific field. Chapter 2
gives an overview of the Euclid mission, covering its scientific objectives and
the organizational structure behind the part of the mission that handles the
scientific data. Chapter 3 is dedicated to describing the development
environment of the mission and the development phases of the software
produced for this work. Chapter 4 introduces Luigi and Airflow, the external
workflow managers, along with their main features and characteristics.
Finally, Chapter 5 focuses on the description of the comparison metrics and
the results obtained.
Chapter 1
Workflow Management
Data is an extremely valuable new resource and is collected at an ever-increasing
pace, but its true value lies in the information it contains. For this reason, a
wide range of new tools for big data processing has been developed in recent
years. A subset of these tools, called workflow managers, is in charge of
coordinating the data processing steps. Automation and fault tolerance are the
main required features, and both have to be implemented in a distributed
system. The scientific community, too, is increasingly adopting large-volume
data acquisition [1]. This also applies to astronomical research, the field in
which this work has been carried out.

The rapid evolution of computer technology and processing power has boosted
the design of more and more complex surveys and simulations. For instance,
it gradually became possible to add extra dimensions to the collected data,
such as time or a third spatial dimension [2]. This extra dimension can be
obtained by repeating observations of the same object in order to spot transient
phenomena, or it can be a 3D scan, mapping the sky along the depth axis, as
Euclid will be able to do.

As has often happened in the history of modern astronomy, in this decade we
are witnessing an explosion in the volume of the datasets used for astronomy:
the amount of data has increased by an order of magnitude compared to just a
few years ago. Statistical analysis, applied to the correlation of large data
volumes, is now essential to make new discoveries that would be impossible to
obtain with legacy methods, which often did not even rely on an information
system.
1.1 The Big Data Era
With the increasing demand for more and more information to improve
the accuracy of scientific research, the world of astronomy has had to face data
management problems that the industrial world has already begun to solve.
This new type of data is part of the phenomenon called big data, although
a precise definition of this entity has not yet been established. Big data are
identified through their characteristics, among which five are widely accepted:
Volume, Velocity, Variety, Veracity and Value, the 5 Vs of big data.
• Volume: refers to the amount of data collected, which can no longer be
stored in a single node: a complete system must be set up for the
correct and efficient management of the data. Furthermore, a structure
in the data is no longer guaranteed, and relational databases (which store
data in tables made of columns and rows, where each column holds one
type of data and rows are related through a key, i.e. one or more columns
that uniquely identify a row within the table) are no longer the best choice,
resulting in the rise of database management systems that are no longer
relational but imitate a hash table structure.
• Variety: represents the lack of homogeneity of the collected data,
which come from different sources and are unstructured or semi-structured.
These types of data need a more intelligent processing chain that can adapt
to each case.
• Velocity: refers to the volume generated per unit of time and also the
rate at which data must then be processed and made available. With-
out an appropriate distributed infrastructure it would be impossible to
carry out such a difficult task.
• Veracity: represents the guarantee that the data are consistent, reli-
able and authentic.
• Value: refers to the added value that could not be obtained without
data having the previous characteristics. More data implies more accurate
analyses and more reliable results.
1.1.1 The Rise of e-Science
Whatever the final purpose of the research, from exploring the extremely
small to mapping the vastness of the visible Universe, the aspects of data
management, their analysis and their distribution, are increasingly predom-
inant within scientific experimentation. The science that produces large
amounts of data effectively possessing the characteristics of big data is
called e-Science, a term coined in the UK in 1999 by John Taylor, then
director general of the Office of Science and Technology, who faithfully
anticipated the direction of technological development that the scientific
field would take from that moment on. e-Science is therefore the technological
face of modern science, which produces and consumes large amounts of
data and, for this reason, must be supported by an adequate infrastructure
for storing, distributing and accessing the collected data. This infrastructure
is often called Scientific Data e-Infrastructure (SDI). Meanwhile, the
term Cyberinfrastructure was coined in the United States to describe
the same information and infrastructural needs that e-Science implies [3, 4].
This shows how the phenomenon that developed at the end of the 1990s and
early 2000s was in fact involving a large part of the scientific community. From
that moment on, the demand for systems with ever-improving performance,
capable of handling an ever-increasing data volume, has become progressively
more important. Adopting the big data paradigm for science was made possible
by the change in mentality started with e-Science, which matured into better
scientific instruments capable of collecting huge volumes of data and into an
SDI infrastructure capable of distributing them appropriately [3].
1.2 Workflow Manager
New software tools, new architectures and new programming paradigms
have been developed for management and processing of large amounts of
data produced in response to new scientific and industrial needs. A subset of
the tools developed in this ecosystem are the workflow managers, employed
to build systems capable of working in a distributed environment and robust
enough to carry out their task without causing a complete stop of the system
in case of partial failure. A workflow manager is a software tool that helps
to define the dependencies among a set of software modules or tasks. We can
identify two main jobs the workflow manager has to accomplish: dependency
resolution and task scheduling. The dependency resolution is essential to
schedule the tasks in the right order and make sure every module is run if
and only if all its dependencies are completed successfully. The scheduler
has to decide when each task should be executed in order to optimize the
use of the available resources.
1.2.1 Data Pipeline
A data pipeline is a concatenation of tasks where, generally speaking, the
output of one module becomes the input of the next. In this
way, the modules can be developed independently and in a modular fashion,
where the only requirement to meet is the interface defined between the two.
This interface can be as simple as the information about the file type and
its location in the file system. Two distinct approaches can be identified in
defining the pipelines and their workflows: the business and the scientific
one.
• A typical business workflow has features such as efficiency in execu-
tion, independence between different workflows and human involvement
in the process. Another characteristic of this approach is that a pipeline
is defined through a control-flow, i.e. the dependencies between tasks
are based on their status. For example, if a task X is dependent on task
Y, X is not executed until Y is in the completed state. Finally, the data
are typically not streams: pipeline execution is not continuous but
triggered on demand, when there is a need.
• A scientific workflow has the task of producing outputs that are
the result of experimentation; the instances of different workflows are
to some degree correlated with each other, and automation is exploited
as much as possible. Although automation is important, so is the
possibility of accessing intermediate results that an expert can validate
on the fly. The pipeline is focused on the data-flow rather than on the
control-flow, i.e. a task is not executed until its input is available. This
approach is therefore called data-driven. The data flow is described by a
Directed Acyclic Graph (DAG), where each node represents a task and
the graph's topological ordering defines the dependencies (a minimal
sketch of this model follows the list). Data is often a continuous flow,
and all the tasks in a pipeline usually work on different data at the same time.
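The data-driven model can be made concrete with a short Python sketch (purely illustrative; the task names are hypothetical and not taken from any real pipeline): a pipeline is a DAG expressed as a dependency mapping, and a topological ordering of that DAG is a valid execution order, which is essentially what a workflow manager's dependency resolution computes.

    from collections import deque

    # Each task maps to the set of tasks whose outputs it consumes.
    dag = {
        "detection":  set(),
        "masking":    {"detection"},
        "deblending": {"masking"},
        "photometry": {"masking", "deblending"},
    }

    def topological_order(dependencies):
        """Kahn's algorithm: emit a task only when all its dependencies are done."""
        pending = {task: set(deps) for task, deps in dependencies.items()}
        ready = deque(task for task, deps in pending.items() if not deps)
        order = []
        while ready:
            task = ready.popleft()
            order.append(task)
            for other, deps in pending.items():
                if task in deps:
                    deps.discard(task)
                    if not deps:
                        ready.append(other)
        if len(order) != len(pending):
            raise ValueError("cycle detected: the workflow is not a DAG")
        return order

    print(topological_order(dag))
    # ['detection', 'masking', 'deblending', 'photometry']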
1.3 Towards Scientific Workflow Managers
For the reasons mentioned previously in this chapter, which include the
production of big data in the scientific field, there has been an increase in
the use of workflow managers in science, which is no longer conducted by the
individual but has become a joint effort of many organizations and national
institutes. A scientific workflow manager must be the means that allows a
research team to obtain its results, and it must therefore be as transparent as
possible to the user. Among its objectives we can find [4, 5]:
• Description of complex scientific procedures; hence workflow reuse,
along with modularity in task construction, becomes important.
• Automatic data processing according to the desired algorithms, and
possibility to inspect intermediate results.
• Provide high performance computation capabilities with the help of an
appropriate infrastructure.
• Reduce the amount of time researchers spend working on the tools,
allowing them to spend more time conducting research.
• Decrease machine time, optimizing software execution instead of in-
creasing physical resources.
To move towards these objectives, a huge number of new tools has been
developed within the scientific field, including programming languages and
whole systems. For example, more than a hundred different custom pipeline
managers have appeared in a short time, making it difficult to port systems
and code, leading to a lack of result reproducibility [6]. This situation does
not help scientific discoveries, which sometimes lose the ability to be tested
by independent parties. A solution could be to identify a standard to
adopt, or to make the tools developed free and easy to use. A step in the
right direction, perhaps, could be to use more generic systems able to satisfy
specific needs. One sector that needs generic tools is the industrial one, which
has in fact produced highly efficient and easy to use systems. These tools
are often not adopted by the scientific community that relies on products
developed in-house. The result is that technologies with a similar scope are
created independently by science and industry, thus missing an opportunity
to share their capabilities and resources.
In this work we want to take the European Space Agency (ESA) Eu-
clid Space Mission [7] and its workflow manager as a case study and verify,
through an analysis of ten metrics, whether two popular tools for pipeline
management developed in the industrial world can be adopted within a space
mission. Contrary to what happens with scientific workflow managers, where
the purpose is to solve a specific instance of a certain problem, in the field
of IT companies and startups, the tendency is to take a path towards the
design of software that is as generic as possible, flexible and adaptable to
different scenarios.
Chapter 2
Euclid Mission
ESA's Euclid mission is a medium-class (M-class) mission, part of
the Cosmic Vision program. Its main objective is to gather new insights
about the true nature of dark matter and dark energy. In this chapter we
will explore the main mission characteristics and the basic science behind the
measurement of the universe.
2.1 Mission objective
Thanks to the ESA Planck mission (http://sci.esa.int/planck), researchers
confirmed that many questions about the nature of our Universe are currently
open. As shown in fig. 2.1, only 4.9% of the matter we are surrounded by is
baryonic matter, i.e. what is commonly addressed as ordinary matter, such as
all the atoms made of protons, neutrons and electrons (although the electron
is a lepton, in astronomy it is included in the baryonic matter because its
mass is negligible with respect to that of the proton and the neutron).

Figure 2.1: Estimated composition of our Universe. Source:
http://sci.esa.int/planck.

Another 26.8% is Dark Matter (DM), a component with high mass density
that interacts with itself and other matter only gravitationally. Moreover, it
does not interact with the
electromagnetic force, making it transparent to the electromagnetic spectrum
and really hard to spot. The remaining 68.3% is what is called Dark Energy
(DE), owing to its unknown nature [8]. This component is linked with the
accelerated expansion of the Universe, but currently there is no direct evidence
of its actual existence. Different models have been developed to explain the
nature of the effects observed and currently attributed to dark energy. In
order to gather more data that could bring new insights into the nature
of dark matter and dark energy, the chosen approach is to observe and
analyze two cosmological probes: Weak Lensing (WL) and Baryonic
Acoustic Oscillations (BAO). The WL effect is caused by a mass concentration
that deflects the path of light traveling towards the observer. This effect
is detectable only by measuring statistical and morphological properties
of a large number of light sources. Euclid is expected to image about 1.5
billion galaxies, capturing useful data to study the correlation between their
shapes and mapping with high precision the expansion and growth history of
the Universe [7]. A BAO is a density variation in the baryonic matter caused
by a pressure wave formed in the primordial plasma of the universe.
Measuring the BAO of the same sources at different redshifts makes it possible
to estimate the expansion rate of the universe.
2.1.1 Spacecraft and Instrumentation
The Euclid space telescope has a 1.2 m Korsch architecture with a 24.5 m
focal length. The spacecraft carries two main instruments that will generate all
the data for the mission. They are both electromagnetic spectrum sensors,
one specialized for photometry in the visible wavelengths and the other for
infrared spectroscopy.
• VIS: the VISible instrument will be used to acquire images in the visible
range of the electromagnetic spectrum (550-900 nm). It is made of
36 CCDs, each counting 4069x4132 pixels (see fig. 2.3a). The weak
lensing effect will be measured through the data obtained with
this instrument.
• NISP: the Near Infrared Spectrometer and Photometer has two components:
the near infrared spectrometer, operating in the 1100-2000 nm range, and
the near infrared imaging photometer, working in the Y (920-1146 nm),
J (1146-1372 nm) and H (1372-2000 nm) bands. It is composed of 16
detectors, each counting 2040x2040 pixels (see fig. 2.3b). The main purpose
of this instrument is to measure the BAO at different redshifts.
The two instruments will have about the same field of view, 0.54 and 0.53
deg² respectively, but VIS will offer a much greater resolution. Euclid has
the requirement to perform a wide survey of at least 15,000 deg² of sky,
possibly reaching 20,000 deg². In combination with this, two further deep
surveys of 20 deg² each are planned [7].
Figure 2.2: Schematic figure of the Thales Alenia Space’s concept of the
Euclid spacecraft. Source: http://sci.esa.int/euclid.
(a) One of the 36 CCDs that will compose the VIS instrument. Source:
http://sci.esa.int/euclid. Copyright: e2v.
(b) One of the 16 CCDs that will compose the NISP instrument. Source:
http://sci.esa.int/euclid. Copyright: CPPM.
Figure 2.3: Actual flight hardware for the Euclid spacecraft.
2.2 Ground Segment Organization
As in almost every space mission, the Euclid mission has its own
space system made of three segments:
• Space Segment: which includes the spacecraft along with the com-
munication system.
• Launch Segment: which is used to transport space segment elements
to space.
• Ground Segment: which is in charge of spacecraft operations man-
agement and payload data distribution and analysis.
Inside the Euclid ground segment we can distinguish two parts: the Opera-
tions Ground Segment (OGS) and the Science Ground Segment (SGS), the
latter managed in collaboration with ESA and the Euclid Mission Consortium
(EMC, https://www.euclid-ec.org). Within the SGS, we can further identify
three components:
• Science Operation Center (SOC): which is in charge of the space-
craft management and the execution of planned surveys.
• Instrument Operation Teams (IOTs): which are responsible for
instrument calibration and quality control on calibrated data.
• Science Data Centers (SDCs): which are in charge of performing
the data processing and delivering science-ready data products.
Moreover, they are responsible for data validation and quality control.
Figure 2.4: Data processing organization and responsibilities. All science-
driven data processing is performed by the nine SDCs. Source: Romelli et
al. [8], ADASS XXVIII conference proceedings (in press).

The amount of data expected from the mission, considering only the
scientific data, is roughly 100 GB per day, with a total of 30 PB for the entire
mission. After all the processing steps needed to obtain the science
products, the EMC expects about 100 PB of data to handle [9]. For this
reason, this is one of the first times a space mission has begun a path of modernization
towards an IT infrastructure capable of handling big data. Euclid adopts a
distributed architecture for storing and processing data. Referring to Figure
2.4, we see how data processing and management takes place at nine sites in
different countries. Each site is an SDC that manages a part of the data pro-
cessing steps. The Organization Units (OUs) are working groups specialized
in different aspects of the scientific data reduction and analysis. Each SDC
supports one or more OUs, a relation represented
in the figure with a continuous line. A dashed line instead represents a
deputy support. The data products generated throughout the mission are
categorized in five processing levels:
• Level 1: data, as well as the associated telemetry, are unpacked and
decompressed.
• Level 2: data are submitted to a first processing phase that includes
a calibration step and a restoration one, where artifacts due to the in-
strumentation are removed.
• Level 3: data are in a form suitable for their scientific analysis and,
because of this, they are called science-ready data.
• Level E: external data coming from other missions or other projects.
Before being included in the processing cycle, they must be euclidised
to be consistent with the rest of the data.
• Level S: simulated data, useful in the period before the mission for
testing, validating and calibrating the systems developed for processing
the data.
Figure 2.5 shows a diagram of how the data will be processed. After being
downloaded, the data are routed to the SOC, which becomes the point
from which they are distributed. In the transformation of data from level 1 to
level 2, in addition to the data of VIS and NISP (distinguished into NIR for the
photometric part and SIR for the spectroscopic one), external data coming
from the so-called level E are added. These data come from instruments
of other missions, and this reflects the typical e-Science workflow. Equally
emblematic is the insertion of level S data, artificially generated
through simulations and meant to test the system before the
data coming from the satellite are available.
Figure 2.5: Simplified data analysis flow. After downloading the data from
the satellite, they are distributed to the SDCs via the SOC. The level S and
E data, characteristic traits of e-Science, can be spotted. Source: Dubath
et al. [9].
2.3 SDC-IT
The Italian Scientific Data Center (SDC-IT) is one of the nine SDCs
involved in the mission. It is located at the Astronomical Observatory of
Trieste (OATs, http://www.oats.inaf.it) and plays the role of both primary
and auxiliary reference for some OUs [8]. The author carried out this work at
the SDC-IT, which made available its infrastructure and its expertise, along
with the software tools used within the space mission. Euclid has its own
development environment, called the Euclid Development ENvironment (EDEN),
which was used at every stage of this thesis work. Chapter 3 describes it in
detail, together with the software produced by the author.
Chapter 3
Software Development
In the first part of this thesis work, three main activities were carried out. The
first was to become familiar with the development environment used in the
context of the mission. The second was to implement three software
modules compliant with the Euclid rules that could be the foundation for
an elementary scientific processing of astronomical images. Finally,
the third activity involved the creation of a pipeline wrapping the modules
developed in the second phase. The result was an Euclid-like scientific
pipeline running on the Euclid Pipeline Runner (EPR), the workflow manager
designed by the EMC. The development environment used was the
default one set up for the mission, and it is briefly illustrated in
this chapter. Subsequently, the programs used and the code produced are
described extensively. All the software was written in Python 3, as required
by the official Euclid coding rules.
3.1 Development Environment
The success of the Euclid mission depends on the collaboration of dozens
of public and private entities, involving hundreds of people. To ensure that all
the software modules developed can in fact run together, the Consortium
defined a common environment and a set of rules. The environment, called
the Euclid Development ENvironment (EDEN), is a collection of frameworks and
software packages that encloses all the tools available to the developers. As
a distributed file system, the CernVM File System (CVMFS,
https://cernvm.cern.ch/portal/filesystem), developed at CERN as part of its
infrastructure, is used. EDEN is locally available to the developers through
LODEEN (LOcal DEvelopment ENvironment), a virtual machine based on
Scientific Linux CentOS 7. The latest stable version, 2.0, is used in this work.
Table 3.1 lists the main features of EDEN 2.0.

    Feature                  Name       Version
    Operating System         CentOS     7.3
    C++ 11 compiler          gcc-c++    4.8.5
    Python 3 interpreter     Python     3.6.2
    Framework                Elements   5.2.2
    Version Control System   Git        2.7.4

Table 3.1: List of main features in EDEN 2.0. All tools and respective
versions are set for each environment. Source: Romelli et al. [8], ADASS
XXVIII conference proceedings (in press).
LODEEN is a local replica of the EDEN environment for Euclid devel-
opers embedded in a virtual machine. It runs on Scientific Linux CentOS 7
operating system with Mate desktop environment. The version 2.0 is used
in this work.
CODEEN It is a Jenkins-based (https://jenkins.io) system in charge of
performing the continuous integration of all the source code developed inside
the Euclid mission. Jenkins is an open-source automation tool that helps to
perform building, testing, delivery and deployment of software [10].
3.2 Elements
One of the fundamental components of EDEN is Elements,
a framework that provides both CMake facilities and Python/C++ utilities.
CMake (https://cmake.org) is an open-source tool for building, testing and
packaging software. Elements is derived from the CERN Gaudi Project
(http://gaudi.web.cern.ch/gaudi) [9], another open-source project that helps
to build frameworks in the domain of event data processing applications.
Every Elements project must follow a well-defined structure in order to be
used inside the mission environment. Every project has to be placed inside a
default folder of the Linux file system: /home/user/Work/Projects. This
provides a shared common location across the whole environment;
consequently, the name of each developed project has to be unique. The
projects created by the author implement this required structure.
Environment variables Elements needs a few environment variables in
order to build and install the projects. They come predefined in LODEEN,
but they deserve a quick overview. The first variable is BINARY_TAG, which
contains the information for the build. It is composed of four parts separated
by a dash:
1. Architecture's instruction set
2. Name and version number of the distribution
3. Name and version of the compiler
4. Type of build configuration
There are six types of build configurations; the default value is o2g and
represents a standard build. For the specific case of this work,
BINARY_TAG=x86_64-co7-gcc48-o2g. The variable CMAKE_PREFIX_PATH points to
the newest version of the Elements CMake library; in this case it equals
/usr/share/EuclidEnv/cmake:/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/usr.
The environment variable CMAKE_PROJECT_PATH contains the location
of the projects. In this case the value is /home/user/Work/Projects:
/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/opt/euclid, where the
first path indicates the user area for development and the second the
system-installed projects. At build time a project resolves its dependencies
through this variable.
Project Structure An Elements project has to be organized in a well-defined
structure:
ProjectName
CMakeLists.txt
Makefile
ModuleName1
ModuleName2
...
Once the project has been built by means of the make command provided by
CMake, a build.${BINARY_TAG} folder is generated by the framework.
Afterwards, an installation phase is required to execute the program and to
let other projects use it as a dependency. The structure then becomes:
ProjectName
CMakeLists.txt
Makefile
ModuleName1
ModuleName2
...
build.${BINARY_TAG}
InstallArea
Module Structure An Elements module, referred to in the project structure
as ModuleName1 or ModuleName2, is a reusable unit of software and can be
made of both C++ and Python code. It is conceived as an independent unit
that can be placed in any project. The structure is:
ModuleName
CMakeLists.txt
ModuleName
src
python
script
auxdir
ModuleName
conf
ModuleName
test
Finally, an important part of Elements is the E-Run command, thanks to
which projects can be executed within the framework.
3.3 Image Processing
After the astronomical images are acquired, properly merged and cleaned,
it is possible to extract an object catalog from them. These catalogs are then
used to perform the analysis. In this context an object is typically a light
source, such as a star, a galaxy or a galaxy cluster. The extraction phase
can be simplified as a process that involves three main steps:
1. Detection: The detection phase is a process generally referred to as
segmentation. From an astronomical image, the light sources are identified
and extracted from the background. Segmentation of nontrivial images
is one of the most difficult tasks in image processing [11] and represents
a critical phase of the scientific processing, needed to correctly identify
the sources and obtain accurate results.
2. Deblending: Often light sources are not clearly separated from each
other, and at first it is not possible to distinguish the single objects.
It is then necessary to run an additional deblending step, a
procedure for splitting highly overlapping sources.
3. Photometry: The final step for obtaining a source catalog is the
measurement of the luminous flux of the sources. This is done by integrating
the gray-level values of the pixels labeled as one object. Furthermore, it is
possible to apply several masks in order to obtain more accurate measurements.
After the execution of these steps, the output generated is a catalog, some
optional check images and an image highlighting the objects detected, as
shown, for instance, in fig. 3.5c.
3.4 Multi-Software Catalog Extractor
The author developed three Elements projects able to perform some basic
image processing. The whole system is intended to represent a prototype
of an Euclid-like project, and it was essential for gaining confidence with the
environment. Furthermore, these projects were the building blocks of the
Euclid-like pipeline, used to test the workflow managers against the chosen
metrics.
The software was designed to follow the Object Oriented Programming
(OOP) paradigm and to be easily adaptable to new packages or tools for
performing detection, deblending and photometry. As a whole, the software
consists of four Elements projects, three main ones plus one supporting
project with utilities. They are:
• PT_MuSoDetection,
• PT_MuSoDeblending,
• PT_MuSoPhotometry,
• PT_MuSoUtils.
Indeed, in order to use another image processing tool, it is sufficient to
extend the Python class Program located in the PT_MuSoUtils project, at the
path PT_MuSoCore/python/PT_MuSoCore/Program.py, and implement the
abstract method run with the proper logic. Each software module developed
for the catalog extraction uses SExtractor
(https://github.com/astromatic/sextractor) as its image processing tool. This
software, created by Emmanuel Bertin, has the main purpose of extracting
an object catalog from astronomical images [12] through the execution of
detection, deblending and photometry. It is part of the set of legacy software,
which, in this context, indicates external software officially integrated
within EDEN. Since SExtractor is not available as a Python package, the
subprocess module was used to create a new process that calls the software
through a shell command. The final code is in the SExtractor class.
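The SExtractor class itself is not reproduced here; the following is only a minimal sketch of the approach just described, under the assumption that the tool is invoked as the command sex (the class layout, method names and option handling are illustrative, not the actual PT_MuSoCore API).

    import subprocess
    from abc import ABC, abstractmethod


    class Program(ABC):
        """Common interface for any external image-processing tool."""

        @abstractmethod
        def run(self, image_path, options=None):
            """Run the tool on the given image and return the process exit code."""


    class SExtractor(Program):
        """Wraps the SExtractor executable through the subprocess module."""

        executable = "sex"  # assumed command name of the SExtractor binary

        def run(self, image_path, options=None):
            # Build a command line such as: sex image.fits -CATALOG_NAME out.cat
            cmd = [self.executable, image_path]
            for key, value in (options or {}).items():
                cmd += [key, str(value)]
            # check=True raises CalledProcessError if SExtractor exits with an error.
            completed = subprocess.run(cmd, check=True,
                                       stdout=subprocess.PIPE,
                                       stderr=subprocess.PIPE)
            return completed.returncode


    # Usage sketch (option keys follow SExtractor's "-PARAMETER value" convention):
    # SExtractor().run("image.fits", {"-CATALOG_NAME": "objects.cat"})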
3.4.1 Developing and Versioning Methods
Git (https://git-scm.com) was used as the version control system, following
as much as possible a clean and easy-to-understand development workflow.
It was chosen to follow the Gitflow Workflow, defined for the first time by
Vincent Driessen (see fig. 3.1) and later adopted as part of the Atlassian guide
(https://it.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow).
This type of workflow involves a main branch, by convention called master,
which includes all the commits that represent an incremental version of the
software. In fact, no developer can work directly on the master, which can
only be merged with other support branches, in particular a release (for last
tweaks and the version number change) or a hotfix (used to fix critical issues
in the production software). When changes are merged into the master, by
definition that becomes a new product release [13].

Figure 3.1: A successful Git branching model. This workflow was followed
during the development of the Elements projects. It favors a clear development
path that facilitates the creation of a software product, especially within large
teams; however, it remains a good pattern to follow even during autonomous
development. Source: Git Branching - Branching Workflows [14].

As regards the versioning system, semantic versioning was applied to give a
standard enumeration to each project version. A version number is made of
three numbers which in turn represent a major release, a minor release and a
patch. Each number has to be incremented according to the following
scheme [15]:
• Major version: when you make incompatible API changes.
• Minor version: when you add functionality in a backwards-compatible
manner.
• Patch version: when you make backwards-compatible bug fixes.
The second branch that is always present in a Gitflow repository is
develop, to which the changes believed to be stable and ready for a future
release are added. In addition to the release and hotfix branches, a branch
of type feature can be used to develop new features that, once completed and
tested, are merged into the develop branch.
3.4.2 Detection
The detection phase occurs by thresholding. Before proceeding, how-
ever, a filtering is necessary to smooth the noise that would otherwise cause
false positives: a Gaussian filter with standard deviation calibrated for the
particular case is used. Moreover, in astronomical images there is often a
non-uniform background that could generate ambiguities when the thresh-
olding is applied, especially in the most crowded areas. For this reason, as
a first step towards segmentation, a background map is generated that es-
timates the light outside the objects to be detected [16]. This map will be
subtracted from the image. As output of the detection we obtain an image
partitioned into N+1 regions: the first one, marked by pixels with value 0,
is the background, while the remaining N regions are the extracted objects
and are labeled with values from 1 to N.
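The following is a deliberately simplified NumPy/SciPy illustration of the detection-by-thresholding idea just described; it is not the SExtractor algorithm (the background is estimated with a single median value rather than a background map, and the threshold constant is arbitrary).

    import numpy as np
    from scipy import ndimage

    def detect_sources(image, sigma_smooth=2.0, k=3.0):
        """Toy segmentation: smooth, subtract a crude background, threshold, label."""
        smoothed = ndimage.gaussian_filter(image, sigma=sigma_smooth)
        background = np.median(smoothed)     # crude scalar background estimate
        noise = smoothed.std()               # crude noise estimate
        mask = (smoothed - background) > k * noise
        segmentation, n_objects = ndimage.label(mask)
        return segmentation, n_objects       # 0 is background, objects are 1..N

    # Quick check on a synthetic frame with a single bright source:
    image = np.random.normal(0.0, 1.0, size=(64, 64))
    image[30:34, 30:34] += 50.0
    segmentation, n_objects = detect_sources(image)
    print(n_objects)  # expected: 1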
Figure 3.2: Multithreshold deblending. This technique aims to deblend over-
lapping sources that have been detected as a single object. Source: Bertin
[16].
3.4.3 Deblending
After detection, the segmented image is subjected to a filtering phase that
aims to identify distinct overlapping objects that the first thresholding step
has recognized erroneously as a single object. During this phase, composite
objects are deblended using a multithreshold hierarchical method. A hint
of how the algorithm works can be seen in fig. 3.2.
3.4.4 Photometry
The purpose of this last phase is to measure the luminous flux belonging
to each object identified after the deblending phase. For each set of pixels
labeled with the same number, the sum of the pixel values is calculated to
estimate the luminous flux. Some masks can be applied around objects,
called apertures, to obtain different type of measurements (see fig. 3.3).
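Continuing the toy example above, an isophotal-style flux for each detected object can be estimated by summing the pixel values under each label of the segmentation map (again an illustration, not SExtractor's aperture photometry):

    import numpy as np
    from scipy import ndimage

    def measure_fluxes(image, segmentation):
        """Sum the pixel values belonging to each labeled object (label 0 is background)."""
        labels = np.arange(1, segmentation.max() + 1)
        fluxes = ndimage.sum(image, labels=segmentation, index=labels)
        return dict(zip(labels.tolist(), np.atleast_1d(fluxes).tolist()))

    # Returns a {label: flux} dictionary, one entry per detected source.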
Figure 3.3: Four different types of apertures available in SExtractor for the
photometric measurement. Source: Holwerda [17].
3.5 Utils Module
The PT_MuSoUtils module contains five important packages:
• PT_MuSoCore,
• IOUtils,
• SkySim,
• FITSUtils,
• DataModel.
Since each pipeline task can use a different program to perform its
processing step, a system for passing the parameters independently of the rest
of the code has been created. The configuration parameters can be specified
in a JSON (https://www.json.org) file that a parser developed for the occasion
can read. The language was chosen for its human readable, and writable,
structure; it is usually very easy to handle inside the code, with great support
in Python (as in many other programming languages). Any program
can have its own JSON file named <program>.json located in the path
PT_MuSoUtils/PT_MuSoCore/auxdir/config. The structure of the file
follows this base scheme:
{
    "program": {
        "full_name": <program_name>,
        "command_name": <cmd_key>
    },
    "configurations": {
        <configuration_type>: {
            <configuration_unit>: {
                <parameter_name>: <parameter_value>
            }
        }
    }
}
The configuration file is a JSON object with two keys: program and
configurations.
The value of program is a nested JSON object that expects two keys with
string values:
• full_name: indicates the name of the program and is useful for logging
purposes.
• command_name: indicates the terminal command to call if the program
is executed through subprocess.
The value of configurations is a nested JSON object that accepts any number
of items representing a configuration type. Each configuration type is a nested
JSON object that accepts any number of configuration units. Each configu-
ration unit in turn accepts any number of key/string items representing an
input parameter for the application.
• configuration type: a collection of configuration units, intended to
group all the parameters needed for a certain task, e.g. the
detection or the deblending.
• configuration unit: a collection of parameters that can be used
as a modular unit or a preset for some particular instance of the
configuration type, e.g. the detection of light sources in a crowded region
of the sky.
A configuration unit can be called modular because other configuration
units within the same file can inherit its parameters, without the need to
rewrite them, in an OOP fashion. In order to do so, it is sufficient to add
the key inherit, which expects a JSON array as its value. The items of the
array are the configuration unit keys from which to inherit the parameters. If
the configuration unit conf_unit_1 in the configuration type conf_type_1
wants to inherit from the configuration unit conf_unit_2 contained in the
configuration type conf_type_2, then the value of the key inherit has to be
"conf_type_2.conf_unit_2". For example:

"configurations": {
    "base": {
        "catalog_ascii": {
            "-CATALOG_TYPE": "ASCII_HEAD"
        }
    },
    "detection": {
        "background_checkimages": {
            "inherit": [
                "base.catalog_ascii"
            ],
            ...
        }
    }
}
If two parameters have the same key, only the last value encountered by the
parser is preserved. Each configuration type can be called by a different
project, and some of them require loading other external files from their own
auxiliary directory. For this reason, a constant auxdir can be defined inside
a configuration type, and the parser will replace at run time any sub-string
matching "{auxdir}" with the value of the constant.
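The parser code is not reproduced in this chapter; the sketch below only illustrates the inheritance and {auxdir} substitution rules just described (the function name and the flattening strategy are assumptions for illustration, not the actual PT_MuSoCore implementation).

    import json

    def resolve_unit(config, type_name, unit_name, auxdir=""):
        """Return the flat parameter dict of one configuration unit, applying the
        'inherit' references first so that locally defined values win."""
        unit = config["configurations"][type_name][unit_name]
        params = {}
        for ref in unit.get("inherit", []):
            parent_type, parent_unit = ref.split(".")
            params.update(resolve_unit(config, parent_type, parent_unit, auxdir))
        for key, value in unit.items():
            if key == "inherit":
                continue
            if isinstance(value, str):
                value = value.replace("{auxdir}", auxdir)
            params[key] = value
        return params

    # Usage sketch with a file following the scheme above:
    # with open("sextractor.json") as handle:
    #     config = json.load(handle)
    # print(resolve_unit(config, "detection", "background_checkimages",
    #                    auxdir="/path/to/auxdir"))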
3.5.1 The Masking Task
Since the deblending task is a refinement phase with respect to the
detection, it needs to know the segmented image; however, it cannot attempt to
deblend an object if that object is made up of pixels that all have the same
value and is therefore not representative of the original source. For this reason,
the input of the deblending must be the original image masked with the result
of the detection. Pixels that have been labeled as background are set to a value
of -1000 in order not to interfere with the thresholding, while the remaining
pixels are left unchanged. For this purpose an additional Python module,
called FITSMask, was developed.
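A minimal sketch of the masking operation performed by FITSMask (assuming the image and the segmentation map are available as NumPy arrays; the -1000 sentinel is the value quoted above):

    import numpy as np

    def mask_background(image, segmentation, mask_value=-1000.0):
        """Return a copy of the image where background pixels (label 0 in the
        segmentation map) are replaced by a sentinel value; detected sources
        keep their original pixel values."""
        masked = image.copy()
        masked[segmentation == 0] = mask_value
        return masked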
3.5.2 FITS Images
The Flexible Image Transport System (FITS) format was created for
sharing astronomical images among observatories. The control authority for
the format is the International Astronomical Union FITS Working Group
(IAU-FWG, https://fits.gsfc.nasa.gov/iaufwg). The need for this standard
stems from the difficulty of standardizing the format among observatories with
different characteristics and from the impossibility of creating adapters among
all the different formats. Consequently, a standard was created such that
every observatory is able to transform data from FITS to its own internal
format, and vice versa.

A FITS file is composed of blocks of 2880 bytes, organized in a sequence
of Header and Data Units (HDUs), eventually followed by special records. The
header consists of one or more blocks of 2880 bytes and contains stand-alone
information, or metadata, that describes the subordinate unit of data
[18]. Several Python modules are available to manipulate this format, such
as astropy.io.fits, used in this work within the FITSUtils module. This was a
necessary step for the Euclid-like pipeline because the mission uses this file
format for storing and distributing the data collected by the space telescope.
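As an illustration of the HDU structure described above, a short astropy.io.fits sketch (the file name is a placeholder):

    from astropy.io import fits

    # Open a FITS file and inspect its Header and Data Units (HDUs).
    with fits.open("example_image.fits") as hdu_list:
        hdu_list.info()                 # summary of every HDU in the file
        header = hdu_list[0].header     # metadata of the primary HDU
        data = hdu_list[0].data         # pixel array (NumPy) of the primary HDU
        print(header.get("NAXIS"), None if data is None else data.shape)

    # Writing an array (plus header) back to a new FITS file:
    # fits.writeto("output.fits", data, header, overwrite=True)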
3.5.3 SkySim
As the final part of the Euclid-like pipeline, SkySim, a basic sky simulator,
was developed, which has the capability of generating synthetic astronomical
images starting from a source catalog. The only purpose of this module
was to briefly validate the pipeline and check whether the extracted catalog was
consistent with the one given as input. First, an image with no sources was
generated (fig. 3.4), followed by an image with one source (fig. 3.5), with two
identical sources (fig. 3.6), with two sources with different magnitude (fig.
3.7), with two overlapping sources (fig. 3.8) and finally with two overlapping
sources with different magnitude (fig. 3.9). Each of these tests gave a positive
result and the pipeline performed as expected, extracting the right number
of sources with the correct photometric estimation.
Figure 3.4: No sources. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.5: One source. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.6: Two identical sources. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.7: Two sources with different magnitude. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.8: Two identical overlapping sources. (a) Original, (b) Deblended, (c) Photometry.
Figure 3.9: Two overlapping sources with different magnitude. (a) Original, (b) Deblended, (c) Photometry.

Real images are typically degraded by noise. SkySim has the ability to
add additive noise with a Gaussian distribution to an image according to a
predetermined Signal-to-Noise Ratio (SNR) value in dB. Gaussian noise is
defined by its mean µN and its standard deviation σN. Given the desired SNR
of the synthetic image I, it is possible to calculate the σN of the Gaussian
noise to add:

σN = sqrt( var(I) / 10^(SNR/10) )

By adding pixel by pixel the values of the matrices that represent the
noise and the original image, SkySim outputs an image with the chosen
SNR.
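A minimal sketch of this noise-addition step (assuming the image is a NumPy array and the noise has zero mean; it follows the σN relation given above):

    import numpy as np

    def add_gaussian_noise(image, snr_db, rng=None):
        """Add zero-mean Gaussian noise so that the output image has the
        requested signal-to-noise ratio, expressed in dB."""
        rng = rng or np.random.default_rng()
        sigma_noise = np.sqrt(image.var() / 10.0 ** (snr_db / 10.0))
        return image + rng.normal(0.0, sigma_noise, size=image.shape)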
3.6 Pipeline Project
The pipeline project is the part of the developed software that aims to
define and build the Euclid-like pipeline. Two files are required in order
to define a pipeline: the Package Definition and the Pipeline Script. In
the package definition file, as required by the Euclid Pipeline Runner, four
Executable Python objects were implemented, one for each task. Listing 3.1
shows the first task of the pipeline, which performs the detection
phase; its inputs and outputs are specified as paths relative to workdir. The
Executables of masking (listing 3.2), deblending (listing 3.3) and photometry
(listing 3.4) have been defined in a completely similar manner.
pt_sextractor_detection = Executable(
    command=' '.join([
        'E-Run PT_MuSoDetection 0.2',
        'SourceDetectorPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)

Listing 3.1: Definition of the detection task
mask_image = Executable(
    command=' '.join([
        'E-Run PT_MuSoUtils 0.2',
        'FITSMask',
        '--mask_value -1000'
    ]),
    inputs=[Input('fits_image'),
            Input('mask')],
    outputs=[Output('masked')]
)

Listing 3.2: Definition of the masking task
pt_sextractor_deblending = Executable(
    command=' '.join([
        'E-Run PT_MuSoDeblending 0.2',
        'SourceDeblenderPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)

Listing 3.3: Definition of the deblending task
pt_sextractor_photometry = Executable(
    command=' '.join([
        'E-Run PT_MuSoPhotometry 0.2',
        'SourcePhotometerPipeline',
        'sextractor',
        '--log-level debug',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image'),
            Input('fits_image2'),
            Input('assoc_catalog')],
    outputs=[Output('apertures_map'),
             Output('catalog', mime_type='txt')]
)

Listing 3.4: Definition of the photometry task
In addition, the pipeline script file was created, in which the
sources_extractor function, decorated with the @pipeline decorator provided
by the Euclid framework, is defined (see listing 3.5). The logic inside it
specifies how to build the pipeline that will be executed by the Euclid
Pipeline Runner.
@pipeline(outputs=('segmentation_map',
                   'apertures_map',
                   'catalog'))
def sources_extractor(image):
    seg_map, _ = pt_sextractor_detection(fits_image=image)
    masked = mask_image(fits_image=image, mask=seg_map)
    seg_map_deb, deb_catalog = pt_sextractor_deblending(fits_image=masked)
    apertures, catalog = pt_sextractor_photometry(fits_image=masked,
                                                  fits_image2=image,
                                                  assoc_catalog=deb_catalog)
    return seg_map_deb, apertures, catalog

Listing 3.5: Definition of the Euclid-like pipeline that will be executed by
the EPR.
Chapter 4
External Workflow Managers
This thesis proposes a comparison of workflow managers. As a
starting point, it was decided to use the Euclid-like pipeline already developed
and to create two more pipelines implemented by means of two external
tools: Spotify's Luigi and Airbnb's Airflow. They have been chosen because
they are written in Python, and thus EDEN compliant, are open-source, and
are very popular in the data workflow domain.
4.1 Luigi
Luigi (https://github.com/spotify/luigi) is a Python package developed by
the Spotify team and released in 2012 under the Apache License 2.0, the second
major release of the permissive free software license written by the Apache
Software Foundation (https://www.apache.org/licenses/LICENSE-2.0). This
tool helps to build pipelines of batch jobs [19]. Its features include workflow
management, task scheduling and dependency resolution. One of Luigi's strengths
is that it manages failures in a smart way, providing a built-in system for checking
the status of tasks. If a task fails and has to be rescheduled and rerun, Luigi walks
backwards across the dependency graph until it encounters a successfully completed
task. It then reschedules only the tasks downstream of that point, thus not
scrapping the work done without failures. This can save a lot of time and computing
resources if failures are not infrequent or if a pipeline shares some tasks already
executed by another pipeline.
4.1.1 Pipeline Definition
A Luigi pipeline is made of one or more tasks. For each task we can
define an input, an output and the business logic to execute. The input
is the set of parameters passed and the dependency list. The output is
generally a file, called target, that will be written to the local file-system or
to a distributed one, such as Hadoop Distributed File System (HDFS).
Target The target is the product the task has to yield, and it is defined
through a file-system path. Luigi provides the class Target to represent this
concept. By default, the existence of the file is taken as proof that the task
executed successfully, and it determines whether the task status is completed
or not. It is possible, however, to override the default logic that decides when
the task has to be considered done.
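As a minimal sketch of how that default logic can be overridden (the class name and the non-empty-file criterion are illustrative assumptions, not taken from the Euclid pipeline), a task could redefine its complete() method as follows:

import os
import luigi

class CalibrationTask(luigi.Task):
    def output(self):
        return luigi.LocalTarget('/tmp/calibration.txt')

    def complete(self):
        # Consider the task done only if the target exists and is non-empty,
        # instead of relying on mere file existence.
        path = self.output().path
        return os.path.exists(path) and os.path.getsize(path) > 0

    def run(self):
        with self.output().open('w') as out:
            out.write('calibration table\n')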
Task A task consists of four fundamental parts:
• Parameters: are the input arguments for the Task class and, with
the class name, uniquely identify the task. They are defined inside the
class using a Parameter object or a subclass of it.
• Dependencies: are defined through a task collection and set which
other tasks have to be executed successfully before the current one can
start. Such a collection is the object returned by the overridden method
requires.
• Business logic: is defined within the overridden method run. This
part of the code is in charge of producing and storing the output of the task.
• Outputs: are defined through a collection of Target objects. Each
target must point to the exact location of the file created by the business logic.
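A minimal sketch combining these four parts is shown below; the class, parameter and file names are illustrative assumptions and not the actual tasks of the Euclid-like pipeline:

import luigi

class DetectionTask(luigi.Task):
    # Parameters: together with the class name they uniquely identify the task
    workdir = luigi.Parameter()
    fits_image = luigi.Parameter()

    def requires(self):
        # Dependencies: tasks that must complete successfully before this one
        return []

    def output(self):
        # Outputs: targets pointing to the files produced by run()
        return luigi.LocalTarget('%s/detection_catalog.txt' % self.workdir)

    def run(self):
        # Business logic: produce and store the output of the task
        with self.output().open('w') as out:
            out.write('catalog placeholder\n')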
Tasks Execution All the tasks defined in a Luigi pipeline are executed
in the same process, which makes debugging clear but also sets a limit on the
number of tasks a pipeline can be made of. Generally speaking, however, this
does not represent a real problem until thousands of tasks are executed in the
same pipeline [19]. The execution of a task follows these steps:
1. Check if the predicate that defines the completed status is satisfied. If
it is, then check the next task in the graph. If the current task is the
last one, the pipeline is completed.
2. Resolve all the dependencies. If one task from the dependencies is not
completed, then execute that task.
3. Execute the run method.
In order to start the entire pipeline, it is sufficient to call the last task defined
in it and, thanks to the recursive algorithm, all tasks will be executed. Luigi
does not come with an embedded triggering system, but one can easily be
implemented.
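A sketch of such a manual trigger is shown below; it assumes that PhotometryTask is the last task of the pipeline (defined along the lines of the previous sketch, in a hypothetical module catalog_pipeline) and that the central luigid scheduler is running. Calling luigi.build on the last task is enough, since Luigi resolves the dependency graph backwards and runs whatever is not yet complete:

import luigi
# PhotometryTask is assumed to be importable from the module defining the pipeline tasks
from catalog_pipeline import PhotometryTask

luigi.build(
    [PhotometryTask(workdir='/data/run001', fits_image='input.fits')],
    workers=1,
    local_scheduler=False  # use the central luigid scheduler instead of a local one
)

A simple triggering system can then be obtained, for instance, by invoking such a script from a cron job or from a small watcher that monitors the arrival of new input files.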
4.1.2 Integration with Elements
It’s not obvious that two frameworks can operate together and be easily
integrated. In this section we will see how the author proposed to accomplish
the task of developing a Luigi pipeline that executes the Elements projects
previously written and described in Chapter 3.
Task Implementation In the case of this work, the logic to execute is
essentially a call to the Elements project through the bash command E-Run.
The command is preset in EDEN and is associated with the execution of
the script that starts the Elements execution. The ExternalProgramTask, a
subclass of Task, was therefore used. This class is part of Luigi's contrib
module; it manages the logic of the run method and exposes another method,
program_args, whose return value is the list of strings that will be the
argument for the subprocess.Popen class. It was chosen to implement the
pipeline following the behavior of the EPR, so as to obtain a system with a
similar usage. As input, the path of the working directory and the data model
file are required. Optionally, an id parameter can be specified to differentiate
otherwise identical tasks (mainly for test and debug purposes). Each task has
its own target file defined inside it; the path must be relative to the working
directory. The return value of the program_args method is simply the command
seen in listings 3.1 - 3.4, provided as a list of arguments.
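A minimal sketch of such a task is shown below; it reuses the E-Run arguments of the detection Executable in listing 3.1, while the class name, the parameter names and the target file name are illustrative assumptions:

import os
import luigi
from luigi.contrib.external_program import ExternalProgramTask

class DetectionTask(ExternalProgramTask):
    workdir = luigi.Parameter()
    fits_image = luigi.Parameter()
    run_id = luigi.Parameter(default='')  # differentiates otherwise identical tasks

    def output(self):
        # Target file produced by the Elements executable, relative to the working directory
        return luigi.LocalTarget(os.path.join(self.workdir, 'detection_catalog.txt'))

    def program_args(self):
        # List of strings handed by ExternalProgramTask to subprocess.Popen
        return ['E-Run', 'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline', 'sextractor',
                '--preset', 'pipeline',
                '--workdir', self.workdir,
                '--fits_image', self.fits_image]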
4.2 Airflow
Airflow (https://airflow.apache.org) is a Python package developed by Airbnb
and released in 2015 under the Apache License 2.0. In March 2016 the project
joined the Apache Software Foundation's incubation program
(https://incubator.apache.org) and has been growing very quickly since then.
The Incubator project is the path towards becoming part of the Apache Software
Foundation (ASF) for projects that want to contribute to the Apache foundation:
all the code that will become part of the ASF must first pass through an
incubation period within this project [20]. Airflow is a tool for describing,
executing, and monitoring workflows. One of Airflow's strengths is the simplicity
with which a pipeline can be defined, although it does not offer a native way to
specify how intermediate results are managed. Moreover, it provides a rich
interactive graphical user interface that makes it easy to monitor the execution
progress and state.
4.2.1 DAG Definition
In Airflow every pipeline is defined as a Directed Acyclic Graph (DAG).
As a matter of fact, the tool offers a Python class DAG that contains all
the information needed for the tasks execution, such as the dag id, the
dependency graph, the start time, the scheduling period, the number of retries
allowed and many other options. A task is a node of the dependency graph and
is coded as an object of the class BaseOperator. That class is abstract and is
designed to be inherited to define one of the three main operator types:
• Action operators, which perform an action or trigger one
• Transfer operators, in charge of moving data from one system to another
• Sensor operators, which run until a certain criterion is satisfied, such as the
existence of a file or a given time of the day being reached.
4.2.2 Pipelining Elements Projects
As done with Luigi and described in Section 4.1, the author developed an
Airflow pipeline inside the Elements framework for executing the projects
described in Chapter 3. First, a DAG object has to be initialized with a few
mandatory arguments as input: dag_id, default_args and schedule_interval.
Listing 4.1 shows how the DAG can be created.
default_args = {
    'owner': 'airflow',
    'start_date': datetime.utcnow(),
    'email': ['airflow@example.com'],
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(dag_id='catalog_extractor', default_args=default_args,
          schedule_interval='@once')
Listing 4.1: Definition of the main DAG with Airflow for the catalog extractor pipeline.
To define a task performed by a bash command, there is the BashOperator
class, which extends BaseOperator. The string representing the bash command
to be executed can be passed as input to the operator. An example is shown in
listing 4.2.
cmd = ' '.join(['E-Run',
                'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline',
                'sextractor',
                '--preset', 'pipeline',
                '--workdir', work_dir,
                '--logdir', logdir,
                '--fits_image', input_file,
                '--segmentation_map', detect_fits,
                '--catalog', detect_cat
                ])

detection_task = BashOperator(
    task_id='detection',
    bash_command=cmd,
    dag=dag)
Listing 4.2: Definition of the detection task with Airflow.
After defining the tasks, there are several ways to define the dependency
graph. In this case it was chosen to use the shift operator, which is overloaded
in BaseOperator to express the concatenation of tasks.
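Assuming the remaining three tasks are defined with BashOperator in the same way as detection_task in listing 4.2 (the names masking_task, deblending_task and photometry_task are illustrative), the linear dependency graph of the catalog extractor can then be expressed as:

# Each >> marks the left operand as upstream of the right one,
# producing the chain detection -> masking -> deblending -> photometry.
detection_task >> masking_task >> deblending_task >> photometry_task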
Chapter 5
Comparison
The purpose of this work is to compare workflow managers coming from
different contexts and to verify, through a test on the Euclid case study,
whether such external tools can be adopted as an integral part of the system.
Four Elements projects have been developed that act as building blocks for the
pipelines, and the EPR, Luigi and Airflow workflow managers have been studied.
In this chapter we will propose ten metrics to evaluate
whether Luigi or Airflow could be used within the mission. Subsequently, the
executions necessary for the comparison will be performed and the results
will be analyzed.
5.1 Metrics
At this point of the work everything is set to execute the pipelines and
gather the data needed for the comparison. As comparison metrics we pro-
posed:
• Execution time: is an indicator of the overhead of the tool and shows
how the scheduler handles the executions.
• RAM memory usage: will give a measure of the memory usage
needed by the tool. It’s important because the resources are limited.
• CPU usage: will give a measure of the CPU usage needed by the tool.
It’s important because the resources are limited.
• Error handling: will show how the tool can handle failure inside the
system. It’s critical for automatic recovery and minimization of human
intervention.
• Usability and configuration complexity: will show how complex
the tool is to use and configure.
• Distributed computing management: will show the capability of
the tool to execute tasks in a distributed environment.
• Workflow visualization: is important for monitoring the execution
progress and spot any problem inside the system.
• Integration in a framework: will show if the tool is easy to use
inside an existing framework, critical for the current case study.
• Triggering system: will show if the tool is capable of automating the
execution triggering.
• Logging quality: will show whether the logs produced are in fact useful
for the developers to debug the system in case of failure.
At the end of the data collection we will draw the conclusions by comparing the
two types of workflow manager. The machine used for the tests has an Intel(R)
Core(TM) i7-4770 CPU @ 3.40 GHz, with 4 cores and a total of 10.46 GB of RAM
dedicated to the virtual machine.
5.2 Execution Time
In order to profile the three workflow managers with respect to time, it
was sufficient to use the Linux built-in command time and some generated
logs. Each tool was set up to run with no delay and minimum waiting time,
and was tested with three different images given as input to the pipeline: a
small one of 252x252 pixels, a large one of 5000x5000 pixels, and another small
image of 256x256 pixels generated by SkySim, referred to as simulated. All the
results are averaged over 10 runs. The time command reports three numbers:
real time, user time and sys time. Their meanings are [21]:
• real: or wall clock time, indicates the total time elapsed from start to
finish.
• user: CPU time spent within the process outside the Linux kernel, or
in user mode. In user mode, the process can’t directly access hardware
or reference memory. Code running in this mode can perform lower
level accesses only through system APIs.
• sys: CPU time spent within the process inside the Linux kernel, or
in kernel mode. In kernel mode, the process can execute any CPU
instruction and access any memory address, without any restriction.
This mode is generally reserved for low-level functions of the operating
system and must therefore consist of trusted code. Some privileged
instructions, such as interrupt handling and input/output management, can
only be executed in kernel mode; if they are issued in user mode, a trap is
generated.
From tables 5.1, 5.2 and 5.3 we can notice three main differences. First of
all, EPR executes all three pipeline instances in roughly the same time,
due to internal scheduling settings that are not meant to be changed by the
user. Secondly, although EPR and Airflow perform about the same with
the large image as input, in the case of the two smaller images Airflow
         EPR      Luigi    Airflow
real     40.419   16.665   38.325
user     21.258   14.570   21.194
sys       8.346    0.922    1.688
Table 5.1: Time needed for executing the scientific pipeline on the large image. Values are in seconds, divided by real, user and sys time.
         EPR      Luigi    Airflow
real     40.413    2.652   30.171
user     21.481    1.607    8.701
sys       8.649    0.298    1.022
Table 5.2: Time needed for executing the scientific pipeline on the small image. Values are in seconds, divided by real, user and sys time.
         EPR      Luigi    Airflow
real     40.409    2.634   30.008
user     21.871    1.781    8.526
sys       8.816    0.347    1.181
Table 5.3: Time needed for executing the scientific pipeline on the simulated image. Values are in seconds, divided by real, user and sys time.
can complete the pipeline faster, reducing the real time in accordance with the
lower user and sys times. This means that Airflow keeps its overhead steady
in each execution. EPR, on the other hand, always needs the same real, user
and sys times in order to complete all the tasks of the pipeline. This leads to
the conclusion that EPR has fixed time slots in which it performs the tasks, and
this behavior cannot be modified. It must also be noted that its sys time is
consistently much higher than that of the other tools. Finally, Luigi confirms
itself as the most lightweight workflow manager among the three from a time
perspective. Its scheduler executes each task right away, as soon as enough
system resources are available.
5.3 Memory Usage
As memory profiling tool, the Python memory profiler
(https://pypi.org/project/memory-profiler) was chosen: a module for recording
the memory consumption of a process and its children, based on psutil
(https://pypi.org/project/psutil). It was used in the time-based configuration,
where the memory usage is plotted as a function of the execution time. Indeed,
it is possible to directly plot the data after recording them by means of the
same module. In order to start the monitoring, the command mprof run <script>
can be used; after the script execution is done, the command mprof plot shows
the plot of the last recorded run. Because all pipelines are built with Elements
projects as tasks, all of them use at least one subprocess in order to run. For
this reason the flag --include-children (or -C for short) was set first, and then
--multiprocess (or -M for short), to tell memory profiler to consider all the
children created by the main process. The include-children flag adds together
the memory used by the main process and all its children, obtaining a single
comprehensive value at each instant. The multiprocess flag considers all
children independently, keeping the memory usage data separated for each of
them. In this section MB will be used as a synonym of MiB, equivalent to
2^20 bytes, and GB as a synonym of GiB, or 2^30 bytes.
This method was used to profile all three workflow managers, although it
was not possible to obtain meaningful results from the EPR execution due to
its implementation; the workaround will be described later in this chapter.
After profiling the three workflow managers, each one executing the scientific
pipeline with the three images as inputs, the plots in figs. 5.1 - 5.9 were
obtained. Figures 5.4 - 5.9 show, as expected, an increase in memory usage
due to the image processing in correspondence with the execution of the four
tasks. On the other hand, figures 5.1 - 5.3, the plots of EPR RAM usage, seem
to show no detectable execution. Comparing fig. 5.1b with figs. 5.4b and 5.7b,
it is evident that the peak memory usage is not compatible with the amount
required by the second task execution (about 350 MB). Furthermore,
the three executions, which have to process images with significant differences
in size, seem to require the same amount of memory, with an average of
roughly 45 MB. Analyzing the data gathered per process, we can see that
every execution presents a main process of 38 MB and a main child process
of 13 MB. One possible explanation for this behaviour could be the resource
limitation imposed by the EPR, but this limit was set to 1 GB, far above both
the values observed and the amount needed by the tasks. Consequently, a
deeper study of the Python source code related to scheduling and pipeline
execution in the Euclid software was conducted. It was found out that the
tool is in part built by means of the package Twisted (https://twistedmatrix.com)
and its reactor component, to which the actual task execution is delegated.
The reactor works in the background in a separate thread and communicates
with the main thread through a callback system. This behavior is not detected
by the profiler, which records only the memory used by EPR and Twisted,
explaining why all plots shared a common trend. This finding led to trying
other profilers, but none seemed to work properly in this situation. It was then
decided to develop a custom tool for memory profiling, using the features of
the top bash command, which can report both memory and CPU usage; this
turned out to be very useful also in the CPU load analysis.
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. The memory usage is similar throughout the duration of the profiling.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Also in this case, no memory allocations compatible with the image processing under examination are observed.
Figure 5.1: EPR RAM memory usage versus time (large image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. The memory usage is similar throughout the duration of the profiling, as it happens in fig. 5.1.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Also in this case, no memory allocations compatible with the image processing under examination are observed.
Figure 5.2: EPR RAM memory usage versus time (small image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. The memory usage is similar throughout the duration of the profiling, as it happens in fig. 5.1.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Also in this case, no memory allocations compatible with the image processing under examination are observed.
Figure 5.3: EPR RAM memory usage versus time (simulated image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline. Three of them are SExtractor processes (in green), while the highest peak, in red, represents the memory usage of the Python masking project.
Figure 5.4: Luigi RAM memory usage versus time (large image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured.
Figure 5.5: Luigi RAM memory usage versus time (small image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured.
Figure 5.6: Luigi RAM memory usage versus time (simulated image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Airflow needs seven processes in order to execute the pipeline.
Figure 5.7: Airflow RAM memory usage versus time (large image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Airflow needs seven processes in order to execute the pipeline.
Figure 5.8: Airflow RAM memory usage versus time (small image).
(a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which the peak occurred. Four main peaks are clearly visible; they are associated with the execution of the four tasks in the pipeline.
(b) Memory usage divided by process: the main process is shown in black, while its children are coloured. Airflow needs seven processes in order to execute the pipeline.
Figure 5.9: Airflow RAM memory usage versus time (simulated image).
5.3.1 Top Based Profiling Tool
The Euclid Pipeline Runner has a structure that does not allow profiling its
execution directly; in fact, it runs the tasks asynchronously through a Twisted
job submission. The top command provides an overview of the resource usage
of the whole system at a particular instant, both as overall statistics and
divided by process. The visualization is interactive and it is possible to display
processes sorted by memory or CPU usage. Top also allows specifying the time
interval between samples, in this case 0.1 s. The -c flag tells top to show the
command line associated with each process instead of just the program's name;
this was necessary in order to include the right processes in the profiling, since
top shows a system snapshot and it is crucial to manually filter all and only the
wanted tasks. Finally, the -b flag was set to run the command in batch mode,
which is needed when the output has to be redirected into a file. Each sample
was indeed written to a file, which was subsequently pre-processed until a
Python array ready for the actual analysis was obtained. In the end, the bash
command was top -d 0.1 -c -b > out.top. The pre-processing and processing
phases were written in a Python script. The steps used for the pre-processing
were:
1. Load the content of the top file as string;
2. Extraction of lines containing the keywords that identify the processes
to profile;
3. Additional processes filtering to remove unwanted items;
4. Replacement of multiple blank lines with a single one;
5. Text strip to remove void parts at the beginning and end of the file
content;
6. Columns extraction, keeping only the memory and CPU load values;
7. Split string with blank line as separator, obtaining a sample array, where a sample contains the values of one or more simultaneous processes;

[Plot: memory used (in MiB) versus time (in seconds).]
Figure 5.10: EPR memory usage versus time obtained with the custom profiler (large image). Unlike what we saw in fig. 5.1a, in this case the four peaks corresponding to the tasks execution are clearly visible.
8. Split sample lines with space as separator, obtaining a values array for
each process;
9. For each sample, sum corresponding values of all relative processes,
obtaining one value per parameter per sample.
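As a minimal sketch of how this pre-processing can be implemented (the file name out.top, the keyword list and the column indices are illustrative assumptions, not the exact script used in this work), the steps above could look roughly as follows:

import re

def parse_top_output(path, keywords, mem_col=5, cpu_col=8):
    # Assumes the default top batch layout, where RES (in KiB) is column 5
    # and %CPU is column 8; both indices depend on the local configuration.
    with open(path) as f:                          # step 1: load the file as a string
        text = f.read()

    kept = []
    for line in text.splitlines():                 # steps 2-3: keep only wanted processes
        if any(key in line for key in keywords):
            kept.append(line)
        elif not line.strip():
            kept.append('')                        # blank lines mark sample boundaries

    text = re.sub(r'\n\s*\n+', '\n\n', '\n'.join(kept)).strip()   # steps 4-5

    samples = []
    for block in text.split('\n\n'):               # step 7: one block per top snapshot
        mem, cpu = 0.0, 0.0
        for line in block.splitlines():            # step 8: one line per process
            cols = line.split()                    # step 6: column extraction
            mem += float(cols[mem_col]) / 1024.0   # KiB -> MiB
            cpu += float(cols[cpu_col].replace(',', '.'))
        samples.append((mem, cpu))                 # step 9: sum over simultaneous processes
    return samples

# usage (hypothetical keywords): one (memory, CPU) pair per 0.1 s sample
# samples = parse_top_output('out.top', ['sextractor', 'FITSMask'])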
[Plot: memory used (in MiB) versus time (in seconds).]
Figure 5.11: EPR memory usage versus time obtained with the custom profiler (small image). Unlike what is shown in fig. 5.2a, in this case the four peaks corresponding to the tasks execution are clearly visible.
[Plot: memory used (in MiB) versus time (in seconds).]
Figure 5.12: EPR memory usage versus time obtained with the custom profiler (simulated image). Unlike what is shown in fig. 5.3a, in this case the four peaks corresponding to the tasks execution are clearly visible.
The profiling was repeated also for Luigi and Airflow in order to have a
check on the reliability of the method. The results obtained with memory
profiler and with this tool were compared, verifying the consistency of the two.
The plots obtained by means of the custom profiler are shown in figs. 5.13 and
5.14; the results proved to be consistent with those previously obtained.
[Plots of memory used (in MiB) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.13: Luigi RAM memory usage with the custom profiler. The results are consistent with what is shown in figs. 5.4 - 5.6.
[Plots of memory used (in MiB) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.14: Airflow RAM memory usage with the custom profiler. The results are consistent with what is shown in figs. 5.7 - 5.9.
5.4 CPU Usage
For CPU profiling, the approach developed during the memory analysis was
used. As already mentioned, the use of top was a big help also for gathering
data about CPU usage; therefore a Python array of CPU usage was obtained
in the same manner described in subsection 5.3.1. Within the developed
pipeline, each task has to run sequentially. The percentage of CPU load refers
to the 4 cores dedicated to the virtual machine and, because of that, we expect
to see an upper bound of 25% on the load. On the other hand, EPR runs
multiple processes, so loads over 25% may be observed. As expected, the plots
in fig. 5.15 show that the CPU usage of EPR can be as high as 43%, while in
figs. 5.16 and 5.17 the task execution does not exceed the full usage of one
core, except for some isolated peaks. Of course, both Luigi and Airflow can
work in parallel on a multi-core machine if the pipeline and its tasks allow it.
The interesting fact to note is how the resources are used during the idle time,
with a noticeable difference between Airflow and EPR: the former uses
negligible to no CPU resources when no task is executed, while the latter seems
to require a constant CPU usage. The cause of this behavior is the cyclic
polling that EPR needs in order to check the execution status of the tasks.
Luigi, as shown in fig. 5.16, uses as many resources as possible with minimum
idle time. This characteristic makes it the quickest workflow manager among
the three, but it is an aspect to consider carefully when designing and
launching the pipelines, in order not to saturate the resources available on the
system. In all cases we can notice a peak in the CPU usage due to the
initialization of the scheduler and the web server.
[Plots of CPU usage (%) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.15: EPR CPU usage. This workflow manager constantly occupies a certain amount of CPU to perform periodic polling of the Reactor, the Twisted component to which the tasks execution is delegated.
[Plots of CPU usage (%) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.16: Luigi CPU usage. Luigi exploits all the resources available to execute the tasks and has no waiting time.
[Plots of CPU usage (%) versus time (in seconds): (a) large image, (b) small image, (c) simulated image.]
Figure 5.17: Airflow CPU usage. Airflow exploits all the resources available to execute the tasks and presents some waiting time.
5.5 Error Handling
Error handling describes how the workflow manager behaves in case of failure
during the pipeline execution. An exception can be raised at any point of the
pipeline execution, and it is the workflow manager's duty to handle the situation
and prevent or minimize repercussions on the system. Typically, two kinds of
action can be taken:
1. Abort and cancel the job
2. Abort and reschedule the job
Within the scientific context we are considering, it is rare that a failed
pipeline is no longer needed and is thus cancelled. In fact, the scientific value
of the output is preserved even if the result becomes available some time later,
there being no deadline or real-time purpose. Ideally, then, we want every
triggered pipeline to be completed successfully. For these reasons, we will
analyze the workflow managers' behavior in the case of failure in the specific
instance where it is desirable that the pipeline is rescheduled until its
completion.
In order to simulate a failure, we introduced an error in the second task of
the Euclid-like pipeline developed previously, making it impossible to complete
successfully. The behavior of each workflow manager in this situation was then
observed, noticing, as expected, a stop in the pipeline execution. EPR does not
implement any recovery strategy in case of pipeline failure. This feature is
foreseen in future versions of EDEN; for the moment the continuous deployment
tool guarantees that a stable version of the pipeline is available. In the
scientific context it is often necessary for an expert to validate the output, thus
requiring some sort of human intervention in any case. However, an error
management component brings an important automation that reduces the time
needed to reset the environment and restart the pipeline. EPR notifies the user
by highlighting in red the task in the dataflow graph where the error has
occurred and gives a log message with the traceback of the execution (fig. 5.18).
From this state it is not possible to recover the pipeline execution. The user
then has to fix the issue and re-run the entire pipeline, performing all tasks
again, including those already completed successfully.

Figure 5.18: EPR error notification. The node in red highlights the task where the failure occurred. At the same time the traceback of the execution is shown.
Luigi handles a task crash in a smarter way, keeping the work done
successfully before the failure. As we can see in listing 5.1, Luigi notifies the
user through the terminal and, at the same time, through the Graphical User
Interface (GUI) (figs. 5.19 and 5.20).
===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 ran successfully:
    - 1 DetectionTask(<params>)
* 1 failed:
    - 1 MaskingTask(<params>)
* 2 were left pending, among these:
    * 2 had failed dependencies:
        - 1 DeblendingTask(<params>)
        - 1 PhotometryTask(<params>)

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====
Listing 5.1: Luigi terminal output in case of task failure. In this case a scenario where the second task fails is simulated.

Figure 5.19: Luigi error notification on the task list page. The task where the error occurred is highlighted and a warning on the downstream task is shown.

Figure 5.20: Luigi error notification on the dependency graph page. The task where the error occurred is highlighted (red dot).
In order to recover from this state, it is sufficient to fix the issue and re-run
the pipeline without any other intervention. Luigi automatically checks each
task, from the last to the first, and if the required target already exists for a
specific task, it recovers the pipeline execution from that point. This simple yet
very efficient way of handling errors makes Luigi an interesting choice if the
failure rate is high, and it is very handy in the development phase. During the
test, we simply fixed the error in the second task and triggered the pipeline
again. The execution was a success and the final output was generated as
expected.
For the outputs generated after the recovery, see listing 5.2 and figs. 5.21 and 5.22.

Figure 5.21: Luigi recovered execution on the task list page. All tasks are green, which indicates a successful execution.
===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 complete ones were encountered:
    - 1 DetectionTask(<params>)
* 3 ran successfully:
    - 1 DeblendingTask(<params>)
    - 1 MaskingTask(<params>)
    - 1 PhotometryTask(<params>)

This progress looks :) because there were no
failed tasks or missing dependencies

===== Luigi Execution Summary =====
Listing 5.2: Luigi terminal output after pipeline recovery.

Figure 5.22: Luigi recovered execution on the dependency graph page. All tasks are green, which indicates a successful execution.
Another interesting feature that comes with this type of error handling is
that if a pipeline is incorrectly triggered after a successful execution, Luigi
won’t run any task and instead notifies the user that the wanted output is
already available (see listing 5.3).
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 complete ones were encountered:
    - 1 PhotometryTask(<params>)

Did not run any tasks
This progress looks :) because there were no
failed tasks or missing dependencies

===== Luigi Execution Summary =====
Listing 5.3: Luigi terminal output in case a completed pipeline is executed again.

Figure 5.23: Airflow error notification. Each column represents a pipeline run and each block is a task. The failed task is the red square, while the red circle represents the pipeline that has not been successfully completed.
Airflow takes on an intermediate behavior, requiring some manual
intervention to specify from where to recover the pipeline. All the notifications
go through the web server, and one of the several screens showing the pipeline
execution status can be seen in fig. 5.23.
5.6 Usability
Usability takes into consideration the installation process and configuration,
the pipeline definition and the execution. The setup taken into account is: web
server and scheduler running in the background with the help of a systemd
service, and manual pipeline startup.
5.6.1 EPR Installation and Configuration
The Euclid Pipeline Runner needs these steps in order to be installed and
configured:
1. Deploy to CVMFS or clone the libraries from the repository.
2. Define the configuration file for the systemd service into
/etc/systemd/system/euclid-ial-wfm.service/local.conf.
3. Define the configuration file for the Interface Abstraction Layer (IAL), the
abstraction layer that allows the execution of data processing software in each
SDC independently of the IT infrastructure, into
/etc/euclid-ial/euclid_prs_app.cfg.
4. Define the server configuration file into /etc/euclid-ial/euclid_prs.cfg.
5. Load into the environment the two files /etc/profile.d/euclid.sh
and /etc/euclid-ial/euclid_prs.cfg.
6. Load into the environment the variable
EUCLID_PRS_CFG=/etc/euclid-ial/euclid_prs_app.cfg.
7. Start the service daemon through the bash commands systemctl
enable euclid-ial-wfm and systemctl start euclid-ial-wfm.
For pipeline definition, the EPR needs the tasks to be defined as bash
commands that in turn call a Euclid project module. Two files have to be
created to build the pipeline:
• Package definition: where the tasks are created by wrapping the bash
commands in Executable objects, also defining the inputs, the outputs and the
maximum resources allowed to be allocated for the specific task.
• Pipeline script: where the pipeline and its dependency graph are
constructed by combining the Executable objects outlined in the package
definition.
In this manner every task is standalone and reusable as it is in any pipeline.
Therefore the EPR offers a good modular structure for tasks and pipelines
and a built-in capability for input/output definition and parameter passing.
5.6.2 Luigi Installation and Configuration
Luigi needs only three steps to be ready for use:
1. Install the Python package: pip install luigi.
2. Add luigid.service to /usr/lib/systemd/system defining a stan-
dard systemd service configuration.
3. Start the webserver service daemon through the bash commands systemctl
enable luigid and systemctl start luigid.
Luigi does not distinguish between task definition and pipeline construction:
the dependencies are specified directly inside each task, tying them to a
particular data flow. It is however possible to write the task's logic within
an external function and reuse it in a modular way. Since the dependencies
are explicit inside a task, Luigi offers a built-in system to define inputs
and outputs and to pass parameters throughout the pipeline. It does not
guarantee clean code, because every parameter has to be passed as an input
argument to each task downstream of the one that needs it. However, there are
implementations that compensate for this shortcoming, e.g. SciLuigi
(https://github.com/pharmbio/sciluigi).
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study
Workflow management solutions: the ESA Euclid case study

More Related Content

Similar to Workflow management solutions: the ESA Euclid case study

Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...
Lorenzo D'Eri
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
Adel Belasker
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
BizuayehuDesalegn
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalGustavo Pabon
 
Master_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuMaster_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuJiaqi Liu
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
jeevanbasnyat1
 
Stale pointers are the new black - white paper
Stale pointers are the new black - white paperStale pointers are the new black - white paper
Stale pointers are the new black - white paperVincenzo Iozzo
 
Project final report
Project final reportProject final report
Project final report
ALIN BABU
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Nóra Szepes
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbruslettoLars Brusletto
 
Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Pieter Van Zyl
 
phd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_enphd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_enPierre CHATEL
 

Similar to Workflow management solutions: the ESA Euclid case study (20)

Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...
 
MSc_Thesis
MSc_ThesisMSc_Thesis
MSc_Thesis
 
Work Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel BelaskerWork Measurement Application - Ghent Internship Report - Adel Belasker
Work Measurement Application - Ghent Internship Report - Adel Belasker
 
Report-V1.5_with_comments
Report-V1.5_with_commentsReport-V1.5_with_comments
Report-V1.5_with_comments
 
KHAN_FAHAD_FL14
KHAN_FAHAD_FL14KHAN_FAHAD_FL14
KHAN_FAHAD_FL14
 
Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
thesis
thesisthesis
thesis
 
thesis
thesisthesis
thesis
 
SW605F15_DeployManageGiraf
SW605F15_DeployManageGirafSW605F15_DeployManageGiraf
SW605F15_DeployManageGiraf
 
FULLTEXT01.pdf
FULLTEXT01.pdfFULLTEXT01.pdf
FULLTEXT01.pdf
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
UCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_finalUCHILE_M_Sc_Thesis_final
UCHILE_M_Sc_Thesis_final
 
Master_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_LiuMaster_Thesis_Jiaqi_Liu
Master_Thesis_Jiaqi_Liu
 
AUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdfAUGUMENTED REALITY FOR SPACE.pdf
AUGUMENTED REALITY FOR SPACE.pdf
 
Stale pointers are the new black - white paper
Stale pointers are the new black - white paperStale pointers are the new black - white paper
Stale pointers are the new black - white paper
 
Project final report
Project final reportProject final report
Project final report
 
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
Thesis - Nora Szepes - Design and Implementation of an Educational Support Sy...
 
masteroppgave_larsbrusletto
masteroppgave_larsbruslettomasteroppgave_larsbrusletto
masteroppgave_larsbrusletto
 
Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010
 
phd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_enphd_thesis_PierreCHATEL_en
phd_thesis_PierreCHATEL_en
 

Recently uploaded

Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 

Recently uploaded (20)

Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 

  • 4. Introduction Space missions, and scientific research in general, need to handle an ever-increasing amount of data, having recently stepped de facto into big data territory. An infrastructure capable of managing this volume of data is required, including a proper workflow manager. Adopting an external tool, and thus avoiding the need to develop in-house software, can be a time-saving choice that allows more resources to be diverted towards science-related activities. Luigi and Airflow are two workflow managers developed in an industrial context, respectively by Spotify and Airbnb. They are software tools made available to the public by means of an open-source license which, as a result, has allowed them to be constantly maintained and improved by a large community. The aim of this thesis is to test the feasibility of using Luigi or Airflow as workflow manager within a large scientific project such as the Euclid space mission, which is the case study of this work. The test consists in a comparison between the two workflow managers and the one currently used by the Euclid Consortium developers: the Euclid Pipeline Runner (EPR). The comparison is of interest to the scientific community since it provides an overview of the tools already available, showing whether they fit its requirements and offer the necessary features. It is also a valuable opportunity to estimate the overall performance and characteristics a workflow manager can provide, supporting the decision to add some components to the in-house design or even to adopt the external tool completely. In order to capture the behavior of these tools during their operation, it is necessary to
  • 5. define a pipeline. For this reason, the first step of this work was to become familiar with the Euclid development environment, within which, later, four software projects were created following the constraints of the mission framework. Afterwards, a Euclid-like scientific pipeline was built by means of the four projects, obtaining a first pipeline running thanks to EPR. Two other pipelines were then built, one for Luigi and one for Airflow, in order to obtain all the necessary elements for the comparison. The comparison between Luigi, Airflow and EPR was performed by means of ten metrics, chosen based on the main needs expressed by the developers of the Euclid environment: • Execution time • RAM memory usage • CPU usage • Error handling • Usability and configuration complexity • Distributed computing • Workflow visualization • Integration in a framework • Triggering system • Logging quality The results obtained seem encouraging and suggest that the external workflow managers can actually be used within a space mission environment, bringing some performance improvement and offering several additional features compared to EPR. This work is structured as follows: in Chapter 1 an introduction to workflow managers will be given, explaining the need to use them to manage large amounts of data and the role they play in the scientific field. In Chapter 2 an overview of the Euclid mission will be given, touching on its scientific objectives and the organizational structure behind the part of the mission that handles the scientific data. Chapter 3 will be dedicated to describing the development environment of the mission and
  • 6. the development phases of the software realized for this work. In Chapter 4 Luigi and Airflow, the external workflow managers, will be introduced, along with their main features and characteristics. Finally, Chapter 5 will focus on the description of the comparison metrics and the results obtained.
  • 7. Introduzione (Italian) Le missioni spaziali, e in generale la ricerca scientifica, devono gestire una quantità sempre maggiore di dati, entrando di recente nel territorio dei big data. La gestione di un tale volume di dati presenta delle sfide che devono essere affrontate con le giuste infrastrutture e disponendo degli strumenti software appropriati, come, ad esempio, un workflow manager. Adottare uno strumento esterno, evitando quindi di dover sviluppare in casa il software, può rappresentare una scelta in grado di far risparmiare tempo che può essere dedicato di conseguenza alle attività prettamente scientifiche. Luigi e Airflow sono due workflow manager sviluppati in un contesto industriale, rispettivamente da Spotify e Airbnb. Essi sono degli strumenti software resi disponibili al pubblico per mezzo di una licenza open-source che, di conseguenza, ha permesso loro di essere mantenuti e migliorati da un'ampia comunità. Lo scopo di questa tesi è quello di valutare la possibilità di utilizzare Luigi o Airflow come workflow manager all'interno di un grande progetto scientifico come la missione spaziale Euclid, caso di studio per questo lavoro. L'analisi consiste nel confrontare i due workflow manager con quello attualmente utilizzato dagli sviluppatori dell'Euclid Consortium: l'Euclid Pipeline Runner (EPR). Questo confronto risulta interessante per la comunità scientifica poiché permette di ottenere una panoramica degli strumenti già disponibili, stabilire se essi rientrano nei requisiti del sistema ed offrono le caratteristiche necessarie. Inoltre, risulta una preziosa opportunità per valutare le caratteristiche e le prestazioni complessive che un workflow manager può offrire,
  • 8. fornendo dei dati a supporto della decisione di adottare uno strumento esterno o aggiungere alcune componenti software al proprio progetto. Per ottenere dei dati riguardo il comportamento di questi strumenti durante la loro esecuzione, è necessario definire una pipeline. Quindi, il primo passo di questo lavoro è stato quello di acquisire confidenza con l'ambiente di sviluppo di Euclid, all'interno del quale sono stati successivamente creati quattro progetti software rispettando i vincoli del framework della missione. Tali progetti sono stati poi combinati in cascata, ottenendo una prima pipeline Euclid-like eseguibile con EPR. In seguito sono state costruite altre due pipeline, una per Luigi e una per Airflow, così da disporre di tutti gli elementi necessari per effettuare il confronto. Il confronto tra Luigi, Airflow e EPR è stato condotto per mezzo di dieci metriche, scelte in base alle principali necessità espresse dagli sviluppatori dell'ambiente euclideo: • Tempo di esecuzione • Utilizzo della memoria RAM • Utilizzo della CPU • Gestione degli errori • Complessità di utilizzo e configurazione • Capacità di operare in modalità distribuita • Visualizzazione del flusso di lavoro • Integrazione in un framework • Capacità di avvio automatico delle esecuzioni • Qualità dei log generati I risultati ottenuti sembrano incoraggianti e suggeriscono che i workflow manager esterni possano essere di fatto utilizzati all'interno dell'ambiente di sviluppo di una missione spaziale, portando qualche miglioramento nelle prestazioni e offrendo alcune caratteristiche aggiuntive rispetto ad EPR. Questo lavoro è strutturato nel modo seguente: nel Capitolo 1 verranno introdotti i workflow manager, spiegando le necessità che spingono al loro
  • 9. utilizzo e il ruolo che essi possiedono nel campo scientifico. Nel Capitolo 2 sarà proposta una panoramica della missione Euclid, descrivendo i suoi obiettivi scientifici e la struttura organizzativa che si occupa della gestione dei dati scientifici. Il Capitolo 3 sarà dedicato all'ambiente di sviluppo della missione e al software sviluppato durante questo lavoro. Nel Capitolo 4 verranno introdotti i workflow manager esterni, Luigi e Airflow, descrivendo le loro principali caratteristiche. Infine, il Capitolo 5 sarà dedicato alla definizione delle metriche di confronto e all'esposizione dei risultati ottenuti.
  • 10. Chapter 1 Workflow Management Data is a new, extremely valuable resource and is collected at an increasing pace, but the true value is represented by the information enclosed inside it. For this reason, a wide range of new tools for big data processing has been developed in recent years. A subset of these tools are called workflow managers and they are in charge of coordinating the data processing steps. Automation capability and fault tolerance are the main required features, characteristics to be implemented in a distributed system. In the scientific community as well we can observe an increasing adoption of large-volume data acquisition [1]. This also applies to the field of astronomical research, the area in which this work has been carried out. In fact, the rapid evolution in computer technology and processing power boosted the design of more and more complex surveys and simulations. For instance, it was gradually possible to add extra dimensions to the data collected, such as time or a third spatial dimension [2]. This extra dimension can be obtained by repeating views of the same object in order to spot transient phenomena, or can be a 3D scan, mapping the sky along the depth axis, as Euclid will be able to do. As has often happened in the history of modern astronomy, also in this
  • 11. decade we are witnessing an explosion in the volume of the datasets used for astronomy, and the amount of data has increased by an order of magnitude compared to just a few years ago. Statistical analysis is now essential to make new discoveries obtained thanks to the correlation of a large volume of data, impossible to process with legacy methods, where often no information system was used at all. 1.1 The Big Data Era With the increasing demand for more and more information to improve the accuracy of scientific research, the world of astronomy has faced data management problems that the industrial world has already begun to solve. This new type of data is part of the phenomenon called big data, although a precise definition of this entity has not yet been established. Big data are identified through their characteristics, among which five are widely accepted: Volume, Velocity, Variety, Veracity and Value, the 5 Vs of big data. • Volume: refers to the amount of data collected, which can no longer be stored in a single node; a complete system must be set up for the correct and efficient management of the data. Furthermore, a structure in the data is no longer guaranteed and relational databases1 are no longer the best choice, resulting in an increase in database management systems that are no longer relational but which imitate a hash table structure. • Variety: represents the lack of homogeneity of the collected data, coming from different sources, unstructured or semi-structured. These types of data need a more intelligent processing chain that can adapt to the case. 1 A relational database stores data in tables consisting of columns and rows. Each column stores a type of data. Data in a table is related with a key, one or more columns that uniquely identify a row within the table.
  • 12. • Velocity: refers to the volume generated per unit of time and also the rate at which data must then be processed and made available. Without an appropriate distributed infrastructure it would be impossible to carry out such a difficult task. • Veracity: represents the guarantee that the data are consistent, reliable and authentic. • Value: refers to an added value that would not be possible to obtain without the use of data with the previous characteristics. More data implies more accurate analysis and more reliable results. 1.1.1 The Arise of e-Science Whatever the final purpose of the research, from exploring the extremely small to mapping the vastness of the visible Universe, the aspects of data management, analysis and distribution are increasingly predominant within scientific experimentation. The science that produces large amounts of data that effectively possess the characteristics of big data is called e-Science, a term coined in the UK in 1999 by John Taylor, then general director of the Office of Science and Technology, who faithfully anticipated the direction of technological development that the scientific field would undertake from that moment on. e-Science is therefore the technological face of modern science, which produces and consumes large amounts of data and, for this reason, must be supported by an adequate infrastructure for storing, distributing and accessing the collected data. This infrastructure is often called Scientific Data e-Infrastructure (SDI). Meanwhile, the term Cyberinfrastructure was created in the United States, which describes the same information and infrastructural needs that e-Science implies [3, 4]. This shows how the phenomenon that developed at the end of the 1990s and early 2000s was in fact involving a large part of the scientific community. From that moment on, the demand for systems with improving performance and capable of handling an ever-increasing data volume has become progressively
  • 13. more important. Adopting the Big Data paradigm for science was possible thanks to the change in mentality started with e-Science, maturing in an improvement of the scientific instruments capable of collecting huge volumes of data and of the SDI infrastructure capable of distributing them appropriately [3]. 1.2 Workflow Manager New software tools, new architectures and new programming paradigms have been developed for the management and processing of the large amounts of data produced in response to new scientific and industrial needs. A subset of the tools developed in this ecosystem are the workflow managers, employed to build systems capable of working in a distributed environment and robust enough to carry out their task without causing a complete stop of the system in case of partial failure. A workflow manager is a software tool that helps to define the dependencies among a set of software modules or tasks. We can identify two main jobs the workflow manager has to accomplish: dependency resolution and task scheduling. The dependency resolution is essential to schedule the tasks in the right order and make sure every module is run if and only if all its dependencies are completed successfully. The scheduler has to decide when each task should be executed in order to optimize the usage of the available resources. 1.2.1 Data Pipeline A data pipeline is a concatenation of tasks, where, generally speaking, the execution result of a module becomes the input for the next one. In this way, the modules can be developed independently and in a modular fashion, where the only requirement to meet is the interface defined between the two. This interface can be as simple as the information about the type of file and
  • 14. its location in the file-system. Two distinct approaches can be identified in defining the pipelines and their workflows: the business one and the scientific one. • A typical business workflow has features such as efficiency in execution, independence between different workflows and human involvement in the process. Another characteristic of this approach is that a pipeline is defined through a control-flow, i.e. the dependencies between tasks are based on their status. For example, if a task X depends on task Y, X is not executed until Y is in the completed state. Finally, the data are typically not streams, and pipeline execution is not continuous but on demand, when there is a need. • A scientific workflow has the task of producing outputs that are the result of experimentation, the instances of different workflows are to some degree correlated with each other, and automation is exploited as much as possible. Although automation is important, so is the possibility to have access to intermediate results that an expert can validate on the fly. The pipeline is focused on the data-flow and no longer on the control-flow, i.e. a task is not executed until its input is available. This approach is therefore called data-driven. The data flow is described by a Directed Acyclic Graph (DAG) where each node represents a task and the graph's topological ordering defines the dependencies. Data is often a continuous flow and all the tasks in a pipeline are usually working on different data at the same time. 1.3 Towards Scientific Workflow Managers For the reasons mentioned previously in this chapter, which include the production of big data in the scientific field, there has been an increase in the use of workflow managers in science, which is no longer conducted by the individual but has become a joint effort of many organizations and national
  • 15. institutes. A scientific workflow manager must be the means that allows a research team to obtain its results and therefore must be as transparent as possible to the user. Among its objectives we can find [4, 5]: • Description of complex scientific procedures, hence reuse of workflows, along with modularity in task construction, becomes important. • Automatic data processing according to the desired algorithms, and the possibility to inspect intermediate results. • Provide high performance computation capabilities with the help of an appropriate infrastructure. • Reduce the amount of time researchers spend working on the tools, allowing them to spend more time conducting research. • Decrease machine time, optimizing software execution instead of increasing physical resources. To move towards these objectives, a huge number of new tools has been developed within the scientific field, including programming languages and whole systems. For example, more than a hundred different custom pipeline managers have appeared in a short time, making it difficult to port systems and code, leading to a lack of result reproducibility [6]. This situation does not help scientific discoveries, which sometimes lose the ability to be tested by independent parties. A solution could be to identify a standard to adopt, or to make the tools developed free and easy to use. A step in the right direction, perhaps, could be to use more generic systems able to satisfy specific needs. One sector that needs generic tools is the industrial one, which has in fact produced highly efficient and easy to use systems. These tools are often not adopted by the scientific community, which relies on products developed in-house. The result is that technologies with a similar scope are created independently by science and industry, thus missing an opportunity to share their capabilities and resources.
  • 16. In this work we want to take the European Space Agency (ESA) Euclid Space Mission [7] and its workflow manager as a case study and verify, through an analysis of ten metrics, whether two popular tools for pipeline management developed in the industrial world can be adopted within a space mission. Contrary to what happens with scientific workflow managers, where the purpose is to solve a specific instance of a certain problem, in the field of IT companies and startups the tendency is to move towards the design of software that is as generic as possible, flexible and adaptable to different scenarios.
  • 17. Chapter 2 Euclid Mission The ESA Euclid mission is a medium class (M-class) mission, part of the Cosmic Vision program. Its main objective is to gather new insights about the true nature of dark matter and dark energy. In this chapter we will explore the main mission characteristics and the basic science behind the measurement of the Universe. 2.1 Mission objective Thanks to the ESA Planck mission1 , researchers confirmed that many questions about the nature of our Universe are currently open. As we see in fig. 2.1, only 4.9% of the matter we are surrounded by is Baryonic Matter, i.e. what is commonly addressed as ordinary matter, such as all the atoms made of protons, neutrons and electrons2 . Another 26.8% is Dark Matter (DM), a component with high mass density that interacts with itself and other matter only gravitationally. Moreover, it doesn't interact with the 1 http://sci.esa.int/planck 2 Although the electron is a lepton, in astronomy it is included as part of the baryonic matter because its mass, with respect to the mass of the proton and the neutron, is negligible.
  • 18. Figure 2.1: Estimated composition of our Universe. Source: http://sci.esa.int/planck. electromagnetic force, making it transparent to the electromagnetic spectrum and really hard to spot. The remaining 68.3% is what is called Dark Energy (DE), for its unknown nature [8]. This component is linked with the accelerated expansion of the Universe, but currently there is no direct evidence for its actual existence. Different models have been developed to explain the nature of the effects observed and currently attributable to dark energy. In order to gather more data that could bring more insights about the nature of dark matter and dark energy, the approach chosen is to observe and analyze two cosmological probes: the Weak Lensing (WL) and the Baryonic Acoustic Oscillation (BAO). The WL effect is caused by a mass concentration that deflects the path of light traveling towards the observer. This effect is detectable only by measuring some statistical and morphological properties of a large number of light sources. Euclid is expected to image about 1.5 billion galaxies, capturing useful data to study the correlation between their shapes, mapping with high precision the expansion and growth history of the Universe [7]. A BAO is a density variation in the baryonic matter caused by a pressure wave formed in the primordial plasma of the universe. Measuring the BAO of the same source at different redshifts allows one to estimate
  • 19. the expansion rate of the universe. 2.1.1 Spacecraft and Instrumentation The Euclid space telescope is a 1.2 m Korsch architecture with a 24.5 m focal length. The spacecraft carries two main instruments that will generate all the data for the mission. They are both electromagnetic spectrum sensors, one specialized for photometry in the visible wavelengths and the other for infrared spectroscopy. • VIS: or VISible instrument, will be used to acquire images in the visible range of the electromagnetic spectrum (550-900 nm). It is made of 36 CCDs, each counting 4069x4132 pixels (see fig. 2.3a). The weak lensing effect will be measured through the data obtained thanks to this instrument. • NISP: or Near Infrared Spectrometer and Photometer, has two components: the near infrared spectrometer, operating in the 1100-2000 nm range, and the near infrared imaging photometer, working in the Y (920-1146 nm), J (1146-1372 nm) and H (1372-2000 nm) bands. It is composed of 16 detectors, each counting 2040x2040 pixels (see fig. 2.3b). The main purpose of this instrument is to measure the BAO at different redshifts. The two instruments will have about the same field of view, 0.54 and 0.53 deg2 respectively, but VIS will offer a much greater resolution. Euclid has as a requirement to perform a wide survey of at least 15,000 deg2 of sky, possibly reaching 20,000 deg2 . In combination with this, another two deep surveys of 20 deg2 each are planned [7].
  • 20. Figure 2.2: Schematic figure of the Thales Alenia Space's concept of the Euclid spacecraft. Source: http://sci.esa.int/euclid.
  • 21. (a) One of the 36 CCDs that will compose the VIS instrument. Source: http://sci.esa.int/euclid. Copyright: e2v. (b) One of the 16 CCDs that will compose the NISP instrument. Source: http://sci.esa.int/euclid. Copyright: CPPM. Figure 2.3: Actual flight hardware for the Euclid spacecraft.
  • 22. 2.2 Ground Segment Organization As in almost every space mission, the Euclid mission has its own space system made of three segments: • Space Segment: which includes the spacecraft along with the communication system. • Launch Segment: which is used to transport space segment elements to space. • Ground Segment: which is in charge of spacecraft operations management and payload data distribution and analysis. Inside the Euclid ground segment we can distinguish two parts: the Operations Ground Segment (OGS) and the Science Ground Segment (SGS), the latter managed in collaboration between ESA and the Euclid Mission Consortium3 (EMC). Within the SGS, we can further identify three components: • Science Operation Center (SOC): which is in charge of the spacecraft management and the execution of planned surveys. • Instrument Operation Teams (IOTs): which are responsible for instrument calibration and quality control on calibrated data. • Science Data Centers (SDCs): which are in charge of performing the data processing and the delivery of science-ready data products. Moreover, they have the job of data validation and quality control. The amount of data expected from the mission, considering only the scientific data, is roughly 100 GB per day, with a total of 30 PB for the entire mission. After all processing steps needed in order to obtain the science 3 https://www.euclid-ec.org
  • 23. Figure 2.4: Data processing organization and responsibilities. All science-driven data processing is performed by the nine SDCs. Source: Romelli et al. [8], ADASS XXVIII conference proceedings (in press). product, the EMC predicts about 100 PB of data to handle [9]. For this reason, it is one of the first times a space mission has begun a path of modernization towards an IT infrastructure capable of handling big data. Euclid adopts a distributed architecture for storing and processing data. Referring to Figure 2.4, we see how data processing and management take place at nine sites in different countries. Each site is an SDC that manages a part of the data processing steps. The Organization Units (OUs) are working groups specialized in different aspects of the scientific data reduction and analysis. Each SDC supports one or more OUs, a relation represented in the figure with a continuous line. A dashed line represents instead a deputy support. The data products generated throughout the mission are categorized in five processing levels:
  • 24. • Level 1: data, as well as the associated telemetry, are unpacked and decompressed. • Level 2: data are submitted to a first processing phase that includes a calibration step and a restoration one, where artifacts due to the instrumentation are removed. • Level 3: data are in a form suitable for their scientific analysis and, because of this, they are called science-ready data. • Level E: external data coming from other missions or other projects. Before being included in the processing cycle, they must be euclidised to be consistent with the rest of the data. • Level S: simulated data, useful in the period before the mission for testing, validating and calibrating the systems developed for processing the data. In Figure 2.5, a diagram of how the data will be processed is shown. After downloading, the data are routed to the SOC, which becomes the point from which they are distributed. In the transformation of data from level 1 to level 2, in addition to the data of VIS and NISP (distinguished in NIR for the photometric part and SIR for the spectroscopic one), external data are added, coming from the so-called level E. These data come from instruments of other missions and this reflects the typical e-Science workflow. Equally emblematic is the insertion of level S data, which are artificially generated through simulations and have the purpose of testing the system before the data coming from the satellite are available.
  • 25. Figure 2.5: Simplified data analysis flow. After downloading the data from the satellite, they are distributed to the SDCs via the SOC. The level S and E data, characteristic traits of e-Science, can be spotted. Source: Dubath et al. [9]. 2.3 SDC-IT The Italian Science Data Center (SDC-IT) is one of the nine SDCs involved in the mission. It is located at the Astronomical Observatory of Trieste4 (OATs) and plays the role of both primary and auxiliary reference for some OUs [8]. The author has carried out this work at the SDC-IT, which has made available its structure and its expertise, along with the software tools used within the space mission. Euclid has its own development environment called the Euclid Development ENvironment (EDEN), which will be used at each stage of this thesis. In Chapter 3 it will be described in detail, as well as the software produced by the author. 4 http://www.oats.inaf.it
  • 26. Chapter 3 Software Development As the first part of this thesis work, three main activities were carried out: the first was to become familiar with the development environment used in the context of the mission. The second activity was to implement three software modules compliant with the Euclid rules that could be the foundation for an elementary scientific elaboration of astronomical images. Finally, the third activity involved the creation of a pipeline wrapping the modules developed in the second phase. The result was a Euclid-like scientific pipeline running thanks to the workflow manager Euclid Pipeline Runner (EPR), designed by the EMC. The development environment used was the default one provided for the mission and it will be briefly illustrated in this chapter. Subsequently, the programs used and the code produced will be extensively described. All the software was written in Python 3, as prescribed by the official Euclid coding rules. 3.1 Development Environment The success of the Euclid mission depends on the collaboration of dozens of public and private entities, involving hundreds of people. To ensure all
  • 27. Feature | Name | Version
Operating System | CentOS | 7.3
C++ 11 compiler | gcc-c++ | 4.8.5
Python 3 interpreter | Python | 3.6.2
Framework | Elements | 5.2.2
Version Control System | Git | 2.7.4
Table 3.1: List of the main features in EDEN 2.0. All tools and respective versions are set for each environment. Source: Romelli et al. [8], ADASS XXVIII conference proceedings (in press). the software modules developed can in fact run together, the Consortium defined a common environment and set of rules. The environment, called Euclid Development ENvironment (EDEN), is a collection of frameworks and software packages that encloses all the tools available to the developers. As a distributed file system, the CernVM File System1 (CVMFS), developed at CERN as part of its infrastructure, is used. EDEN is locally available to the developers through LODEEN (LOcal DEvelopment ENvironment), a virtual machine based on Scientific Linux CentOS 7. The latest stable version, 2.0, is used in this work. In Table 3.1 the main EDEN 2.0 features are listed. LODEEN is a local replica of the EDEN environment for Euclid developers embedded in a virtual machine. It runs on the Scientific Linux CentOS 7 operating system with the Mate desktop environment. Version 2.0 is used in this work. CODEEN is a Jenkins-based2 system in charge of performing the continuous integration for all the source code developed inside the Euclid mission. Jenkins is an open source automation tool that helps to perform building, testing, delivery and deployment of software [10]. 1 https://cernvm.cern.ch/portal/filesystem 2 https://jenkins.io
  • 28. 3.2 Elements As one of the fundamental components of EDEN we can find Elements, a framework that provides both CMake facilities and Python/C++ utilities. CMake3 is an open-source tool for building, testing and packaging software. Elements is derived from the CERN Gaudi Project4 [9], another open source project that helps to build frameworks in the domain of event data processing applications. Every Elements project must follow a well defined structure in order to be used inside the mission environment. Every project has to be placed inside a default folder in the Linux file-system: /home/user/Work/Projects. This provides a shared common location in the whole environment. Consequently, the name of each developed project has to be unique. The projects created by the author implement this required structure. Environment variables Elements needs a few environment variables in order to build and install the projects. These come predefined in LODEEN but they deserve a quick overview. The first variable we will see is BINARY_TAG, which contains the information for the build. It is composed of four parts separated by a dash: 1. Architecture's instruction set 2. Name and version number of the distribution 3. Name and version of the compiler 4. Type of build configuration There are six types of build configurations; the default value is o2g and represents a standard build. For the specific case of this work, BINARY_TAG=
  • 29. x86_64-co7-gcc48-o2g. The variable $CMAKE_PREFIX_PATH points to the newest version of the Elements CMake library. In this case it equals /usr/share/EuclidEnv/cmake:/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/usr. The environment variable CMAKE_PROJECT_PATH contains the location of the projects. In this case the value is /home/user/Work/Projects:/cvmfs/euclid-dev.in2p3.fr/CentOS7/EDEN-2.0/opt/euclid, where the first path indicates the user area for development and the second the system-installed projects. At build time a project resolves its dependencies through this variable. Project Structure An Elements project has to be organized in a well defined structure:
ProjectName
    CMakeLists.txt
    Makefile
    ModuleName1
    ModuleName2
    ...
Once the project is built by means of the CMake command make, a build.${BINARY_TAG} folder is generated by the framework. Afterwards, an installation phase is required to execute the program and let other projects use it as a dependency. The structure then becomes:
ProjectName
    CMakeLists.txt
    Makefile
    ModuleName1
    ModuleName2
    ...
    build.${BINARY_TAG}
    InstallArea
Module Structure An Elements module, referred to in the project structure as ModuleName1 or ModuleName2, is a reusable unit of software and can be made of both C++ and Python code. It is thought of as an independent unit
  • 30. that can be placed in any project. The structure is:
ModuleName
    CMakeLists.txt
    ModuleName
    src
    python
    script
    auxdir
        ModuleName
    conf
        ModuleName
    test
Finally, an important part of Elements is the E-Run command, thanks to which projects can be executed within the framework. 3.3 Image Processing After the astronomical images are acquired, properly merged and cleaned, it is possible to extract from them an object catalog. These catalogs are then used to perform the analysis. In this context an object is typically a light source, such as a star, a galaxy or a galaxy cluster. The extraction phase can be simplified as a process that involves three main steps: 1. Detection: The detection phase is a process generally referred to as segmentation. From an astronomical image, the light sources are identified and extracted from the background. Segmentation of nontrivial images is one of the most difficult tasks in image processing [11] and represents a critical phase of the scientific processing in order to correctly identify the sources and obtain accurate results. 2. Deblending: Often light sources are not clearly separated from each other and at first it is not possible to distinguish the single
  • 31. object. It is then necessary to run an additional step of deblending, a procedure for splitting highly overlapped sources. 3. Photometry: The final step for obtaining a source catalog is the measurement of the luminous flux of the sources. This is done by integrating the gray level value of the pixels labeled as one object. Furthermore, it is possible to apply several masks in order to obtain more accurate measures. After the execution of these steps, the output generated is a catalog, some optional check images and an image highlighting the objects detected, as shown, for instance, in fig. 3.5c. 3.4 Multi-Software Catalog Extractor The author developed three Elements projects able to perform some basic image processing. The whole system is meant to represent a prototype of a Euclid-like project and it was essential to gain confidence with the environment. Furthermore, these projects were the building blocks of the Euclid-like pipeline, used to test the workflow managers against the chosen metrics. The software was designed to follow the Object Oriented Programming (OOP) paradigm and to be easily adaptable to new packages or tools to perform detection, deblending and photometry. As a whole, the software consists of three main Elements projects and one supporting project with utilities. They are: • PT_MuSoDetection, • PT_MuSoDeblending, • PT_MuSoPhotometry, • PT_MuSoUtils.
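To give a concrete idea of this design, the following minimal sketch shows how an abstract Program class with an abstract run method can be specialized to wrap a legacy tool through subprocess. The Program and run names follow the description in the text, while SExtractorProgram, its constructor arguments and the command-line layout are simplified, hypothetical stand-ins and not the actual code of the PT_MuSo projects.

import subprocess
from abc import ABC, abstractmethod


class Program(ABC):
    """Common interface for any external image processing tool."""

    @abstractmethod
    def run(self, *args, **kwargs):
        """Execute the tool and produce its outputs."""


class SExtractorProgram(Program):
    """Wraps a tool with no Python bindings by spawning a new process."""

    def __init__(self, command):
        # 'command' is the terminal command of the legacy tool,
        # e.g. the one read from the JSON configuration described later.
        self.command = command

    def run(self, image_path, config_args=None):
        # Build the command line: tool name, input image and any
        # extra configuration parameters, then run it as a subprocess.
        cmd = [self.command, image_path] + (config_args or [])
        return subprocess.run(cmd, check=True)

With this layout, supporting a different extraction tool only requires a new subclass implementing run, which is exactly the extension point described below.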
  • 32. Indeed, in order to use another image processing tool, it is sufficient to extend the Python class Program located in the PT_MuSoUtils project, at the path PT_MuSoCore/python/PT_MuSoCore/Program.py, and implement the abstract method run with the proper logic. Each software module developed for the catalog extraction uses SExtractor5 as an image processing tool. This software, created by Emmanuel Bertin, has the main purpose of extracting an object catalog from astronomical images [12] through the execution of detection, deblending and photometry. It is part of the set of legacy software, which, in this context, indicates external software that is officially integrated within EDEN. Since SExtractor isn't available in Python, the subprocess module was used in order to create a new process that calls the software through a bash command. The final code is in the SExtractor class. 3.4.1 Developing and Versioning Methods Git6 was used as the version control system, following as much as possible a clean and easy to understand development workflow. It was chosen to follow the Gitflow Workflow, defined for the first time by Vincent Driessen (see fig. 3.1) and later adopted as part of the Atlassian guide7 . This type of workflow involves a main branch, by convention called master, which includes all the commits that represent an incremental version of the software. In fact, no developer can work directly on the master, which can only be merged with other support branches, in particular a release (for last tweaks and version number changes) or a hotfix (used to fix critical issues in the production software). When changes are merged into the master, by definition that becomes a new product release [13]. As regards the versioning system, semantic versioning was applied to give a standard enumeration for each project version. A version number is made of three numbers which in turn represent a major release, a minor release and a patch. Each number has to
  • 33. Figure 3.1: A successful git branching model. This workflow was followed during the Elements projects development. It favors a clear development path that facilitates the creation of a software product, especially within large teams. However, it remains a good pattern to follow even during autonomous development. Source: Git Branching - Branching Workflows [14].
  • 34. be incremented according to the following scheme [15]: • Major version: when you make incompatible API changes. • Minor version: when you add functionality in a backwards-compatible manner. • Patch version: when you make backwards-compatible bug fixes. The second branch that is always present in a Gitflow repository is develop. Changes that are believed to be stable and ready for a future release are added to it. In addition to the release and hotfix branches, a branch of type feature can be used to develop new features that, once completed and tested, are added to the develop branch. 3.4.2 Detection The detection phase occurs by thresholding. Before proceeding, however, a filtering step is necessary to smooth the noise that would otherwise cause false positives: a Gaussian filter with a standard deviation calibrated for the particular case is used. Moreover, in astronomical images there is often a non-uniform background that could generate ambiguities when the thresholding is applied, especially in the most crowded areas. For this reason, as a first step towards segmentation, a background map is generated that estimates the light outside the objects to be detected [16]. This map is then subtracted from the image. As output of the detection we obtain an image partitioned into N regions, the first of which, marked by pixels with value 0, is the background, while the remaining N-1 regions are the extracted objects and are labeled with values from 1 to N-1.
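As an illustration of these steps (Gaussian smoothing, background subtraction, thresholding and labeling), here is a minimal, generic sketch using numpy and scipy. It is only meant to clarify the idea: the actual detection in this work is performed by SExtractor, and the function name, the absolute threshold and the fixed sigma are simplifying assumptions.

import numpy as np
from scipy import ndimage


def detect_sources(image, background, sigma=2.0, threshold=5.0):
    """Toy detection: smooth, subtract background, threshold, label."""
    # Subtract the background map and smooth with a Gaussian filter
    # to suppress pixel noise that would produce false positives.
    smoothed = ndimage.gaussian_filter(image - background, sigma=sigma)
    # Keep only pixels significantly above the residual background.
    mask = smoothed > threshold
    # Connected groups of selected pixels become one labeled region each;
    # label 0 marks the background, labels 1..n_objects the sources.
    segmentation_map, n_objects = ndimage.label(mask)
    return segmentation_map, n_objects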
  • 35. Figure 3.2: Multithreshold deblending. This technique aims to deblend overlapping sources that have been detected as a single object. Source: Bertin [16]. 3.4.3 Deblending After detection, the segmented image is subjected to a filtering phase that aims to identify distinct overlapping objects that the first thresholding step has erroneously recognized as a single object. During this phase, composite objects are deblended using a multithreshold hierarchical method. A hint of how the algorithm works can be seen in fig. 3.2. 3.4.4 Photometry The purpose of this last phase is to measure the luminous flux belonging to each object identified after the deblending phase. For each set of pixels labeled with the same number, the sum of the pixel values is calculated to estimate the luminous flux. Some masks, called apertures, can be applied around objects to obtain different types of measurements (see fig. 3.3).
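The per-label flux integration just described can be sketched in a few lines of numpy/scipy. This is only an illustration of the principle: the measurements in this work are produced by SExtractor, which also applies the apertures shown in fig. 3.3, and measure_fluxes is a hypothetical helper name.

import numpy as np
from scipy import ndimage


def measure_fluxes(image, segmentation_map):
    """Toy photometry: sum the pixel values of each labeled object."""
    labels = np.arange(1, segmentation_map.max() + 1)  # skip label 0 (background)
    # ndimage.sum integrates the image values over each labeled region,
    # giving a rough estimate of the luminous flux of every source.
    fluxes = ndimage.sum(image, labels=segmentation_map, index=labels)
    return dict(zip(labels.tolist(), np.atleast_1d(fluxes).tolist()))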
  • 36. Figure 3.3: Four different types of apertures available in SExtractor for the photometric measurement. Source: Holwerda [17]. 3.5 Utils Module The PT_MuSoUtils module contains five important packages: • PT_MuSoCore, • IOUtils, • SkySim, • FITSUtils, • DataModel. Since each pipeline task can use a different program for performing its processing step, a system for passing the parameters independently of the rest of the code has been created. The configuration parameters can be specified in a JSON8 file that a parser developed for the occasion can read. 8 https://www.json.org
  • 37. The language was chosen for its human readable (and writable) structure and it is usually very easy to handle inside the code, with great support in Python (as in many other programming languages). Any program can have its own JSON file named <program>.json located in the path PT_MuSoUtils/PT_MuSoCore/auxdir/config. The structure of the file follows this base scheme:
{
    "program": {
        "full_name": <program_name>,
        "command_name": <cmd_key>
    },
    "configurations": {
        <configuration_type>: {
            <configuration_unit>: {
                <parameter_name>: <parameter_value>
            }
        }
    }
}
The configuration file is a JSON object with 2 names: program and configurations. The value of program is a nested JSON object that expects two keys with string values: • full_name: indicates the name of the program and is useful for logging purposes. • command_name: indicates the terminal command to call if the program is executed through subprocess. The value of configurations is a nested JSON object that accepts any number of items representing a configuration type. Each configuration type is a nested JSON object that accepts any number of configuration units. Each configuration unit in turn accepts any number of key/string items representing an input parameter for the application.
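A file with this scheme can be consumed with the standard json module. The sketch below shows the basic idea of turning one configuration unit into command-line arguments for the wrapped program; it is a hypothetical, simplified reader (load_program_config is not the real parser), and it deliberately skips the inherit mechanism and the {auxdir} constant described next.

import json


def load_program_config(path, conf_type, conf_unit):
    """Illustrative reader for a <program>.json file following the scheme above."""
    with open(path) as fp:
        config = json.load(fp)

    command = config["program"]["command_name"]
    params = config["configurations"][conf_type][conf_unit]

    # Flatten the key/value parameters into command-line arguments,
    # e.g. {"-CATALOG_TYPE": "ASCII_HEAD"} -> ["-CATALOG_TYPE", "ASCII_HEAD"].
    args = [command]
    for key, value in params.items():
        if key == "inherit":
            continue  # inheritance resolution is omitted in this sketch
        args.extend([key, value])
    return args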
  • 38. • configuration type: is a collection of configuration units, and it is intended to group all the parameters needed for a certain task, e.g. the detection or the deblending. • configuration unit: is a collection of parameters that can be used as a modular unit or a preset for some particular instance of the configuration type, e.g. the detection of light sources in a crowded region of the sky. A configuration unit can be called modular because other configuration units, within the same file, can inherit its parameters without the need to rewrite them, in an OOP fashion. In order to do so, it is sufficient to add the key inherit, which expects a JSON array as value. The items of the array are the keys of the configuration units from which to inherit the parameters. If the configuration unit conf_unit_1 in the configuration type conf_type_1 wants to inherit the configuration unit conf_unit_2 contained in the configuration type conf_type_2, then the value of the key inherit has to be "conf_type_2.conf_unit_2". For example:
"configurations": {
    "base": {
        "catalog_ascii": {
            "-CATALOG_TYPE": "ASCII_HEAD"
        }
    },
    "detection": {
        "background_checkimages": {
            "inherit": [
                "base.catalog_ascii"
            ],
            ...
        }
    }
}
If two parameters have the same key, only the last value encountered by the parser is preserved. Each type of configuration can be called by a different
  • 39. project and some of them require loading other external files from their own auxiliary directory. For this reason, a constant auxdir can be defined inside a configuration type and the parser will replace at run-time any sub-string matching "{auxdir}" with the value of the constant. 3.5.1 The Masking Task Since the deblending task is a refinement phase with respect to the detection, it needs to know the segmented image, but it cannot attempt to deblend an object if that object is made up of pixels that all have the same value and is no longer representative of the original source. For this reason, as input of the deblending, we must put the original image masked with the result of the detection. Pixels that have been labeled as background are brought to a value of -1000 in order not to interfere with the thresholding, while the remaining pixels are left unchanged. For this purpose an additional Python module was developed, called FITSMask. 3.5.2 FITS Images The Flexible Image Transport System (FITS) format was created for sharing astronomical images among observatories. The control authority for the format is the International Astronomical Union - FITS Working Group9 (IAU-FWG). The need for this standard stems from the difficulty of standardizing the format among all observatories with different characteristics and the impossibility of creating adapters among all the different formats. Consequently, a standard was created such that every observatory is able to transform data from FITS to its own internal format, and vice versa. A FITS file is composed of blocks of 2880 bytes, organized in a sequence of Header and Data Units (HDUs), possibly followed by special records. The 9 https://fits.gsfc.nasa.gov/iaufwg
  • 40. header consists of one or more blocks of 2880 bytes and contains stand-alone information or metadata that describes the subordinated unit of data [18]. To manipulate this format several Python modules are available, such as astropy.io.fits, used in this work within the FITSUtils module. This was a necessary step for the Euclid-like pipeline because the mission uses this file format for storing and distributing the data collected by the space telescope. 3.5.3 SkySim As the final part of the Euclid-like pipeline, SkySim, a basic sky simulator, was developed, which has the capability of generating synthetic astronomical images starting from a source catalog. The only purpose of this module was to briefly validate the pipeline and check if the extracted catalog was consistent with the one put as input. First, an image with no sources was generated (fig. 3.4), followed by an image with one source (fig. 3.5), with two identical sources (fig. 3.6), with two sources with different magnitude (fig. 3.7), with two overlapping sources (fig. 3.8) and finally with two overlapping sources with different magnitude (fig. 3.9). Each of these tests gave a positive result and the pipeline performed as expected, extracting the right number of sources with the correct photometric estimation. Real images are typically degraded by noise. SkySim has the ability to add additive noise with Gaussian distribution to an image according to a (a) Original (b) Deblended (c) Photometry Figure 3.4: No sources
  • 41. (a) Original (b) Deblended (c) Photometry Figure 3.5: One source (a) Original (b) Deblended (c) Photometry Figure 3.6: Two identical sources (a) Original (b) Deblended (c) Photometry Figure 3.7: Two sources with different magnitude (a) Original (b) Deblended (c) Photometry Figure 3.8: Two identical overlapping sources
  • 42. (a) Original (b) Deblended (c) Photometry Figure 3.9: Two overlapping sources with different magnitude predetermined Signal-to-Noise Ratio (SNR) value in dB. A normal Gaussian noise is defined by its mean µN and its standard deviation σN. Given the desired SNR to obtain in the synthetic image I, it is possible to calculate the σN of the Gaussian noise to add: σN = sqrt( var(I) / 10^(SNR/10) ). By adding pixel-by-pixel the values of the matrices that represent the noise and the original image, SkySim outputs an image with the chosen SNR. 3.6 Pipeline Project The pipeline project is the part of the developed software that aims to define and build the Euclid-like pipeline. Two files are required in order to define a pipeline: the Package Definition and the Pipeline Script. In the package definition file, as required by the Euclid Pipeline Runner, four Executable Python objects were implemented, one for each task. Listing 3.1 shows the first task of the pipeline, which performs the detection phase. Its inputs and outputs are specified as paths relative to workdir. The Executables of masking (listing 3.2), deblending (listing 3.3) and photometry (listing 3.4) have been defined in a completely similar manner.
  • 43.
pt_sextractor_detection = Executable(
    command=' '.join([
        'E-Run PT_MuSoDetection 0.2',
        'SourceDetectorPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)
Listing 3.1: Definition of the detection task

mask_image = Executable(
    command=' '.join([
        'E-Run PT_MuSoUtils 0.2',
        'FITSMask',
        '--mask_value -1000'
    ]),
    inputs=[Input('fits_image'),
            Input('mask')],
    outputs=[Output('masked')]
)
Listing 3.2: Definition of the masking task

pt_sextractor_deblending = Executable(
    command=' '.join([
        'E-Run PT_MuSoDeblending 0.2',
        'SourceDeblenderPipeline',
        'sextractor',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image')],
    outputs=[Output('segmentation_map'),
             Output('catalog', mime_type='txt')]
)
Listing 3.3: Definition of the deblending task
  • 44.
pt_sextractor_photometry = Executable(
    command=' '.join([
        'E-Run PT_MuSoPhotometry 0.2',
        'SourcePhotometerPipeline',
        'sextractor',
        '--log-level debug',
        '--preset pipeline'
    ]),
    inputs=[Input('fits_image'),
            Input('fits_image2'),
            Input('assoc_catalog')],
    outputs=[Output('apertures_map'),
             Output('catalog', mime_type='txt')]
)
Listing 3.4: Definition of the photometry task

In addition, the pipeline script file was created, in which the sources_extractor method, decorated with the @pipeline decorator provided by the Euclid framework, is defined (see listing 3.5). The logic inside specifies how to build the pipeline that will be executed by the Euclid Pipeline Runner.

@pipeline(outputs=('segmentation_map',
                   'apertures_map',
                   'catalog'))
def sources_extractor(image):
    seg_map, _ = pt_sextractor_detection(fits_image=image)
    masked = mask_image(fits_image=image, mask=seg_map)
    seg_map_deb, deb_catalog = pt_sextractor_deblending(fits_image=masked)
    apertures, catalog = pt_sextractor_photometry(fits_image=masked,
                                                  fits_image2=image,
                                                  assoc_catalog=deb_catalog)
    return seg_map_deb, apertures, catalog
Listing 3.5: Definition of the Euclid-like pipeline that will be executed by the EPR.
  • 45. Chapter 4 External Workflow Managers In this part of the thesis work the workflow managers are compared. As a starting point, it was decided to use the Euclid-like pipeline already developed and to create two more pipelines implemented by means of two external tools: Spotify's Luigi and Airbnb's Airflow. They have been chosen because they are written in Python, and thus EDEN compliant, are open-source and are very popular in the data flow domain. 4.1 Luigi Luigi1 is a Python package developed by the Spotify team and released in 2012 under the Apache License 2.02 . This tool helps to build pipelines of batch jobs [19]. Its features include workflow management, task scheduling and dependency resolution. One of Luigi's strengths is that it manages failures in a smart way, providing a built-in system for task status checking. In fact, if a task fails and has to be rescheduled and rerun, Luigi goes across the 1 https://github.com/spotify/luigi 2 The Apache License 2.0 is the second major release of the permissive free software license written by the Apache Software Foundation. See https://www.apache.org/licenses/LICENSE-2.0.
  • 46. dependency graph backwards until it encounters a successfully completed task. It then reschedules only the tasks downstream of that point in the graph, thus not scrapping the work done without failures. This can save a lot of time and computing resources if failures are not infrequent or if a pipeline shares some tasks already executed by another pipeline. 4.1.1 Pipeline Definition A Luigi pipeline is made of one or more tasks. For each task we can define an input, an output and the business logic to execute. The input is the set of parameters passed and the dependency list. The output is generally a file, called a target, that will be written to the local file-system or to a distributed one, such as the Hadoop Distributed File System (HDFS). Target The target is the product the task has to yield and it is defined through a file-system path. In Luigi we can find the class Target that represents this concept. In the default configuration, the existence of the file is the proof that the task executed successfully and determines whether its status is completed or not. It is possible, however, to override the default logic for deciding whether the task has to be considered done. Task A task consists of four fundamental parts: • Parameters: are the input arguments for the Task class and, together with the class name, uniquely identify the task. They are defined inside the class using a Parameter object or a subclass of it. • Dependencies: are defined through a task collection and set which other tasks have to be executed successfully before the current one can start. Such a collection is the object to return in the overridden method requires.
• 47. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 38 • Business logic: defined within the overridden method run. This part of the code is in charge of producing and storing the output of the task. • Outputs: defined through a collection of Target objects. Each target must point to the exact location of the file created by the business logic. Tasks Execution All the tasks defined in a Luigi pipeline are executed in the same process, which makes debugging straightforward but also sets a limit on the number of tasks a pipeline can be made of. Generally speaking, however, this does not represent a real problem until thousands of tasks are executed in the same pipeline [19]. The execution of a task follows these steps: 1. Check if the predicate that defines the completed status is satisfied. If it is, then check the next task in the graph. If the current task is the last one, the pipeline is completed. 2. Resolve all the dependencies. If one task from the dependencies is not completed, then execute that task. 3. Execute the run method. In order to start the entire pipeline, it is sufficient to call the last task defined in it and, thanks to the recursive algorithm, all tasks will be executed. Luigi does not come with an embedded triggering system, but one can be easily implemented.
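To make the structure just described concrete, the following minimal sketch shows a single Luigi task with the four parts above and how calling it triggers the recursive execution; the task name, parameter and target path are purely illustrative and are not taken from the thesis code.

import luigi

class DetectionTask(luigi.Task):
    # Parameters: together with the class name they uniquely identify the task
    fits_image = luigi.Parameter()

    def requires(self):
        # Dependencies: tasks that must complete successfully before this one
        return []

    def output(self):
        # Target: the file whose existence marks the task as completed
        return luigi.LocalTarget('detection_catalog.txt')

    def run(self):
        # Business logic: produce and store the output
        with self.output().open('w') as out:
            out.write('catalog extracted from %s\n' % self.fits_image)

if __name__ == '__main__':
    # Calling the last task of the pipeline is enough: Luigi resolves the
    # dependency graph recursively and runs only what is not completed yet
    luigi.build([DetectionTask(fits_image='image.fits')], local_scheduler=True)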
• 48. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 39 4.1.2 Integration with Elements It is not obvious that two frameworks can operate together and be easily integrated. In this section we will see how the author proposed to accomplish the task of developing a Luigi pipeline that executes the Elements projects previously written and described in Chapter 3. Task Implementation In the case of this work, the logic to execute is essentially a call to the Elements project through the bash command E-Run. The command is preset in EDEN and is associated with the execution of the script that starts the Elements execution. The ExternalProgramTask class, a subclass of Task, was then used. This class is part of Luigi's contrib module: it manages the logic of the run method and exposes another method, program_args, whose return value is the list of strings that will be passed as arguments to the subprocess.Popen class. It was chosen to implement the pipeline following the behavior of EPR, so as to obtain a system that is similar to use. As input, the path of the working directory and the data model file are required. Optionally, an id can be specified as a parameter to differentiate otherwise identical tasks (mainly for test and debug purposes). Each task has its own target file defined inside it; the path must be relative to the working directory. The return value of the program_args method is simply the command string seen in listings 3.1 - 3.4.
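As a rough illustration of this integration, the sketch below wraps one Elements project in an ExternalProgramTask. The project name, command options and target file name are placeholders modeled on listing 4.2 rather than the exact thesis code, so treat them as assumptions.

import os
import luigi
from luigi.contrib.external_program import ExternalProgramTask

class DetectionTask(ExternalProgramTask):
    # Working directory and input image; an optional id parameter could be
    # added to differentiate otherwise identical tasks.
    workdir = luigi.Parameter()
    fits_image = luigi.Parameter()

    def output(self):
        # Target file, defined relative to the working directory
        return luigi.LocalTarget(os.path.join(self.workdir, 'segmentation_map.fits'))

    def program_args(self):
        # This list is handed to subprocess.Popen by ExternalProgramTask.run
        return ['E-Run', 'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline', 'sextractor',
                '--preset', 'pipeline',
                '--workdir', self.workdir,
                '--fits_image', self.fits_image,
                '--segmentation_map', self.output().path]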
• 49. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 40 4.2 Airflow Airflow3 is a Python package developed by Airbnb and released in 2015 under the Apache License 2.0. In March 2016 the project joined the Apache Software Foundation's incubation program4 and since then it has been growing very quickly. The Incubator project is the path to becoming part of the Apache Software Foundation (ASF) for projects that want to contribute to the Apache foundation: all the code that will become part of the ASF must first pass through an incubation period within this project [20]. Airflow is a tool for describing, executing, and monitoring workflows. One of Airflow's strengths is the simplicity with which we can define the pipeline, though it does not offer a native way to specify how the intermediate results are managed. Moreover, it presents a great interactive graphical user interface that makes it easy to monitor the execution progress and state. 4.2.1 DAG Definition In Airflow every pipeline is defined as a Directed Acyclic Graph (DAG). As a matter of fact, the tool offers a Python class DAG that contains all the information needed for the execution of the tasks, such as the dag_id, the dependency graph, the start time, the scheduling period, the number of retries allowed and many other options. A task is a node of the dependency graph and is coded as an object of the class BaseOperator. That class is abstract and is designed to be inherited in order to define one of the three main operator types: • Action operators, which perform an action or trigger one • Transfer operators, in charge of moving data from one system to another • Sensor operators, which run until a certain criterion is satisfied, such as the existence of a file or a given time of day being reached. 4.2.2 Pipelining Elements Projects As done with Luigi and described in Section 4.1, the author developed an Airflow pipeline inside the Elements framework for executing the projects described in Chapter 3. First a DAG object has to be initialized with a few mandatory arguments as input: id, default arguments and schedule interval. Listing 4.1 shows how the DAG can be created. 3 https://airflow.apache.org 4 https://incubator.apache.org.
• 50. CHAPTER 4. EXTERNAL WORKFLOW MANAGERS 41

default_args = {
    'owner': 'airflow',
    'start_date': datetime.utcnow(),
    'email': ['airflow@example.com'],
    'retries': 1,
    'retry_delay': timedelta(minutes=1)
}

dag = DAG(dag_id='catalog_extractor', default_args=default_args,
          schedule_interval='@once')

Listing 4.1: Definition of the main DAG with Airflow for the catalog extractor pipeline.

To define a task performed by a bash command, there is the BashOperator object, which extends the BaseOperator. The string representing the bash command to be executed is passed as input to the operator. An example is shown in listing 4.2.

cmd = ' '.join(['E-Run',
                'PT_MuSoDetection', '0.2',
                'SourceDetectorPipeline',
                'sextractor',
                '--preset', 'pipeline',
                '--workdir', work_dir,
                '--logdir', logdir,
                '--fits_image', input_file,
                '--segmentation_map', detect_fits,
                '--catalog', detect_cat
                ])

detection_task = BashOperator(
    task_id='detection',
    bash_command=cmd,
    dag=dag)

Listing 4.2: Definition of the detection task with Airflow.

After defining the tasks, there are several ways to define the dependency graph. In this case it was chosen to use the shift operator, which is overloaded in BaseOperator to define the concatenation.
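For illustration, with four operators named after the tasks of this pipeline (names assumed here, not quoted from the thesis code), a linear dependency chain can be declared with the shift operator as follows:

# Each task starts only after the previous one has completed successfully;
# >> is the shift operator overloaded by BaseOperator to set dependencies.
detection_task >> masking_task >> deblending_task >> photometry_task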
• 51. Chapter 5 Comparison The purpose of this work is to make a comparison between workflow managers coming from different contexts and to verify, through a test on the Euclid case study, whether such external tools can be adopted as an integral part of the system. Four Elements projects have been developed that act as building blocks for the pipelines, and the EPR, Luigi and Airflow workflow managers have been studied. In this chapter we will propose ten metrics to evaluate whether Luigi or Airflow could be used within the mission. Subsequently, the executions necessary for the comparison will be performed and the results will be analyzed. 5.1 Metrics At this point of the work everything is set to execute the pipelines and gather the data needed for the comparison. As comparison metrics we proposed: • Execution time: an indicator of the overhead of the tool; it shows how the scheduler handles the executions. 42
• 52. CHAPTER 5. COMPARISON 43 • RAM memory usage: gives a measure of the memory required by the tool. It is important because the resources are limited. • CPU usage: gives a measure of the CPU required by the tool. It is important because the resources are limited. • Error handling: shows how the tool can handle failures inside the system. It is critical for automatic recovery and for minimizing human intervention. • Usability and configuration complexity: shows how complex the tool is to use and configure. • Distributed computing management: shows the capability of the tool to execute tasks in a distributed environment. • Workflow visualization: important for monitoring the execution progress and spotting any problem inside the system. • Integration in a framework: shows whether the tool is easy to use inside an existing framework, which is critical for the current case study. • Triggering system: shows whether the tool is capable of automating the triggering of executions. • Logging quality: shows whether the logs produced are actually useful for the developers to debug the system in case of failure. At the end of the data collection we will draw the conclusions by comparing the two types of workflow manager. The machine used for the tests has an Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz, with 4 cores dedicated to the virtual machine and a total dedicated RAM of 10.46 GB.
• 53. CHAPTER 5. COMPARISON 44 5.2 Execution Time In order to profile the three workflow managers with respect to time, it was sufficient to use the Linux built-in command time and some generated logs. Each tool was set up to run with no delay and minimum waiting time, and was tested with three different images given as input to the pipeline: a small one of 252x252 pixels, a large one of 5000x5000 pixels and another small image of 256x256 pixels generated by SkySim, referred to as simulated. All the results are averaged over 10 runs. The time command reports three numbers: real time, user time and sys time. Their meanings are [21]: • real: or wall clock time, indicates the total time elapsed from start to finish. • user: CPU time spent by the process outside the Linux kernel, i.e. in user mode. In user mode the process cannot directly access hardware or reference memory; code running in this mode can perform lower level accesses only through system APIs. • sys: CPU time spent by the process inside the Linux kernel, i.e. in kernel mode. In kernel mode the process can execute any CPU instruction and access any memory address without restriction. This mode is generally reserved for low-level functions of the operating system and must therefore consist of trusted code. Some privileged instructions can only be executed in kernel mode, such as interrupt handling and input/output management; if instructions of this type are executed in user mode, a trap is generated.
• 54. CHAPTER 5. COMPARISON 45

        EPR      Luigi    Airflow
real    40.419   16.665   38.325
user    21.258   14.570   21.194
sys      8.346    0.922    1.688

Table 5.1: Time needed for executing the scientific pipeline on the large image. Values are in seconds, divided into real, user and sys time.

        EPR      Luigi    Airflow
real    40.413    2.652   30.171
user    21.481    1.607    8.701
sys      8.649    0.298    1.022

Table 5.2: Time needed for executing the scientific pipeline on the small image. Values are in seconds, divided into real, user and sys time.

        EPR      Luigi    Airflow
real    40.409    2.634   30.008
user    21.871    1.781    8.526
sys      8.816    0.347    1.181

Table 5.3: Time needed for executing the scientific pipeline on the simulated image. Values are in seconds, divided into real, user and sys time.

From tables 5.1, 5.2 and 5.3 we can notice three main differences. First of all, EPR executes all three pipeline instances in roughly the same time, due to internal scheduling settings that are not meant to be changed by the user. Secondly, although EPR and Airflow perform about the same with the large image as input, in the case of the two smaller images Airflow completes the pipeline faster, reducing the real time in accordance with the lower user and sys times. This means that Airflow keeps its overhead steady in each execution. EPR, on the other hand, always needs the same real, user and sys times in order to complete all the tasks of the pipeline. This leads to the conclusion that EPR has fixed time slots in which it performs the tasks, and this behavior cannot be modified. It must also be noted that its sys time is consistently much higher than that of the other tools. Finally, Luigi confirms itself as the most lightweight workflow manager among the three from a time perspective: its scheduler executes each task right away, as soon as enough system resources are available.
• 55. CHAPTER 5. COMPARISON 46 5.3 Memory Usage As the memory profiling tool, the Python memory profiler1 was chosen: a module, based on psutil2, for recording the memory consumption of a process and its children. It was used in the time-based configuration, where the memory usage is plotted as a function of the execution time. Indeed, it is possible to plot the data directly after recording them by means of the same module. In order to start the monitoring, the command mprof run <script> can be used. After the script execution is done, the command mprof plot shows the plot of the last recorded run. Because all pipelines are built with Elements projects as tasks, all of them use at least one subprocess in order to run. For this reason the flag --include-children (or -C for short) was set first, and then --multiprocess (or -M for short), to tell memory profiler to consider all the children created by the main process. The include-children flag adds up the memory used by the main process and all its children, obtaining a single comprehensive value at each instant. The multiprocess flag considers all children independently, keeping the memory usage data separated for each of them. In this section MB will be used as a synonym of MiB, equivalent to 2^20 bytes, and GB as a synonym of GiB, or 2^30 bytes. This method was used to profile all three workflow managers, although it was not possible to obtain meaningful results from the EPR execution, due to its implementation; the workaround will be described later in this chapter. After profiling the three workflow managers, each one executing the scientific pipeline with the three images as inputs, the plots in figs. 5.1 - 5.9 were obtained. Figures 5.4 - 5.9 show, as expected, an increase in memory usage due to the image processing in correspondence with the execution of the four tasks. On the other hand, in figures 5.1 - 5.3, which plot the EPR RAM usage, it seems that no execution is detected. Comparing fig. 5.1b with figs. 5.4b and 5.7b, it is evident that the peak memory usage is not compatible with the amount required by the second task execution (about 350 MB). Furthermore, 1 https://pypi.org/project/memory-profiler 2 https://pypi.org/project/psutil
• 56. CHAPTER 5. COMPARISON 47 the three executions, which have to process images of significantly different sizes, seem to require the same amount of memory, with an average of roughly 45 MB. Analyzing the data gathered per process, we can see that every execution presents a main process of 38 MB and a main child process of 13 MB. One possible explanation for this behaviour could be the resource limitation imposed by the EPR, but this limit was set to 1 GB, well above the values obtained and the amount needed by the tasks. Consequently, a deeper study of the Python source code related to scheduling and pipeline execution in the Euclid software was conducted. It was found out that the tool is partly built by means of the package Twisted3 and its reactor component, to which the actual task execution is delegated. The reactor works in the background in a separate thread and communicates with the main thread through a callback system. This behavior is not detected by the profiler, which records only the memory used by EPR and Twisted, explaining why all the plots share a common trend. This finding led to trying other profilers, but none seemed to work properly in this situation. It was then decided to develop a custom memory profiling tool based on the features of the top bash command, which is able to report both memory and CPU usage; this turned out to be very useful also in the CPU load analysis. 3 https://twistedmatrix.com
• 57. CHAPTER 5. COMPARISON 48 Figure 5.1: EPR RAM memory usage versus time (large image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. The memory usage is similar throughout the duration of the profiling. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Also in this case, memory allocations compatible with the processing of the image under examination are not observed.
• 58. CHAPTER 5. COMPARISON 49 Figure 5.2: EPR RAM memory usage versus time (small image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. The memory usage is similar throughout the duration of the profiling, as happens in fig. 5.1. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Also in this case, memory allocations compatible with the processing of the image under examination are not observed.
• 59. CHAPTER 5. COMPARISON 50 Figure 5.3: EPR RAM memory usage versus time (simulated image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. The memory usage is similar throughout the duration of the profiling, as happens in fig. 5.1. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Also in this case, memory allocations compatible with the processing of the image under examination are not observed.
• 60. CHAPTER 5. COMPARISON 51 Figure 5.4: Luigi RAM memory usage versus time (large image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. Three of them are SExtractor processes (in green), while the highest peak, in red, represents the memory usage of the Python masking project.
• 61. CHAPTER 5. COMPARISON 52 Figure 5.5: Luigi RAM memory usage versus time (small image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored.
• 62. CHAPTER 5. COMPARISON 53 Figure 5.6: Luigi RAM memory usage versus time (simulated image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored.
• 63. CHAPTER 5. COMPARISON 54 Figure 5.7: Airflow RAM memory usage versus time (large image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Airflow needs seven processes in order to execute the pipeline.
• 64. CHAPTER 5. COMPARISON 55 Figure 5.8: Airflow RAM memory usage versus time (small image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Airflow needs seven processes in order to execute the pipeline.
• 65. CHAPTER 5. COMPARISON 56 Figure 5.9: Airflow RAM memory usage versus time (simulated image). (a) Total memory usage, i.e. the sum of the memory used by the main process and all its children. The horizontal dashed red line marks the maximum utilization reached during the execution, while the vertical one marks the instant at which that peak occurred. Four main peaks are distinctly visible; they are associated with the execution of the four tasks of the pipeline. (b) Memory usage divided by process: the main process is shown in black, while its children are colored. Airflow needs seven processes in order to execute the pipeline.
• 66. CHAPTER 5. COMPARISON 57 5.3.1 Top Based Profiling Tool Euclid Pipeline Runner has a structure that does not allow the execution to be profiled directly: it runs the tasks asynchronously through a Twisted job submission. The top command provides an overview of the resource usage of the whole system at a particular instant, both as overall statistics and divided by process. The visualization is interactive and it is possible to display the processes sorted by memory or CPU usage. Top also allows the time interval between samples to be specified, in this case 0.1 s. The flag -c tells top to show the command line associated with each process instead of just the program's name. This was necessary in order to include the right processes in the profiling: top shows a system snapshot, so it is crucial to manually filter all and only the wanted tasks. Finally, the -b flag was set to run the command in batch mode, to be used when the output has to be redirected into a file. Each sample was indeed written to a file, which was subsequently pre-processed until a Python array ready for the actual analysis was obtained. In the end, the bash command was top -d 0.1 -c -b > out.top. The pre-processing and processing phases were written in a Python script; a sketch of the pre-processing is given after figure 5.10. The steps used for the pre-processing were: 1. Load the content of the top file as a string; 2. Extract the lines containing the keywords that identify the processes to profile; 3. Apply an additional process filtering to remove unwanted items; 4. Replace multiple blank lines with a single one; 5. Strip the text to remove empty parts at the beginning and end of the file content; 6. Extract the columns, keeping only the memory and CPU load values; 7. Split the string using the blank line as separator, obtaining a sample array,
• 67. CHAPTER 5. COMPARISON 58 where a sample contains the values of one or more simultaneous processes; 8. Split the sample lines using the space as separator, obtaining an array of values for each process; 9. For each sample, sum the corresponding values of all the related processes, obtaining one value per parameter per sample. Figure 5.10: EPR memory usage versus time obtained with the custom profiler (large image). Unlike what we saw in fig. 5.1a, in this case the four peaks corresponding to the tasks execution are clearly visible.
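As a rough sketch of the pre-processing just listed (not the exact thesis script), the function below turns a top -b capture into one memory and one CPU value per sample. The keyword filter, the use of each snapshot header to separate samples instead of blank lines, and the column positions (RES as the 6th field and %CPU as the 9th of the default top layout) are assumptions that may need adjusting.

def parse_top_capture(path, keywords, mem_col=5, cpu_col=8):
    """Return one summed (memory in MiB, CPU in %) pair per top sample."""
    with open(path) as capture:
        # One chunk per snapshot; here the 'top - ' header is used as the
        # separator instead of the blank lines of the original steps.
        snapshots = capture.read().split('top - ')

    memory, cpu = [], []
    for snap in snapshots[1:]:
        mem_sum, cpu_sum = 0.0, 0.0
        for line in snap.splitlines():
            # Keep all and only the processes under profiling
            if not any(key in line for key in keywords):
                continue
            fields = line.split()
            res = fields[mem_col]
            if res.endswith('g'):          # RES reported in GiB
                mem_sum += float(res[:-1]) * 1024.0
            elif res.endswith('m'):        # RES reported in MiB
                mem_sum += float(res[:-1])
            else:                          # plain RES value in KiB
                mem_sum += float(res) / 1024.0
            cpu_sum += float(fields[cpu_col].replace(',', '.'))
        memory.append(mem_sum)
        cpu.append(cpu_sum)
    return memory, cpu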
• 68. CHAPTER 5. COMPARISON 59 Figure 5.11: EPR memory usage versus time obtained with the custom profiler (small image). Unlike what is shown in fig. 5.2a, in this case the four peaks corresponding to the tasks execution are clearly visible.
• 69. CHAPTER 5. COMPARISON 60 Figure 5.12: EPR memory usage versus time obtained with the custom profiler (simulated image). Unlike what is shown in fig. 5.3a, in this case the four peaks corresponding to the tasks execution are clearly visible.
• 70. CHAPTER 5. COMPARISON 61 The profiling was repeated also for Luigi and Airflow in order to check the reliability of the method: the results obtained with memory profiler and with this tool were compared, verifying their consistency. The plots obtained by means of the custom profiler are shown in figs. 5.13 and 5.14. The results proved to be consistent with those previously obtained.
• 71. CHAPTER 5. COMPARISON 62 Figure 5.13: Luigi RAM memory usage with the custom profiler: (a) large image, (b) small image, (c) simulated image. The results are consistent with what is shown in figs. 5.4 - 5.6.
• 72. CHAPTER 5. COMPARISON 63 Figure 5.14: Airflow RAM memory usage with the custom profiler: (a) large image, (b) small image, (c) simulated image. The results are consistent with what is shown in figs. 5.7 - 5.9.
• 73. CHAPTER 5. COMPARISON 64 5.4 CPU Usage For CPU profiling, the approach developed during the memory analysis was used. As already mentioned, the use of top was a big help also for gathering data about CPU usage; a Python array of CPU usage values was therefore obtained in the same manner described in subsection 5.3.1. Within the developed pipeline, each task has to run sequentially. The percentage of CPU load is referred to the 4 cores dedicated to the virtual machine and, because of that, we expect to see an upper bound on the load at 25%. On the other hand, EPR runs on multiple processes, so there is some chance of seeing loads over 25%. As expected, the plots in fig. 5.15 show that the CPU usage of EPR can be as high as 43%, while in figs. 5.16 and 5.17 the task execution does not exceed the full usage of one core, except for some isolated peaks. Of course, both Luigi and Airflow can work in parallel on a multi-core machine if the pipeline and the tasks allow it. The interesting fact to note is how the resources are used during the idle time, with a noticeable difference between Airflow and EPR: the first uses negligible or no CPU resources when no task is executed, while the other seems to require a constant amount of CPU. The cause of this behavior is the cyclic polling needed by EPR to check the execution status of the tasks. Luigi, as shown in fig. 5.16, uses as many resources as possible with minimum idle time. This characteristic makes it the quickest workflow manager among the three, but it is an aspect to consider carefully when designing and launching the pipelines, in order not to saturate the resources available on the system. In all cases we can notice a peak in CPU usage due to the initialization of the scheduler and the web server.
• 74. CHAPTER 5. COMPARISON 65 Figure 5.15: EPR CPU usage: (a) large image, (b) small image, (c) simulated image. This workflow manager constantly occupies a certain amount of CPU to perform the periodic polling of the Reactor, the Twisted component to which the execution of the tasks is delegated.
• 75. CHAPTER 5. COMPARISON 66 Figure 5.16: Luigi CPU usage: (a) large image, (b) small image, (c) simulated image. Luigi exploits all the resources available to execute the tasks and has no waiting time.
• 76. CHAPTER 5. COMPARISON 67 Figure 5.17: Airflow CPU usage: (a) large image, (b) small image, (c) simulated image. Airflow exploits all the resources available to execute the tasks and presents some waiting time.
• 77. CHAPTER 5. COMPARISON 68 5.5 Error Handling Error handling concerns how the workflow manager behaves in case of failure during the pipeline execution. An exception can be raised at any point of the pipeline execution and it is the workflow manager's duty to handle the situation and prevent or minimize repercussions on the system. Typically, two kinds of action can be taken: 1. Abort and cancel the job 2. Abort and reschedule the job Within the scientific context we are considering, it is rare for a failed pipeline to be no longer needed and thus cancelled. In fact, the scientific value of the output is preserved even if the result is available some time later, there being no deadline or real-time purpose. Ideally, then, we want every triggered pipeline to be completed successfully. For these reasons, we will analyze the behavior of the workflow managers in case of failure in the specific scenario where it is desirable that the pipeline is rescheduled until its completion. In order to simulate a failure, we introduced an error in the second task of the Euclid-like pipeline developed previously, making it impossible to complete successfully. Then, the behavior of each workflow manager in front of this situation was observed, noticing, as expected, a stop in the pipeline execution. EPR does not implement any recovery strategy in case of pipeline failure. This feature is foreseen in future versions of EDEN; for the moment the continuous deployment tool guarantees that a stable version of the pipeline is available. In the scientific context it is often necessary for an expert to validate the output, thus requiring some sort of human intervention in any case. However, an error management component brings an important automation that reduces the time needed to reset the environment and restart the pipeline. EPR notifies the user by highlighting in red the task of the dataflow graph where the error has
• 78. CHAPTER 5. COMPARISON 69 occurred and gives a log message with the traceback of the execution (fig. 5.18). From this state it is not possible to recover the pipeline execution: the user has to fix the issue and re-run the entire pipeline, performing all tasks again, including those already completed successfully.

Figure 5.18: EPR error notification. The node in red highlights the task where the failure occurred; at the same time the traceback of the execution is shown.

Luigi handles a task crash in a smarter way, keeping the work done successfully before the failure. As we can see in listing 5.1, Luigi notifies the user through the terminal and, at the same time, through the Graphical User Interface (GUI) (figs. 5.19 and 5.20).

===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 ran successfully:
    - 1 DetectionTask(<params>)
* 1 failed:
    - 1 MaskingTask(<params>)
* 2 were left pending, among these:
    * 2 had failed dependencies:
• 79. CHAPTER 5. COMPARISON 70 Figure 5.19: Luigi error notification on the task list page. The task where the error occurred is highlighted and a warning on the downstream tasks is shown.
• 80. CHAPTER 5. COMPARISON 71
    - 1 DeblendingTask(<params>)
    - 1 PhotometryTask(<params>)

This progress looks :( because there were failed tasks

===== Luigi Execution Summary =====

Listing 5.1: Luigi terminal output in case of task failure. In this case a scenario where the second task fails is simulated.

Figure 5.20: Luigi error notification on the dependency graph page. The task where the error occurred is highlighted (red dot).

In order to recover from this state, it is sufficient to fix the issue and re-run the pipeline without any other intervention. Luigi automatically checks each task, from the last to the first, and if the required target exists for that specific task, it recovers the pipeline execution from that point. This simple yet very efficient way of handling errors makes Luigi an interesting choice if the failure rate is high, and it is very handy in the development phase. During the test, we simply fixed the error in the second task and triggered the pipeline again. The execution was a success and the final output was generated as expected.
• 81. CHAPTER 5. COMPARISON 72 For the outputs generated after the recovery, see listing 5.2 and figs. 5.21 and 5.22.

Figure 5.21: Luigi recovered execution on the task list page. All tasks are green, which indicates a successful execution.

===== Luigi Execution Summary =====

Scheduled 4 tasks of which:
* 1 complete ones were encountered:
    - 1 DetectionTask(<params>)
* 3 ran successfully:
    - 1 DeblendingTask(<params>)
    - 1 MaskingTask(<params>)
    - 1 PhotometryTask(<params>)

This progress looks :) because there were no
failed tasks or missing dependencies
• 82. CHAPTER 5. COMPARISON 73
===== Luigi Execution Summary =====

Listing 5.2: Luigi terminal output after the pipeline recovery.

Figure 5.22: Luigi recovered execution on the dependency graph page. All tasks are green, which indicates a successful execution.

Another interesting feature that comes with this type of error handling is that, if a pipeline is incorrectly triggered after a successful execution, Luigi won't run any task and instead notifies the user that the wanted output is already available (see listing 5.3).

===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 complete ones were encountered:
    - 1 PhotometryTask(<params>)

Did not run any tasks
This progress looks :) because there were no
failed tasks or missing dependencies
• 83. CHAPTER 5. COMPARISON 74
===== Luigi Execution Summary =====

Listing 5.3: Luigi terminal output in case a completed pipeline is executed again.

Figure 5.23: Airflow error notification. Each column represents a pipeline run and each block is a task. The failed task is the red square, while the red circle represents the pipeline run that has not been successfully completed.

Airflow takes on an intermediate behavior, requiring some manual intervention to specify from where to recover the pipeline. All the notifications go through the web server, and one of the several screens showing the pipeline execution status can be seen in fig. 5.23. 5.6 Usability Usability takes into consideration the installation process and configuration, the pipeline definition and the execution. The setup taken into account is: web server and scheduler running in the background with the help of a systemd service, and manual pipeline startup.
• 84. CHAPTER 5. COMPARISON 75 5.6.1 EPR Installation and Configuration The Euclid Pipeline Runner needs these steps in order to be installed and configured: 1. Deploy to CVMFS or clone the libraries from the repository. 2. Define the configuration file for the systemd service in /etc/systemd/system/euclid-ial-wfm.service/local.conf. 3. Define the configuration file for the Interface Abstraction Layer4 (IAL) in /etc/euclid-ial/euclid_prs_app.cfg. 4. Define the server configuration file in /etc/euclid-ial/euclid_prs.cfg. 5. Load into the environment the two files /etc/profile.d/euclid.sh and /etc/euclid-ial/euclid_prs.cfg. 6. Load into the environment the variable EUCLID_PRS_CFG=/etc/euclid-ial/euclid_prs_app.cfg. 7. Start the service daemon through the bash commands systemctl enable euclid-ial-wfm and systemctl start euclid-ial-wfm. For the pipeline definition, the EPR needs the tasks to be defined with a bash command that in turn calls a Euclid project module. Two files have to be created to build the pipeline: • Package definition: where the tasks are built by wrapping the bash commands in an Executor object, defining also the inputs, the outputs and the maximum resources allowed to be allocated for the specific task. 4 An abstraction layer that allows the execution of data processing software in each SDC independently of the IT infrastructure
• 85. CHAPTER 5. COMPARISON 76 • Pipeline script: where the pipeline and its dependency graph are constructed by combining the executor objects outlined in the package definition. In this manner every task is standalone and reusable as it is in any pipeline. Therefore the EPR offers a good modular structure for tasks and pipelines and a built-in capability for input/output definition and parameter passing. 5.6.2 Luigi Installation and Configuration Luigi needs only three steps to be ready for use: 1. Install the Python package: pip install luigi. 2. Add luigid.service to /usr/lib/systemd/system, defining a standard systemd service configuration. 3. Start the web server service daemon through the bash commands systemctl enable luigid and systemctl start luigid. Luigi does not distinguish between task definition and pipeline construction: the dependencies are specified directly inside each task, tying them to a particular data flow. It is however possible to write the task's logic within an external function and reuse it in a modular way. Since the dependencies are explicit inside a task, Luigi offers a built-in system for defining inputs and outputs and for passing parameters throughout the pipeline. It does not guarantee clean code, because every parameter has to be passed as an input argument to each task downstream of the one that needs it, as sketched below. However, there are implementations that compensate for this lack, e.g. SciLuigi5. 5 https://github.com/pharmbio/sciluigi
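To illustrate the parameter-threading issue just mentioned, the toy sketch below (task names and target paths are invented for the example, not taken from the thesis code) shows a downstream task that must re-declare a parameter only in order to hand it back to its upstream dependency:

import luigi

class Detection(luigi.Task):
    # The parameter actually consumed by this task
    fits_image = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(self.fits_image + '.catalog')

    def run(self):
        with self.output().open('w') as out:
            out.write('catalog for %s\n' % self.fits_image)

class Photometry(luigi.Task):
    # The same parameter has to be re-declared here only so that it can be
    # passed back to the upstream task inside requires()
    fits_image = luigi.Parameter()

    def requires(self):
        return Detection(fits_image=self.fits_image)

    def output(self):
        return luigi.LocalTarget(self.fits_image + '.photometry')

    def run(self):
        with self.input().open() as catalog, self.output().open('w') as out:
            out.write('photometry from %d catalog lines\n' % len(catalog.readlines()))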