Graphing the Empirical Research Process: Toward Modular Empirical Research

[...] since the researcher(s) responsible for the operation will undoubtedly assign, for instance, different filenames for different outputs.
By contrast, the namespace for labeling vertices, that is, research operations, is potentially global and cuts across various governance boundaries. This makes vertex labeling more difficult. It is relatively easy to come up with a scheme to name research operations in an individual research project, even if the project is large and transcends organizational boundaries involving many different researchers. However, interesting opportunities arise if one could develop an infrastructure that provides globally unique and searchable identifiers for research operations (and, as we will discuss below, verify their integrity). This would allow, in principle, any research operation to reference outputs from any other operation in the global network of empirical research.
Figure 7. Research operation v4 uses two outputs from operation v2 and one from operation v1 as its inputs, and is independent of operation v3 (RPG2: vertices v1 to v4; edges e1, e2, e3).
27 October 2021
INFORMS Annual Meeting
Graphing the Empirical Research Process:
Toward Modular Empirical Research
Aleksi Aaltonen
aleksi@temple.edu

Motivation
The nature of empirical research varies considerably between academic fields.
Methodological plurality and varying practices between academic communities
make it difficult to understand the process by which empirical studies produce their
results beyond one’s own niche.
This is a problem because it:
1. Makes research less transparent and reproducible
2. Hinders the re-usability of intermediate outputs in the research process

Research vs. Software Development Practices
In software development, we glue together existing, well-tested and validated components while trying to write as little new code as possible.
In empirical research, we tend to start from scratch, from ‘raw’ data, and do everything ourselves.

The Aim of the Project
Design a rigorous approach for modeling empirical research processes without
sacrificing the diversity of research. To this end, I make three assumptions about
research:
1. Empirical research means producing a posteriori knowledge by justifying knowledge claims with
appropriately analyzed observations.
2. The observations are recorded on a relatively persistent medium as data.
3. An empirical study incorporates a process that starts from acquiring, simulating, or otherwise generating
data about a phenomenon of interest and then proceeds by performing analytical operations on the data.
The process can go through several iterations and dead ends until the data have
been transformed into a form in which they support a scholarly knowledge claim.

Definitions
Research process is a series of modular operations that transform data step-by-step into a form that
supports an a posteriori knowledge claim.
Modularity entails dividing a complex system into relatively independent components so that the
relationships between the components are easily governable.
Research operation is the basic unit (module) of a research process. Internally, a research operation
is a bundle of closely related data manipulations. Externally, the manipulations that make up the
operation are separated from all other operations so that the individual operation can be
understood in isolation.
Graph is a representation of a structure formed by vertices that may or may not be connected by
edges.

Minimal Graph Based Representation
That is, the status quo in the absence of a common language.
We can think of G1 as capturing the entire research process as one massive operation; that is, the process is collapsed into a tightly coupled bundle of inputs, data manipulations, and outputs.
An effective description of the process must then fall back on whatever idiosyncratic and field-specific practices are available to the researcher.
Obviously, a graph-based representation is here mostly superfluous…
Such practices may be shared within a specific academic community, but they are seldom pinned down as formally specified rules. It is possible that some research projects are naturally like G1, that is, inherently difficult to break down into modular operations. A graph-based representation would then seem superfluous, although it might still help in offering the research externally as inputs to other research processes.
More interestingly, G1 can be seen as the status quo in how research processes are described without a shared language: publications may include elaborate descriptions of the empirical work that was performed to produce the results, but these tend to lack a structure that would be immediately recognizable to fellow academics.
Figure 1. A minimal graph-based representation of an empirical research process (G1: a single vertex v1).

Vertices and Edges
We need a meaningful way to define the
vertices and edges of the research
process graph.
To account for the temporal order of
operations, we make the graph directed.
It may be tempting to think of edges as research operations and vertices as their outputs, yet this results in a number of problems that are worth exploring in some detail. To begin with, an edge must have a vertex at both of its ends, and hence a minimal graph-based model, let us call this G2 = (V2, E2), would be V2 = {v1, v2}, E2 = {e1}. To account for the temporal order of research operations, we also want to make G2 a directed graph in which the set of edges E2 is made of an ordered pair e1 = (v1, v2). Figure 2 illustrates G2.
Figure 2. Vertices as outputs/inputs and edges as research operations (G2: edge e1 from v1 to v2).
G2 is immediately less elegant than G1 as a starting point. It is difficult to map G2 to a real-world counterpart in a manner similar to G1. Also, G2 would seem to suggest that research starts with an already existing output, which does not make sense. Some material or events must of course exist for empirical research to start with, but the observations of any such entities become research data only through the actions of a researcher.

Vertices and Edges
It is tempting to think of edges as research operations and vertices as their inputs/outputs, but this results in several problems:
1. An edge must have vertices at both ends, which makes the minimal graph G2 less elegant than G1: it is difficult to map the former to a real-world counterpart in a similar manner to the latter.
2. G2 suggests that research starts with an already existing output, which does not make sense. Any observations become data only through the actions of a researcher.

Vertices and Edges
3. A research operation can use more than one output as its input, which is difficult to model if we define research operations as edges, as illustrated by G3 in Figure 3.
Figure 3. Research operation e1 with multiple inputs results in an ill-defined graph (G3: vertices v1, v2, v3, with e1 appearing twice).
Given the problems with G2 and G3, and the overall approach they represent, vertices are instead defined as research operations including their outputs.

Research Process Graph, RPG
Let us define vertices as research operations including their outputs, and edges as references to outputs of earlier research operations.
The edges are thus directed and defined as ordered pairs of operations, e = (v_tail, v_head), where v_tail, v_head ∈ V and E = {e1, e2, e3, ..., em}. The edges can only point to strictly preceding operations, since an operation can only use existing outputs as its inputs.
It follows that references (edges) point backward in time. This makes it intuitive to trace a research operation (its results) back to all those operations that contributed to it, and makes actual implementation of the graph more straightforward.
Figure 4 illustrates how RPG solves the problem of capturing multiple inputs in G3. Note that the minimal RPG = ((v1), ∅) is again isomorphic with our elegant starting point G1.
Figure 4. Research operation v3 references the outputs of two earlier operations v1 and v2 as its inputs (RPG1: vertices v1, v2, v3; edges e1, e2).
The edges of RPG record the order of operations for each path in the graph. However, problems can arise if we need to know the order of two operations that do not appear on the same path.
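The definition of vertices as operations and edges as backward-pointing references can be sketched in a few lines of Python. This is only an illustrative sketch, not an implementation from the talk: the class name `ResearchProcessGraph`, its methods, and the operation names are hypothetical, and positions are 0-based, so v1 in the figures corresponds to position 0.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResearchProcessGraph:
    """Sketch of an RPG: operations in completion order plus backward references."""

    vertices: list[str] = field(default_factory=list)  # operations, oldest first
    edges: list[tuple[int, int]] = field(default_factory=list)  # (tail, head) positions

    def add_operation(self, name: str, inputs: tuple[int, ...] = ()) -> int:
        """Append an operation that references earlier operations as its inputs."""
        tail = len(self.vertices)
        for head in inputs:
            # An operation can only use outputs that already exist (head < tail).
            if not 0 <= head < tail:
                raise ValueError(f"operation {name!r} cannot reference position {head}")
        self.vertices.append(name)
        self.edges.extend((tail, head) for head in inputs)
        return tail

    def trace_back(self, start: int) -> set[int]:
        """Return every earlier operation that contributed to `start`."""
        found: set[int] = set()
        stack = [start]
        while stack:
            current = stack.pop()
            for tail, head in self.edges:
                if tail == current and head not in found:
                    found.add(head)
                    stack.append(head)
        return found


# Hypothetical operations mirroring Figure 4: the third uses the first two as inputs.
rpg = ResearchProcessGraph()
v1 = rpg.add_operation("collect survey data")
v2 = rpg.add_operation("scrape platform logs")
v3 = rpg.add_operation("merge and model", inputs=(v1, v2))
print(rpg.trace_back(v3))
```

Because every edge must point to a smaller position, the graph is acyclic by construction, and a graph with a single operation and no edges corresponds to the minimal G1.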

Graphing the Research Process
[...] albeit the order of operations in V and other edges in E may rule out such a possibility. Figure 5 summarizes the definition of RPG.
1: RPG = (V, E)
2: V = (v1, v2, v3, …, vn), where n > 0
3: Research operations are added to V in a non-decreasing order according
   to their completion time.
5: E = {e1, e2, e3, …, em}, where m ≥ 0
6: e = (v_tail, v_head), where head < tail
7: RPG is a directed acyclic graph.
Figure 5. The definition of research process graph
RPG does not have to be a simple or connected graph. There may be parallel edges, that is, more than one reference between two operations in the case the latter uses two different outputs from the former. To distinguish between the parallel edges and, more generally, to help identify the output that is being referenced, we add a set of edge labels L^E. We also add a set of vertex labels L^V that allows us to construct unique identifiers for research operations and their outputs by combining vertex and edge labels together.
However, to allow mapping the graph to actual
research processes, we need identifiers for
research operations and references (edges).
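As a rough illustration of how labels could turn the graph into identifiers, the sketch below combines a vertex label with an edge label to name the output being referenced. The label values, the `/` separator, and the function name `output_identifier` are all assumptions made for this example; the talk does not prescribe a format.

```python
from collections import Counter

# Hypothetical vertex labels, one per research operation (position 0 is the earliest).
vertex_labels = ["clean-logs", "fit-model"]

# Each edge is (tail, head, edge_label). The two parallel edges below reference
# two different outputs of the same earlier operation.
edges = [
    (1, 0, "sessions.csv"),
    (1, 0, "users.csv"),
]


def output_identifier(edge: tuple) -> str:
    """Combine the label of the referenced operation with the edge label."""
    _tail, head, edge_label = edge
    return f"{vertex_labels[head]}/{edge_label}"


identifiers = [output_identifier(edge) for edge in edges]

# Parallel edges are distinguishable only if the combined identifiers are unique.
duplicates = [name for name, count in Counter(identifiers).items() if count > 1]
print(identifiers)
print(duplicates)
```

Edge labels only need to be unique within the operation they point to, whereas vertex labels would have to be unique across whatever scope the graph is meant to cover.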

Globally unique and searchable identifiers would allow, in principle, any research operation to reference outputs from any other operation in the global network of empirical research. Note that I am not talking about the papers published from the study but the empirical operations that were performed to come up with the results. This may sound like a far-fetched vision, yet it is exactly what happens in software development, where newly written code is typically a fraction of the total codebase of a new product.
8: L^V = {l^V_1, l^V_2, l^V_3, …, l^V_n}, where n is the number of vertices in RPG
9: L^E = {l^E_1, l^E_2, l^E_3, …, l^E_m}, where m is the number of edges in RPG
Figure 6. Adding vertex and edge labels to RPG
Isolated vertices and separate graph components can capture false starts and separate lines of inquiry that did not contribute to each other or did not lead to useful findings, but are still worthwhile to report as they may provide valuable resources to other studies. This leads to two important questions concerning the boundaries of RPG. First, one must choose whether to include in RPG false starts, all the iterations, and separate lines of inquiry [...]
Toward a Global Research Graph
The graph-based notation of the research process graph leaves many practical issues
open:
1. The identification of research operations and their outputs in practice
2. The degree of process modularization
3. Research operation metadata
4. The stability and verifiability of research operations

The Vision
Research process graphs could evolve into boundary objects in academic
communication that allow researchers to make intermediary outputs from
empirical operations broadly available to each other.
To explore the feasibility of a global research graph, future studies should:
1. Develop algorithms to create visually appealing ways to plot RPGs
2. Assess the idea of RPG with respect to existing infrastructures
3. Simulate the benefits of modular empirical research
