Scalable architectures for phenotype libraries

Building (inter)national phenotype libraries
S39
Martin Chapman
King’s College London
#AMIA2023
AMIA 2023 Annual Symposium | amia.org

Disclosure
I and my spouse/partner have no relevant relationships with commercial interests to disclose.

Learning objectives
After participating in this session the learner should be better able to:
• Understand how the software architecture, definition structure and hosting mechanisms
behind a phenotype library affect the accessibility of hosted phenotypes, and thus their
impact.

Overview
‘The definitions in a phenotype library can only have an impact if they are accessible at scale.’
By ‘accessible at scale’ we mean that definitions:
1. Can be downloaded from a library successfully by a large number of users (software
architecture)
2. Can be interpreted (and thus implemented) by a range of different users (with knowledge
of a range of different programming languages) working with a range of different datasets
(definition architecture)
3. Can be successfully located by a broad range of users (distribution architecture)

Running examples: Phenoflow and OHDSI
Throughout, I will refer to two phenotype libraries: one of our own (Phenoflow) and, for a broader
perspective, a popular third-party library developed by OHDSI.
(Note: I am not directly connected to OHDSI or an expert in their tools.)
Figure 1: Phenoflow phenotype library
Figure 2: OHDSI phenotype library overview

1. Software architecture

Building phenotype libraries
Phenotype libraries are, or use as a part of their ecosystems, web applications.
As such, we have a choice about how we build these applications.
If we don’t build the application in a suitable way, we may fail to actually get phenotype
definitions to people when, for example, there is high demand on a library.
In other words, the library may fail to scale.

Research software vs. user software
Why may phenotype libraries not be built in a suitable way?
Phenotype libraries are often built by researchers, who may have experiences, preferences and
goals that differ from those that would lead to ensuring a library is scalable.
For example, researchers are often familiar with and favour the use of languages like Python,
which does not necessarily scale as well as languages like V8-compiled JavaScript.
Overall, there is often a tension between research software requirements and the requirements
of software that is suitable for (large numbers of) users.

Microservices
To try and balance these requirements we can consider how to structure our software.
Figure 3: Example customer information microservice. Newman, 2019
A microservice design approach suggests that a system should be separated into individual
communicating services (often via HTTP), each of which provides a single piece of overall
system functionality.
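As a rough illustration of "a single piece of overall system functionality" exposed over HTTP, the following is a minimal sketch using only the Python standard library. The service name, the in-memory DEFINITIONS store and its contents are assumptions made for the example; they do not describe how Phenoflow or any other library is actually implemented.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store; a real library would sit on a database.
DEFINITIONS = {"covid19": {"id": "covid19", "steps": ["load", "icd10"]}}


class DefinitionHandler(BaseHTTPRequestHandler):
    """Provides one piece of functionality: looking up a definition by id."""

    def do_GET(self):
        definition = DEFINITIONS.get(self.path.strip("/"))
        status = 200 if definition else 404
        body = json.dumps(definition or {"error": "not found"}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_service(port=0):
    """Start the service in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), DefinitionHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_definition(base_url, definition_id):
    """What another service (e.g. a web portal) would do to use this one."""
    with urllib.request.urlopen(f"{base_url}/{definition_id}") as response:
        return json.load(response)
```

Because the only contract between services is the HTTP interface, the caller in fetch_definition does not need to know (or share) the language the definition service is written in.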

Impact of microservices
Because of the modularity of a microservice architecture, each service can be built using a
different language, allowing languages to be combined.
Therefore, user-facing components can be built using scalable languages, leaving
researchers to build the remaining components in languages that best suit them.
We refer to this as technological heterogeneity.
We also gain the ability to replicate components (scalability), isolate components with long
execution times in order to ensure the remainder of the system is not affected (resilience) and
replace components with minimal impact to the rest of the system (replaceability).
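Replication and resilience can be sketched as follows: several interchangeable copies of one service sit behind a dispatcher that rotates between them and skips any copy that fails. This is a simplified, in-process illustration under stated assumptions (ReplicatedService, make_replica and unavailable_replica are hypothetical names); in practice this role is usually played by a load balancer or container orchestrator rather than hand-written code.

```python
import itertools


class ReplicatedService:
    """Round-robin dispatch over interchangeable replicas of one service."""

    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._next_index = itertools.cycle(range(len(self._replicas)))

    def call(self, request):
        errors = []
        # Try each replica at most once, continuing from the current rotation.
        for _ in self._replicas:
            replica = self._replicas[next(self._next_index)]
            try:
                return replica(request)
            except Exception as error:  # a failed replica must not stop the rest
                errors.append(error)
        raise RuntimeError(f"all {len(errors)} replicas failed")


def make_replica(name):
    """Hypothetical healthy replica: tags the response with its own name."""
    return lambda request: f"{name}:{request}"


def unavailable_replica(request):
    """Hypothetical replica that is currently down."""
    raise ConnectionError("replica unavailable")
```

With replicas [make_replica("a"), unavailable_replica, make_replica("b")], successive calls are answered by "a", then "b" (the unavailable copy is skipped), then "a" again.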

Phenoflow architecture
Figure 4: Phenoflow’s microservice architecture. Components: Web Portal/API, Generator, Visualiser, Implementation Units and VC server; actors: Author(s) and User. Edge labels: customise; workflow, visualisation, implementation units; author, expand; data; workflow; visualisation.
Martin Chapman, Luke Rasmussen, et al. (2021). “Phenoflow: A Microservice Architecture for Portable Workflow-based Phenotype Definitions”. In: AMIA Joint Summits on Translational Science,
pp. 142–151

Phenoflow stack

OHDSI architecture
The OHDSI phenotype library leverages software like ATLAS, for which several key architectural
considerations have been made, particularly the use of containerisation (Docker) through
initiatives like Broadsea.
OHDSI software like HADES (a set of R packages for analytics) is also Dockerised.
https://github.com/OHDSI/Broadsea

(Slight detour...) CONSULT I
To show the benefits of a microservice approach when developing research software, we can
briefly consider CONSULT, a decision-support system for stroke patients that was developed
under this paradigm.
Figure 5: CONSULT’s dashboard interface
Figure 6: CONSULT’s chatbot interface

(Slight detour...) CONSULT II
Figure 7: CONSULT’s microservice architecture. Data sources: blood pressure (Withings API), pulse and activity (Garmin API), heart rate / ECG (Medibiosense API) and the EHR (EMIS). Services: Device Integration (Withings, Garmin, Vitalpatch), Sensor-FHIR converter, EHR Integration (EMIS), EHR-FHIR converter, FHIR Health Data Server, Message Passer, Dialogue Manager, Authentication Server, Provenance Server, Data Miner, Argumentation Engine, Chat Server and UI backend. Clients: tablet browser and PC browser. Edge labels: sensor data, EHR data, FHIR resources, processed patient data, patient data, substitution, credentials, goal, results, dialogue responses, data summaries and tips.

Scalability in practice
CONSULT’s user-facing components are built using scalable languages, while its remaining
components are built using languages that are more traditionally found in research software.
Its components can also be replicated.
We tested the ability for the CONSULT architecture (specifically its sensor integration
components) to respond to high load and obtained positive results.
Martin Chapman, Abigail G-Medhin, et al. (2022). “Using Microservices to Design Patient-facing Research Software”. In: Proceedings of the IEEE 18th International Conference on e-Science (e-Science), pp. 44–54
Figure 8: How CONSULT responds to high load vs. an emulated monolithic architecture (bar chart of average responses, Ok vs. Timeout, for the monolithic and CONSULT architectures; axis range 0 to 140000)

Microservices in industry
The use of microservices in industrial settings (and in the wider software development
community) is commonplace.
However (at least in our experience) research software does not adopt industry paradigms like
this.
Wider goal: encourage the use of established software engineering techniques when developing
research software.

2. Definition architecture

Phenotype definition challenges
• Phenotype definitions come in lots of different forms (flowcharts, text descriptions,
weights for a classifier, etc.) and lack standardisation. This reduces intelligibility and thus
phenotypic reproducibility at scale (how broadly the logic intended by the definition author
can be accurately implemented).
• Computable phenotypes often don’t exist at all. This affects phenotypic portability (the
effort associated with implementing a definition is high, limiting its adoption at scale).

We need standardised models to structure definitions.

We need mechanisms for generating and storing computable forms of definitions.

Phenoflow’s definition model I
A new Common Workflow Language (CWL)-based model for the definition of a phenotype:
Figure 9: CWL-based definition model (step) and implementation units*. A step has a number, group, id, description and type; an Input and an Output, each with an id and description (the Output also has an extension); and one or more implementation units (implementationUnitA, implementationUnitB, ...), each with a path, language and params. Layers: Abstract, Functional, Computational.
*the bits of code actually executed by definitions structured under this model; separate from the model itself.

Phenoflow’s definition model II
Model is separated into layers:
• Abstract: Expresses the logic of a phenotype through a set of simple sequential, potentially
nested steps, each of which is annotated with multiple descriptions. Emphasis on
intelligibility.
• Functional: Specifies the metadata of entities passed between the operations within the
abstract layer, e.g., the format of an intermediate cohort.
• Computational: Defines an environment for the execution of one or more implementation
units (e.g. a script, data pipeline module, etc.) for each step in the abstract layer. Inherently
supports implementation by providing a template for development in any language.
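One way to picture the three layers is as a small data structure. The sketch below is an assumption-laden illustration, not Phenoflow's actual CWL schema: the class names are hypothetical, only a subset of the fields named in Figure 9 is modelled, and the example values are those of the COVID-19 step shown in Figure 10.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataItem:
    """Functional layer: metadata for an entity passed between steps."""
    id: str
    description: str
    extension: str = ""  # e.g. "csv" for an output


@dataclass
class ImplementationUnit:
    """Computational layer: one executable implementation of a step."""
    path: str
    language: str
    params: str = ""


@dataclass
class Step:
    """Abstract layer fields, plus the functional and computational detail."""
    number: int
    id: str
    description: str
    type: str
    input: DataItem
    output: DataItem
    implementations: List[ImplementationUnit] = field(default_factory=list)

    def implementation_for(self, language: str) -> Optional[ImplementationUnit]:
        """Return the unit written in `language`, or None if there is none."""
        return next((u for u in self.implementations if u.language == language), None)


# Values taken from the COVID-19 example step (Figure 10).
icd10_step = Step(
    number=2,
    id="icd10",
    description="A case is identified in the presence of patients "
                "associated with the stated icd10 COVID-19 codes.",
    type="logic",
    input=DataItem("covid19_cohort", "Potential covid19 cases."),
    output=DataItem("covid19_cases_icd10",
                    "covid19 cases, as identified by icd10 coding.", "csv"),
    implementations=[ImplementationUnit("icd10.py", "python"),
                     ImplementationUnit("icd10.js", "javascript")],
)
```

The same abstract and functional description is shared by every implementation unit, which is what lets a user pick the language they know (here via implementation_for) without the logic of the phenotype changing.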

Phenoflow’s definition model III
Abstract: step 2, id icd10, type logic. Description: "A case is identified in the presence of
patients associated with the stated icd10 COVID-19 codes."
Functional:
• Input: covid19_cohort (Potential covid19 cases.)
• Output: covid19_cases_icd10 (covid19 cases, as identified by icd10 coding.), extension csv
Computational (implementation units):
icd10.py (language: python, params: -)
    for row in csv_reader:
        newRow = row.copy()
        for cell in row:
            if [value for value in row[cell].split(",") if value in codes]:
                newRow["covid19"] = "CASE"
    ...
icd10.js (language: javascript, params: -)
    for (row of csvData) {
        newRow = row.slice();
        for (cell of row) {
            if (cell.split(",").filter(code => codes.indexOf(code) > -1).length) {
                newRow.push("CASE");
    ...
Figure 10: Individual step of COVID-19 phenotype definition and implementation units.
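The Python excerpt in Figure 10 is elided, so the following is a self-contained sketch of what a complete unit along those lines might look like, not the actual icd10.py. The function name, the column names in the sample cohort and the code list (U07.1 and U07.2, the ICD-10 emergency codes for COVID-19) are assumptions made for illustration.

```python
import csv
import io

# Illustrative code list (assumption), not the definition's actual code list.
COVID19_ICD10_CODES = {"U07.1", "U07.2"}


def flag_cases(csv_text, codes=COVID19_ICD10_CODES):
    """Mark a row as a case if any cell contains one of the given codes."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        new_row = dict(row)
        for cell in row.values():
            # A cell may hold several comma-separated codes.
            if isinstance(cell, str) and any(
                value.strip() in codes for value in cell.split(",")
            ):
                new_row["covid19"] = "CASE"
        flagged.append(new_row)
    return flagged


# Hypothetical cohort: patient 1 has a COVID-19 code, patient 2 does not.
SAMPLE_COHORT = 'patient_id,diagnoses\n1,"J45.9,U07.1"\n2,I10\n'
```

Calling flag_cases(SAMPLE_COHORT) adds a "covid19" column with the value "CASE" to the first row only, mirroring the behaviour of the excerpt, which only annotates rows that match.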

Phenoflow definition parsing
In addition to providing an intelligible model that supports different implementations,
Phenoflow also actively parses definitions from a variety of sources (including the HDR UK
phenotype library) under this model, thus providing pre-made computable forms.
This also solves the ‘it’s only useful if it’s used’ issue often associated with any kind of model.
(Figure: Phenoflow’s microservice architecture, as shown in Figure 4.)

OHDSI OMOP phenotypes (cohorts)
The phenotypes found in the OHDSI phenotype library also have an expected structure, and thus
exhibit many of the same benefits.
While this structure is tied to the OMOP CDM, there is work going on around OMOP
interoperability.
Figure 11: An OHDSI phenotype definition (cohort)

Phenoflow connectors
The Phenoflow model imposes a number of other constraints, including:
• The first step must be of a connector type (currently load or external), designed to
extract data from a data source without performing any processing on that data, and pass
it to the second step.
• Other steps in a definition must describe the logic of the phenotype (the types are
currently boolean logic and generic logic, the latter supporting, for example, case
exclusion).
(Figure: OMOP and FHIR connectors, followed by Step 2 and Step 3.)
Together, these two elements of the model promote interoperability with a variety of different
data standards (including OMOP itself).
Martin Chapman, Luke V Rasmussen, et al. (2022). “Connecting computable phenotypes with multiple Health IT Standards using the Phenoflow library”. In: AMIA Clinical Informatics Conference
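The separation between a connector as the first step and logic in the remaining steps can be sketched as below. The function names are hypothetical and this is not Phenoflow's code; person_id and condition_source_value are OMOP CDM column names, and subject.reference and code.coding are fields of the FHIR Condition resource. The sketch also assumes each connector does a minimal reshaping into a common row format, which is a simplification made here so that the logic step can stay unaware of the source standard.

```python
def load_from_omop(records):
    """Hypothetical connector: OMOP-style condition rows to a common shape."""
    return [
        {"patient": r["person_id"], "code": r["condition_source_value"]}
        for r in records
    ]


def load_from_fhir(resources):
    """Hypothetical connector: FHIR Condition resources to the same shape."""
    return [
        {"patient": r["subject"]["reference"],
         "code": r["code"]["coding"][0]["code"]}
        for r in resources
    ]


def identify_cases(cohort, codes):
    """Logic step: knows nothing about where the cohort came from."""
    return sorted({entry["patient"] for entry in cohort if entry["code"] in codes})


def run_definition(connector, source, codes):
    """The first step is the connector; the remaining steps are pure logic."""
    return identify_cases(connector(source), codes)
```

Swapping load_from_omop for load_from_fhir changes only the first step; identify_cases is reused unchanged, which is the property that supports interoperability across data standards.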

3. Distribution architecture

Finding definitions
The ability to locate a phenotype definition is also a key part of its accessibility at scale.
When designing a library, we have a choice about where to host it and which existing platforms
and technologies to potentially connect to.
These choices can impact the discoverability of the definitions hosted.

Version control systems I
OHDSI’s phenotype library is (in part) hosted
on GitHub, a remote version control system
(VCS).
This neatly provides a mechanism to distribute
phenotypes and ensure they can be located (by
considering the FAIR principles), while at the
same time having important library features
available such as versioning.
Martin Chapman, Shahzad Mumtaz, et al. (2021). “Desiderata for the development of next-generation phenotype libraries”. In: GigaScience 10.9, pp. 1–13
Figure 12: OHDSI’s phenotype library on GitHub

Version control systems II
Inspired by OHDSI’s approach, Phenoflow is being migrated to a VCS-backed (GitHub) architecture.
Figure 13: Phenoflow’s new VCS-backed architecture. Components: API, Generator, Visualiser and GitHub; actors: Author(s) and User. Edge labels: query; link to workflow + implementation units and visualisation; author, expand; data; workflow; index workflows.
In doing so, we aim to leverage even more of the features provided by a VCS, including the use
of branches for different connectors.
Martin Chapman, Luke V Rasmussen, et al. (2023). “Using Version Control Systems to Support High-Quality Phenotype Definitions”. In: AMIA Joint Summits on Translational Science, p. 816
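To make the "branches for different connectors" idea concrete, the sketch below builds the raw-file address of a workflow on a given branch of a GitHub repository. The raw.githubusercontent.com URL pattern is GitHub's own; the organisation, repository, branch and file names are placeholders, and the one-branch-per-connector layout is an assumption about how such a repository might be organised, not a description of Phenoflow's actual repositories.

```python
from urllib.parse import quote


def raw_definition_url(owner, repo, ref, path):
    """Address of one file at a given branch, tag or commit on GitHub."""
    return (
        "https://raw.githubusercontent.com/"
        f"{quote(owner)}/{quote(repo)}/{quote(ref)}/{quote(path)}"
    )


def connector_urls(owner, repo, path, connectors):
    """Assumed layout: one branch per connector, same workflow file on each."""
    return {c: raw_definition_url(owner, repo, c, path) for c in connectors}


# Placeholder names throughout; only the URL pattern itself is real.
EXAMPLE_URLS = connector_urls(
    "example-org", "covid19-phenotype", "covid19.cwl", ["omop", "fhir"]
)
```

Because ref can also be a tag or commit hash, the same function gives a stable address for a specific version of a definition, which is one of the library features a VCS provides for free.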

Summary
It is important that the definitions we host in phenotype libraries are accessible at scale.
Definitions are accessible at scale if they can be easily located, downloaded and interpreted by
large numbers of users.
We can ensure this is the case by carefully considering how we structure phenotype libraries and
the definitions they contain.
If definitions are accessible they can have an impact. They can, for example, be reused,
ultimately supporting reproducible research.

Thank you!
