Tales from BioLand - Engineering Challenges in the World of Life Sciences

Tales from BioLand
Engineering Challenges in
the World of Life Sciences
Prof. Alfredo Benso
Politecnico di Torino, Torino, Italy
www.sysbio.polito.it
ICIIBMS - International Conference on
Intelligent Informatics and BioMedical
Sciences

Engineering is for BUILDING systems
Life Sciences are for UNDERSTANDING
(REVERSE ENGINEERING) life

• “Reverse engineering is the process of discovering the
functional principles of a device, object, or system
through analysis of its structure, function, and
operation”.
• “Reverse engineering is the process of extracting
knowledge from something and reproducing it or
reproducing anything, based on the extracted
information”
Reverse Engineering

Example
Enigma machine
DESIGNING AND REVERSE ENGINEERING A
SYSTEM ARE TWO OPPOSITE TASKS
WHOSE COMPLEXITY MAY DIFFER BY
ORDERS OF MAGNITUDE

I was once asked “So is it easier to
synthesize an organism than
understanding it?”
“What I cannot create, I do not understand”, Feynman 1988

Reductionism (Bottom-up)
• Understanding of the parts means
understanding of the whole
• Focus on parts
The properties of the whole
system can be explained in terms
of its parts.
Reverse Engineering
Reductionism vs holism
Holism (Top-down)
• To understand the whole we must
understand also the relations
between the parts in the whole
• Focus on relationships
The system cannot be explained by
component parts alone.
Instead, the system as a whole
determines how the parts behave.

Reverse Engineering
BOTTOM-UP / REDUCTIONIST
DATA driven
SYSTEMIC / TOP-DOWN /
HOLISTIC
MODEL driven
Pathway / biological networks
Genes

OUTLINE
The Methodology Challenge
The Data Challenge
The Modelling Challenge

Problem
DOES THE COMPLEXITY (IN MATHEMATICAL TERMS) OF
THE SYSTEM DRIVE THE METHODOLOGICAL APPROACH
TO BE USED TO UNDERSTAND (REVERSE ENGINEER)
IT?
The Methodology
Challenge
YES!!!!!!!

Linear vs Complex Systems

• We can consider the effect of each system component
(“variable”) separately, because the sum of their
effects equals the effect of their sum.
Linear Systems
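A minimal Python sketch of the superposition property stated above. The two response functions and their coefficients are hypothetical, chosen only for illustration: in the linear one the separate effects of the variables add up to their joint effect, while a single interaction term is enough to break this.

```python
def linear_response(x, y):
    """Hypothetical linear system: the output is a weighted sum of the inputs."""
    return 2.0 * x + 3.0 * y


def nonlinear_response(x, y):
    """Hypothetical non-linear system: the inputs also interact through a product."""
    return 2.0 * x + 3.0 * y + 4.0 * x * y


def separate_effects(f, x, y):
    """Effect of x alone plus effect of y alone, relative to the baseline f(0, 0)."""
    base = f(0.0, 0.0)
    return (f(x, 0.0) - base) + (f(0.0, y) - base)


def joint_effect(f, x, y):
    """Effect of changing x and y together, relative to the baseline f(0, 0)."""
    return f(x, y) - f(0.0, 0.0)


x, y = 1.5, 2.0
for name, f in [("linear", linear_response), ("non-linear", nonlinear_response)]:
    print(name, "sum of separate effects:", separate_effects(f, x, y),
          "joint effect:", joint_effect(f, x, y))
```

For the linear function the two numbers coincide, so each variable can be studied on its own. For the non-linear one they differ by the interaction term, which no one-variable-at-a-time study would reveal.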

• Enzyme / Substrate
• Simple pharmacokinetics
• Newtonian Physics
Linear Systems
Examples

• Linear systems are easy to understand also for
non-mathematicians and are also easy to visualize.
• For this reason a large part of the LS world (and the
medical one in particular) still reasons in linear
terms.
• Linearity is rare in biological systems!
Linear Systems

• Properties emerge from the interaction of
their parts (and cannot be predicted only
from the properties of the parts).
Complex Systems
• Complex Systems’ dynamics heavily depend on initial
conditions and perturbations (the butterfly effect…..)
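A small Python sketch of the dependence on initial conditions mentioned above. The logistic map with r = 4 is used here only as a standard textbook example of a chaotic system; it is an assumption of this sketch, not a model taken from the talk. Two trajectories start 1e-10 apart and are compared step by step.

```python
def logistic_map(x, r=4.0):
    """One step of the logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def trajectory(x0, steps):
    """Iterate the map 'steps' times starting from x0 and return all visited values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_map(xs[-1]))
    return xs


a = trajectory(0.2, 60)
b = trajectory(0.2 + 1e-10, 60)  # same system, tiny perturbation of the start
gaps = [abs(p - q) for p, q in zip(a, b)]

for n in (0, 10, 20, 30, 40, 50, 60):
    print(f"step {n:2d}: |difference| = {gaps[n]:.3e}")
```

The gap stays negligible for the first steps and then grows until the two trajectories are unrelated, although the rule generating them is fully deterministic.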

Complex Systems
Examples
Is MUSIC
discoverable by
studying the record
and the player
separately?
Is cell LIFE
understandable by
separately studying
its internal
components?

• Can we learn how a car engine works just by studying
(some of) its individual components separately?
Issue

Linear Systems ➜ Reductionism / Holism
Complex Systems ➜ Reductionism / Holism
It would NOT BE ABLE TO
IDENTIFY properties that
emerge from the interactions
between its parts
Linear vs Complex Systems

The Methodology
Challenge
CHALLENGES

• Middle ground between data driven and model driven
approaches
• Cross-fertilization between biology and physics,
computer science, mathematics, chemistry, and
engineering.

The role of
Systems Biology

raw DATA MODEL
The Systems Biology
Lifecycle
decomposition
and
localization
Dynamic
modeling
SIMULATION
New Hypothesis
New biological
questions
Model refinement
Data recomposition

• Systems in this context generally are modeled as large
networks of integrated components exhibiting non-linear
dynamical interactions.
• Protein-Protein-Interaction
• Gene Regulatory Networks
• Metabolic Networks
• Interactomes
A shift in perspective

A shift in perspective
• In 1998, Oltvai, a cell biologist, and Barabasi, a physicist
who was studying the structure of the Internet, were
neighbors in Chicago
• At the time, Barabasi had already shown that the Internet is a
non-random network, and that its connectivity structure
influences its function
• One year later, in 1999, they showed that the metabolic
pathways of yeast define a network whose structure is
very similar to that of the Internet.
Then…
• HUBs
• P53
• Motifs
• ….
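To make the idea of a non-random network with HUBs concrete, here is a short Python sketch that grows a network by preferential attachment (new nodes prefer to link to already well-connected nodes). This growth rule, the network size and the seed are assumptions chosen for illustration; the sketch does not reproduce the analysis of the yeast metabolic network.

```python
import random
from collections import Counter
from statistics import median


def preferential_attachment_graph(n_nodes, links_per_node, seed=0):
    """Grow an undirected graph where new nodes attach preferentially to high-degree nodes."""
    rng = random.Random(seed)
    core = links_per_node + 1
    # Start from a small fully connected core.
    edges = [(i, j) for i in range(core) for j in range(i)]
    # Every node appears here once per edge it has, so a uniform draw
    # from this list picks a node with probability proportional to its degree.
    endpoints = [node for edge in edges for node in edge]
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


def degrees(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


edges = preferential_attachment_graph(n_nodes=2000, links_per_node=2)
deg = degrees(edges)
print("median degree:", median(deg.values()))
print("five largest hubs (node, degree):", deg.most_common(5))
```

Most nodes end up with only a few links, while a handful of early nodes collect many of them: these are the hubs. In a purely random network of the same size, degrees would instead cluster around the average.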

• High throughput DATA (NGS, ‘omics’, imaging, …) is
the “facilitator”.
• The technology to create this data (Biotechnology) is the
key. The “wider” and the better we see, the better we can
understand how systems work.

Systems
Biology
Systems
Medicine
Personalized
Medicine

• Size
• Heterogeneity
• Curse of dimensionality
• Ownership / Ethics
• Falsification
Issues
The Data Challenge

Size
• Heterogeneity
• Falsification
Issues

The Cost of
Sequencing DNA Has
Fallen Over 100,000x
in the Last Ten Years
The Data Challenge: size

• Enormous Density
⁃ 1000x Ocean Water
• Highly Dynamic Microbial Ecology
⁃ Hundreds to Thousands of Species
• Horizontal Gene Transfer
• Adaptive Selection Pressures (Immune System)
⁃ Innate and Adaptive Immune System
⁃ Macrophages and Antimicrobial proteins
• Constantly Changing Environmental Pressures
⁃ Diet
⁃ Antibiotics
⁃ Pharmaceuticals
The human Microbiome

Your Microbiome is
Your “Near-Body” Environment
and its Cells
Contain 200-2000x
as Many DNA Genes
As Your Human Cells
More Microbe Cells Than Human Cells
DNA-bearing Cells in Your Body
[Bar chart: Human Genome vs. Metagenomics; axis from 0 to 3,500,000]

To Map Out the Dynamics of Autoimmune Microbiome Ecology
Couples Next Generation Genome Sequencers to Big Data Supercomputers
Source: Weizhong Li, UCSD
UCSD Team Used 25 CPU-years
to Compute
Comparative Gut Microbiomes
Starting From
2.7 Trillion DNA Bases
of Healthy and IBD Subjects samples
Illumina HiSeq 2000 at JCVI
SDSC Gordon Data Supercomputer
Using Machine Learning to Determine Major Differences
Between Gut Microbiome in Health and Disease,
Mehrdad Y. et al., IEEE International Conference on Big Data (December 5-8,
2016)

In a “Healthy” Gut Microbiome:
Large Taxonomy Variation, Low Protein Family Variation
Source: Nature, 486, 207-212 (2012)
Over 200 People

• Size
Heterogeneity
• Falsification
Issues

Issues
Sources
180+ Bio/related ONLINE
Databases [NAR 2016]
Custom / Proprietary data
Medical data
The Data Challenge: heterogeneity
• Many de-facto standards
• No compatibility
• Overlaps
• Validation / Quality

• Size
• Heterogeneity
Curse of dimensionality
• Falsification
Issues
Overfitting
Many variables & Few
samples
Machine Learning issues
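The "many variables & few samples" problem can be sketched in a few lines of Python. Everything here is synthetic and assumed for illustration: 5000 random binary features, 10 training samples, and labels that are pure coin flips, so there is nothing real to learn. Picking the single feature that best matches the training labels still looks like a strong predictor.

```python
import random


def random_dataset(n_samples, n_features, rng):
    """Each sample is an int whose bits are the binary features; labels are coin flips."""
    X = [rng.getrandbits(n_features) for _ in range(n_samples)]
    y = [rng.getrandbits(1) for _ in range(n_samples)]
    return X, y


def feature_accuracy(X, y, j):
    """Fraction of samples whose feature j equals the label."""
    hits = sum(1 for row, label in zip(X, y) if ((row >> j) & 1) == label)
    return hits / len(y)


rng = random.Random(42)
n_features = 5000
X_train, y_train = random_dataset(10, n_features, rng)     # few samples
X_test, y_test = random_dataset(1000, n_features, rng)     # fresh data

# "Training": keep the feature that best reproduces the training labels.
best = max(range(n_features), key=lambda j: feature_accuracy(X_train, y_train, j))
train_acc = feature_accuracy(X_train, y_train, best)
test_acc = feature_accuracy(X_test, y_test, best)

print("selected feature:", best)
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The selected feature fits the 10 training samples almost perfectly by chance alone, but on new data it performs like a coin toss. This is overfitting in its simplest form, and it gets worse as the number of variables grows relative to the number of samples.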

• Size
• Heterogeneity
Ownership / Ethics
• Falsification
Issues
Non reproducible results
No comparable
approaches
Difficult improvements
Private ≠ Not sharable
Many data sets are not
shared

• Size
• Heterogeneity
Falsification
Issues

• How does a biologist demonstrate in a paper that
he/she actually performed an experiment?
Falsification
The Data Challenge: falsification

What is a scientific fraud?
Experiment was never performed
Experiment was performed
Fictitious data
Altered data
Duplicated data
FABRICATION
FALSIFICATION
PLAGIARISM

Example 1: reusing panels

Example 2: ROI reusing

Example 3: repeated patterns

Image manipulation by country
The manipulation rate is around 6% in general, around 17% for papers containing at
least 1 gel image
[Scatter plot by country: Manipulated Papers vs. Overall Examined Papers; R² = 0.7138]
Enrico Bucci, PhD, SCI 2017 – Paestum, September 11,
Experiment 1
1364 random papers
(5000 images) from
PMC, published in Jan
2014

Evolution of manipulation rate over 5 years
Increase from 6.5% to 13.1% of manipulated images content over 5 years
Submission rate increases 10 times.
20 papers were found to reuse previously published images.
Experiment 2
1546 papers published
in Cell Death and
Disease (NPG) from
2010-2014

Manipulation frequency per
originating country
% of submitted papers which were flagged (for countries
submitting at least 10 manuscripts)
Country 1: 25.3%
Country 2: 20.6%
Country 3: 17.9%
Country 4: 16.7%
Country 5: 8.3%
Country 6: 7.1%
Japan: 3.7%
Experiment 3
Submitted manuscripts
from Jun to Aug 2017 to
5 independent journals.
20% of manipulated
images

WE DO
PRECISION GUESS WORK
BASED ON
(SOMETIMES) UNRELIABLE
DATA!!!

• De-facto standards are dangerous
What happens when
the standard is
“unilaterally” changed
by its “owner” ?
Past versions may
become unusable,
and published work
obsolete

• Standardization Authority (like IEEE, ISO, …) for
standardization of:
⁃ ACCESSION (Entrez Gene ID, miRBase accession, ….)
⁃ NAMING (HUGO)
⁃ DATA EXCHANGE (MIAME for Microarrays)
⁃ …

raw DATA MODEL
The Systems Biology
Lifecycle
SIMULATION
decomposition
and
localization
Dynamic
modeling
New Hypothesis
New biological
questions

• It is common practice (in other fields) to compare
algorithms on the same data sets (e.g. ISCAS circuits,
SPEC software)
• Benchmarks increase competitiveness, comparability,
reproducibility, REUSABILITY
• There are efforts (BioPerf, BAliBASE, Affycomp), but
there is a general lack of “benchmarking culture”.

Model requirements
HIERARCHY
ENCAPSULATION
DYNAMICS!!!!
STOCHASTICITY
SPATIALITY
MOBILITY
SELECTIVE COMMUNICATION

“A Whole-Cell Computational Model Predicts Phenotype from Genotype”
A model of
Mycoplasma genitalium,
• 525 genes
• Using 1,900 experimental
observations from 900
studies, they created the
software model, which requires
128 computers to run.

KNOWLEDGE BASES vs. PREDICTION
DATA-DRIVEN vs. HYPOTHESIS-BASED
Modeling approaches
SYSTEMS BIOLOGY

• Granularity
• Pathways (KEGG, Reactome, Ingenuity, …)
• Only Gene-2-Gene networks. Missing miRNA interactions,
TF, lncRNA, …
• Scalability
• Boolean Networks
• Two-valued logic to model a continuous phenomenon
(expression)
• But…. Simulation complexity grows exponentially!!!
⁃ 100 genes = 2^100 ≈ 10^30 states ➔ N^100 with N-valued logic
• The Data challenge!
Issues
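A minimal Python sketch of a Boolean network and of why exhaustive simulation does not scale. The three-gene ring used here (A is repressed by C, B follows A, C follows B) and its synchronous update rule are hypothetical, chosen only because the whole state space can be listed by hand.

```python
from itertools import product


def step(state):
    """Synchronous update of a hypothetical 3-gene ring: A' = NOT C, B' = A, C' = B."""
    a, b, c = state
    return (not c, a, b)


def attractor_of(state):
    """Follow the dynamics from 'state' until a state repeats; return the cycle reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return frozenset(seen[seen.index(state):])


all_states = list(product([False, True], repeat=3))  # 2^3 = 8 states
attractors = {attractor_of(s) for s in all_states}

print("number of states:", len(all_states))
print("attractor lengths:", sorted(len(cycle) for cycle in attractors))

# The same exhaustive enumeration for 100 genes would need 2^100 states.
print("states for 100 genes:", 2 ** 100)
```

With 3 genes the 8 states can be enumerated and every attractor found. Each additional gene doubles the state space, so at 100 genes the count exceeds 10^30 and enumeration is no longer an option; sampling or structural analysis has to replace it.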

MODEL CHALLENGES

• Machine Learning is the ability of computer systems to
infer their own knowledge, by extracting patterns from
raw data.
• Deep Learning (DL) avoids the need for human
operators to formally specify all of the knowledge
• DL achieves great power and flexibility by representing
the world as a self-generated hierarchy of concepts.

• Traditional ML approaches might have to be
optimized to adapt to the peculiar characteristics of
biological data (e.g. curse of dimensionality)
• Often parameter-driven
• Benchmarks are needed

SHAPE
Reinhardtius hippoglossoides
Pleuronectes platessa
How to make a decision?
MORPH. DETAILS, etc ...
TEXTURE

F.I.S.HUB
Knowledge-Based: features are
hardcoded into the classifier.
Sardina pilchardus vs. Sprattus sprattus: m = 1,129
Hippoglossus hippoglossus vs. Microstomus kitt: m = 1,612
Merlangius merlangus vs. Pollachius virens: m = 1,741
[Example images, one labeled Sardina pilchardus and one marked wrong; m is the metric]
F.I.S.HUB – Classifier results
Deep Learning: features are discovered
by the neural network
25 species
> 15k photos
UK & IT
Acc. > 92%

Imagine ...
• The number of variables is so huge that we can
easily picture parts of the landscape that look (to
us) almost identical, but may be different in
small details.

raw DATA MODEL
The Systems Biology
Lifecycle
SIMULATION
decomposition
and
localization
Dynamic
modeling
New Hypothesis
New biological
questions

• Multidisciplinary Teams/Individuals
• Planning Education
• Good/Bad Practices
• Fill the language gap

• Benso A.; Di Carlo S.; Politano G.; Savino A.; Bucci E.
Alice in "Bio-land": engineering challenges in the
world of Life-Sciences. IT Professional, Vol. 16,
pp. 38-47, ISSN: 1520-9202,
DOI: 10.1109/MITP.2014.45
Related readings
