Keynote presentation by Prof. Alfredo Benso (SysBio Group, Politecnico di Torino) at ICIIBMS, the IEEE International Conference on Intelligent Informatics and BioMedical Sciences, on Nov 26, 2017, in Okinawa (Japan).
Tales from BioLand - Engineering Challenges in the World of Life Sciences
1. Tales from BioLand
Engineering Challenges in
the World of Life Sciences
Prof. Alfredo Benso
Politecnico di Torino, Torino, Italy
www.sysbio.polito.it
ICIIBMS - International Conference on
Intelligent Informatics and BioMedical
Sciences
3. Engineering is for BUILDING systems
Life Sciences are for UNDERSTANDING
(REVERSE ENGINEERING) life
4. • “Reverse engineering is the process of discovering the
functional principles of a device, object, or system
through analysis of its structure, function, and
operation”.
• “Reverse engineering is the process of extracting
knowledge from something and reproducing it, or
reproducing anything based on the extracted
information”.
Reverse Engineering
6. I was once asked “So is it easier to
synthesize an organism than
to understand it?”
“What I cannot create, I do not understand”, Feynman 1988
7.
8. Reductionism (Bottom-up)
• Understanding of the parts means
understanding of the whole
• Focus on parts
The properties of the whole
system can be explained in terms
of its parts.
Reverse Engineering
Reductionism vs holism
Holism (Top-down)
• To understand the whole we must
understand also the relations
between the parts in the whole
• Focus on relationships
The system cannot be explained by
component parts alone.
Instead, the system as a whole
determines how the parts behave.
9. Reverse Engineering
BOTTOM-UP / REDUCTIONIST
DATA driven
SYSTEMIC / TOP-DOWN /
HOLISTIC
MODEL driven
Pathway / biological networks
Genes
12. Problem
DOES THE COMPLEXITY (IN MATHEMATICAL TERMS) OF
THE SYSTEM DRIVE THE METHODOLOGICAL APPROACH
TO BE USED TO UNDERSTAND (REVERSE ENGINEER)
IT?
The Methodology
Challenge
YES!!!!!!!
14. • We can consider the effect of each system component
(“variable”) separately, because the sum of their
effects equals the effect of their sum.
Linear Systems
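The superposition property stated on this slide can be checked numerically. The sketch below is only an illustration: the two response functions and their coefficients are hypothetical and do not come from the talk.

```python
def linear_response(x, y):
    """Hypothetical linear system: the output is a weighted sum of the inputs."""
    return 2.0 * x + 3.0 * y


def saturating_response(x, y):
    """Hypothetical non-linear system: the combined input saturates."""
    total = x + y
    return total / (1.0 + total)


def obeys_superposition(system, x, y, tol=1e-12):
    """True if the joint effect equals the sum of the separate effects."""
    separate = system(x, 0.0) + system(0.0, y)
    joint = system(x, y)
    return abs(joint - separate) <= tol


if __name__ == "__main__":
    # The linear system can be studied one variable at a time.
    print(obeys_superposition(linear_response, 1.0, 2.0))
    # The saturating system cannot: the variables interact.
    print(obeys_superposition(saturating_response, 1.0, 2.0))
```

When the check fails, studying each variable in isolation no longer predicts the combined behaviour, which is the situation the following slides describe.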
15. • Enzyme / Substrate
• Simple pharmacokinetics
• Newtonian Physics
Linear Systems
Examples
16. • Linear systems are easy to understand, even for
non-mathematicians, and are also easy to visualize.
• For this reason a large part of the Life Sciences (LS) world
(and the medical one in particular) still reasons in linear
terms.
• Linearity is rare in biological systems!
Linear Systems
17. • Properties emerge from the interaction of
their parts (and cannot be predicted only
from the properties of the parts).
Complex Systems
• Complex Systems’ dynamics heavily depend on initial
conditions and perturbations (the butterfly effect…..)
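Sensitivity to initial conditions can be illustrated with a toy model. The sketch below uses the logistic map, a textbook example that is not mentioned in the talk and is chosen here only as an assumption for illustration; the function names and parameter values are hypothetical.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x) starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_divergence(x0, perturbation=1e-9, r=4.0, steps=50):
    """Largest gap between two trajectories whose starts differ slightly."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + perturbation, r, steps)
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    # Chaotic regime (r = 4): a tiny perturbation grows to a large gap.
    print(max_divergence(0.2, r=4.0))
    # Stable regime (r = 2.5): the same perturbation stays tiny.
    print(max_divergence(0.2, r=2.5))
```

The same rule with a different parameter gives either a predictable or an unpredictable system, so the rule alone does not tell us how hard the system is to reverse engineer.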
19. • Can we learn how a car engine works just by studying
(some of) its individual components separately?
Issue
20. Linear Systems ➜ Reductionism / Holism
Complex Systems ➜ Reductionism / Holism
It would NOT BE ABLE TO
IDENTIFY properties that
emerge from the interactions
between its parts
Linear vs Complex Systems
22. • Middle ground between data driven and model driven
approaches
• Cross-fertilization between biology and physics,
computer science, mathematics, chemistry, and
engineering.
24. raw DATA MODEL
The Systems Biology
Lifecycle
decomposition
and
localization
Dynamic
modeling
SIMULATION
New Hypothesis
New biological
questions
Model refinement
Data recomposition
25. • Systems in this context generally are modeled as large
networks of integrated components exhibiting non-linear
dynamical interactions.
• Protein-Protein-Interaction
• Gene Regulatory Networks
• Metabolic Networks
• Interactomes
A shift in perspective
26. A shift in perspective
• In 1998, Oltvai, a cell biologist, and Barabasi, a physicist
who was studying the structure of the Internet, were
neighbors in Chicago.
• At the time, Barabasi had already shown that the Internet is a
non-random network, and that its connectivity structure
influences its function.
• One year later, in 1999, they showed that the metabolic
pathways of yeast define a network whose structure is
very similar to that of the Internet.
Then…
• HUBs
• P53
• Motifs
• ….
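One standard way to obtain a non-random network with hubs is growth by preferential attachment (the Barabási-Albert model). The sketch below is an illustration under that assumption, not the analysis behind the results cited on this slide; the function names, network size, and seed are hypothetical.

```python
import random


def preferential_attachment(n_nodes, m=2, seed=0):
    """Grow a network in which new nodes prefer well-connected nodes.

    Returns a dict mapping each node to the set of its neighbours.
    """
    rng = random.Random(seed)
    # Start from a small, fully connected core of m + 1 nodes.
    neighbours = {i: set(range(m + 1)) - {i} for i in range(m + 1)}
    # Each node appears here once per edge endpoint, so a uniform draw
    # selects a node with probability proportional to its degree.
    endpoints = [i for i in neighbours for _ in neighbours[i]]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        neighbours[new] = set()
        for t in targets:
            neighbours[new].add(t)
            neighbours[t].add(new)
            endpoints.extend((new, t))
    return neighbours


def degree_summary(neighbours):
    """Maximum and mean degree of the network."""
    degrees = [len(v) for v in neighbours.values()]
    return {"max": max(degrees), "mean": sum(degrees) / len(degrees)}


if __name__ == "__main__":
    net = preferential_attachment(500)
    # A few hubs end up with far more links than the average node.
    print(degree_summary(net))
```

In a random network of the same size and density, node degrees would cluster around the mean; here a handful of early nodes become hubs.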
27. • High-throughput DATA (NGS, ‘omics’, imaging, …) is
the “facilitator”.
• The technology to create this data (Biotechnology) is the
key. The “wider” and better we see, the better we can
understand how systems work.
32. The Cost of
Sequencing DNA Has
Fallen Over 100,000x
in the Last Ten Years
The Data Challenge: size
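To put the 100,000x figure in perspective, it can be converted into a halving time. The sketch below does that arithmetic; the 24-month baseline is an assumed Moore's-law-style rate used only for comparison and is not taken from the slide.

```python
import math


def halving_time_months(fold_reduction, years):
    """Months per halving implied by a total fold reduction over `years`."""
    halvings = math.log2(fold_reduction)
    return years * 12.0 / halvings


def fold_after(halving_months, years):
    """Total fold reduction after `years` at one halving every `halving_months`."""
    return 2 ** (years * 12.0 / halving_months)


if __name__ == "__main__":
    # 100,000x in ten years: the cost halves roughly every 7 months.
    print(round(halving_time_months(100_000, 10), 2))
    # Assumed baseline: halving every 24 months gives only 32x in ten years.
    print(fold_after(24, 10))
```

Data production has therefore grown much faster than a 24-month halving of compute cost would absorb, which is one way to read the "size" challenge.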
33. • Enormous Density
⁃ 1000x Ocean Water
• Highly Dynamic Microbial Ecology
⁃ Hundreds to Thousands of Species
• Horizontal Gene Transfer
• Adaptive Selection Pressures (Immune System)
⁃ Innate and Adaptive Immune System
⁃ Macrophages and Antimicrobial proteins
• Constantly Changing Environmental Pressures
⁃ Diet
⁃ Antibiotics
⁃ Pharmaceuticals
The human Microbiome
The Data Challenge: size
34. Your Microbiome is
Your “Near-Body” Environment
and its Cells
Contain 200-2000x
as Many DNA Genes
As Your Human Cells
More Microbe Cells Than Human Cells
DNA-bearing Cells in Your Body
The Data Challenge: size
[Bar chart: Human Genome vs. Metagenomics, vertical axis from 0 to 3,500,000]
35. To Map Out the Dynamics of Autoimmune Microbiome Ecology
Couples Next Generation Genome Sequencers to Big Data Supercomputers
Source: Weizhong Li, UCSD
UCSD Team Used 25 CPU-years
to Compute
Comparative Gut Microbiomes
Starting From
2.7 Trillion DNA Bases
of Healthy and IBD Subjects samples
Illumina HiSeq 2000 at JCVI
SDSC Gordon Data Supercomputer
Using Machine Learning to Determine Major Differences
Between Gut Microbiome in Health and Disease,
Mehrdad Y. et al., IEEE International Conference on Big Data (December 5-8,
2016)
36. In a “Healthy” Gut Microbiome:
Large Taxonomy Variation, Low Protein Family Variation
Source: Nature, 486, 207-212 (2012)
Over 200 People
38. Issues
Sources
180+ Bio/related ONLINE
Databases [NAR 2016]
Custom / Proprietary data
Medical data
The Data Challenge: heterogeneity
• Many de-facto standards
• No compatibility
• Overlaps
• Validation / Quality
39. Size
Heterogeneity
Curse of dimensionality
Ownership / Ethics
Falsification
Issues
The Data Challenge
Overfitting
Many variables & Few
samples
Machine Learning issues
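The "many variables and few samples" problem can be sketched with purely random data: when there are far more features than samples, some feature will correlate strongly with the outcome by chance alone. Everything in the sketch below (function names, sample sizes, seed) is a hypothetical illustration, not data from the talk.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def best_spurious_correlation(n_samples, n_features, seed=0):
    """Strongest |correlation| between random features and a random outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    # Few samples, many variables: an apparently strong "marker" appears.
    print(best_spurious_correlation(n_samples=10, n_features=2000))
    # Many samples, few variables: chance correlations stay small.
    print(best_spurious_correlation(n_samples=1000, n_features=20))
```

A model selected on the first kind of data set would fit it well and generalize poorly, which is the overfitting risk listed on the slide.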
40. Size
Heterogeneity
Curse of dimensionality
Ownership / Ethics
Falsification
Issues
Non reproducible results
No comparable
approaches
Difficult improvements
Private ≠ Not sharable
Many data sets are not
shared
41. Size
Heterogeneity
Curse of dimensionality
Ownership / Ethics
Falsification
Issues
42. • How does a biologist demonstrate in a paper that
he/she actually performed an experiment?
Falsification
The Data Challenge: falsification
43. What is a scientific fraud?
Experiment was never performed
Experiment was performed
Fictitious data
Altered data
Duplicated data
FABRICATION
FALSIFICATION
PLAGIARISM
The Data Challenge: falsification
47. Image manipulation by country
The manipulation rate is around 6% in general, around 17% for papers containing at
least 1 gel image
[Scatter plot of Manipulated Papers (0 to 25) against Overall Examined Papers (0 to 300), R² = 0.7138]
Enrico Bucci, PhD, SCI 2017 – Paestum, September 11,
Experiment 1
1364 random papers
(5000 images) from
PMC, published in Jan
2014
48. Evolution of manipulation rate over 5 years
Manipulated image content increased from 6.5% to 13.1% over 5 years.
The submission rate increased 10 times.
20 papers were found to reuse previously published images.
Enrico Bucci, PhD, SCI 2017 – Paestum, September 11,
Experiment 2
1546 papers published
in Cell Death and
Disease (NPG) from
2010-2014
49. Manipulation frequency per
originating country
% of submitted papers which were flagged (for countries
submitting at least 10 manuscripts):
Country 1: 25.3%
Country 2: 20.6%
Country 3: 17.9%
Country 4: 16.7%
Country 5: 8.3%
Country 6: 7.1%
Japan: 3.7%
Enrico Bucci, PhD, SCI 2017 – Paestum, September 11,
Experiment 3
Submitted manuscripts
from Jun to Aug 2017 to
5 independent journals.
20% of manipulated
images
50. WE DO
PRECISION GUESS WORK
BASED ON
(SOMETIMES) UNRELIABLE
DATA!!!
52. • De-facto standards are dangerous
What happens when
the standard is
“unilaterally” changed
by its “owner” ?
Past versions may
become unusable,
and published work
obsolete
54. • Standardization Authority (like IEEE, ISO, …) for
standardization of:
⁃ ACCESSION (Entrez Gene ID, miRBase accession, ….)
⁃ NAMING (HUGO)
⁃ DATA EXCHANGE (MIAME for Microarrays)
⁃ …
55. raw DATA MODEL
The Systems Biology
Lifecycle
SIMULATION
decomposition
and
localization
Dynamic
modeling
New Hypothesis
New biological
questions
Model refinement
Data recomposition
56. • Common practice (in other fields) to compare
algorithms on the same data sets (E.g. ISCAS circuits,
SPEC software)
• They increase competitiveness, comparability,
reproducibility, REUSABILITY
• There are efforts (BioPerf, BAliBASE, Affycomp), but
there is a general lack of “benchmarking culture”.
57. raw DATA MODEL
The Systems Biology
Lifecycle
SIMULATION
decomposition
and
localization
Dynamic
modeling
New Hypothesis
New biological
questions
Model refinement
Data recomposition
60. “A Whole-Cell Computational Model Predicts Phenotype from Genotype”
A model of
Mycoplasma genitalium,
• 525 genes
• Using 1,900 experimental
observations from 900
studies, they created the
software model, which requires
128 computers to run.
61. KNOWLEDGE BASES vs. PREDICTION
DATA-DRIVEN vs. HYPOTHESIS-BASED
Modeling approaches
The Modelling Challenge
SYSTEMS BIOLOGY
62. Granularity
Pathways (KEGG, Reactome, Ingenuity, …)
Only Gene-2-Gene networks. Missing miRNA interactions,
TF, lncRNA, …
Scalability
Boolean Networks
Two-valued logic to model a continuous phenomenon
(expression)
But… simulation complexity grows exponentially!!!
• 100 genes = 2^100 ≈ 10^30 states ➔ N^100
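The exponential growth of the state space can be made concrete with a toy Boolean network. In the sketch below the 3-gene ring and its update rule are hypothetical, and reading "N^100" as the state count for N discrete expression levels is an assumption.

```python
from itertools import product


def state_count(n_genes, levels=2):
    """Number of network states for n_genes, each taking `levels` values."""
    return levels ** n_genes


def ring_update(state):
    """Hypothetical 3-gene ring: C represses A, A activates B, B activates C."""
    a, b, c = state
    return (not c, a, b)


def attractor_lengths(n_genes, update):
    """Enumerate all 2**n_genes states; return the sorted attractor cycle lengths."""
    cycles = set()
    for start in product((False, True), repeat=n_genes):
        visited = {}
        state, t = start, 0
        # Follow the synchronous dynamics until a state repeats.
        while state not in visited:
            visited[state] = t
            state = update(state)
            t += 1
        # The repeated state lies on the attractor; walk once around it.
        cycle, cur = [], state
        for _ in range(t - visited[state]):
            cycle.append(cur)
            cur = update(cur)
        cycles.add(frozenset(cycle))
    return sorted(len(c) for c in cycles)


if __name__ == "__main__":
    print(attractor_lengths(3, ring_update))  # exhaustive search over 8 states
    print(state_count(100))                   # 2^100: exhaustive search is hopeless
```

Exhaustive enumeration works for 3 genes but not for 100, so larger networks need sampling or other approximations.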
The Data challenge!
Issues
64. • Machine Learning is the ability of computer systems to
infer their own knowledge, by extracting patterns from
raw data.
• Deep Learning (DL) avoids the need for human
operators to formally specify all of the knowledge.
• DL achieves great power and flexibility by representing
the world as a self-generated hierarchy of concepts.
MODEL CHALLENGES
65. • Traditional ML approaches might have to be
optimized to adapt to the peculiar characteristics of
biological data (e.g., curse of dimensionality)
• Often parameter-driven
• Benchmarks are needed
MODEL CHALLENGES
68. Imagine ...
• The number of variables is so huge that we can
easily picture parts of the landscape that look (to
us) almost identical, but may be different in
small details.
69. raw DATA MODEL
The Systems Biology
Lifecycle
SIMULATION
decomposition
and
localization
Dynamic
modeling
New Hypothesis
New biological
questions
Model refinement
Data recomposition
71. • Benso A., Di Carlo S., Politano G., Savino A., Bucci E.
Alice in "Bio-land": engineering challenges in the
world of Life-Sciences. IT Professional, Vol. 16,
pp. 38-47, ISSN: 1520-9202,
DOI: 10.1109/MITP.2014.45
Related readings