Kernel-based machine learning methods for predicting structured, non-tabular data arising in biomedicine and digital health, especially in applications on small molecules such as drugs and metabolites.
Kernel-based machine learning methods
1. Associate professor Juho Rousu
Kernel Methods, Pattern Analysis and Computational Metabolomics (KEPACO)
2. KEPACO in a Nutshell
We develop kernel-based machine learning methods for predicting structured, non-tabular data arising in biomedicine and digital health, especially in applications on small molecules such as drugs and metabolites.
Examples of research questions:
• How to identify a metabolite from its mass spectrum?
• How to find multivariate associations between SNPs and metabolites?
• How to classify an enzyme according to its metabolomic role?
Flowchart for a typical study
Contact: Juho Rousu
1. Structured Big Data
[Figure: example feature vectors, Nodes Binary (NB) and Nodes Intensity (NI); plot with both axes labelled "Number of active cell lines"]
2. Efficient kernel (non-linear similarity metric) computation
3. Definition as a regularized learning problem (max-margin and/or sparsity)
4. Efficient training using convex optimization (saddle point, conditional gradient)
[Figure: percentage of correctly identified 2D structures vs. top k (Metlin - PubChem); methods compared: our method, CFM-ID, MIDAS, MetFrag, Random]
5. Prediction and interpretation
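Steps 2 to 5 of the flowchart can be illustrated with a minimal sketch. It uses kernel ridge regression with a Gaussian kernel as a simple stand-in, which is an assumption for illustration only: the slides mention max-margin and sparsity-based formulations, which are not reproduced here. The function names, the toy data and the parameters `gamma` and `lam` are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Step 2: Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def fit_kernel_ridge(K, y, lam=1e-3):
    """Steps 3 and 4: the ridge-regularized least-squares problem is convex
    and has the closed-form solution (K + lam * I) alpha = y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict(K_test_train, alpha):
    """Step 5: predictions are kernel-weighted sums over training points."""
    return K_test_train @ alpha

# Toy data (invented): learn sin(x) from 40 evenly spaced points.
X_train = np.linspace(-3, 3, 40)[:, None]
y_train = np.sin(X_train[:, 0])
X_test = np.linspace(-2, 2, 5)[:, None]

K = gaussian_kernel(X_train, X_train)
alpha = fit_kernel_ridge(K, y_train)
y_pred = predict(gaussian_kernel(X_test, X_train), alpha)
max_abs_error = float(np.max(np.abs(y_pred - np.sin(X_test[:, 0]))))
```

Only the kernel function touches the raw inputs, so replacing it with a kernel defined on structured objects (for example spectra or molecular graphs) would leave the training and prediction steps unchanged.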
3. KEPACO Mission: Predicting Structured data
Sequences, Hierarchies, Networks
Many data sets have logical structure. How can we make use of that structure to make models faster and more accurate?
In particular, we are interested in predicting structured output.
4. Example: Predicting Network Response
Problem:
• To predict the spread of activation in a complex network, given an action/stimulus
Example:
• Input: post in a social network
• Output: subnetwork of friends who share the post
[Figure: social network with nodes a to i and users Alice, David, Bob, Jacob and Ben; post texts shown: "Dutch get the championship!" (repeated across several users) and "Soccer is not my cup of tea."]
Su, Hongyu, Aristides Gionis, and Juho Rousu. "Structured prediction of
network response." Proceedings of the 31st International Conference on
Machine Learning (ICML-14). 2014.
5. Example: Metabolite identification through machine learning
• From a set of MS/MS spectra (x = structured input), learn a model to predict molecular fingerprints (y = multilabel output)
• With the predicted fingerprints, retrieve candidate molecules from a large molecular database
• Our CSI:FingerID (http://www.csi-fingerid.org) tool is the most accurate metabolite identification tool to date
Dührkop, K., Shen, H., Meusel, M., Rousu, J., & Böcker, S. (2015). Searching
molecular structure databases with tandem mass spectra using CSI: FingerID.
Proceedings of the National Academy of Sciences, 112(41), 12580-12585.
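The retrieval step on this slide (predicted fingerprint in, ranked candidate molecules out) can be sketched as follows. This is a minimal illustration that ranks candidates by Tanimoto similarity, which is an assumption: it is not the scoring used by CSI:FingerID. The candidate names and fingerprint bits are invented.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return float(np.logical_and(a, b).sum() / union)

def rank_candidates(predicted_fp, candidates):
    """Sort database candidates by similarity to the predicted fingerprint."""
    scored = [(name, tanimoto(predicted_fp, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical fingerprint predicted from an MS/MS spectrum.
predicted_fp = [1, 1, 0, 1, 0, 0, 1, 0]

# Hypothetical candidate molecules with their known fingerprints.
candidates = {
    "candidate_A": [1, 1, 0, 1, 0, 0, 1, 0],
    "candidate_B": [1, 0, 0, 1, 0, 1, 1, 0],
    "candidate_C": [0, 0, 1, 0, 1, 1, 0, 1],
}

ranking = rank_candidates(predicted_fp, candidates)
```

The "top k" accuracy reported in the comparison plots corresponds to checking whether the true structure appears among the first k entries of such a ranking.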
Huibin Shen
May 15, 2014
New fingerprint prediction framework
[Figure: percentage of correctly identified structures vs. top k; methods compared: our method, FingerID, CFM-ID, MAGMa, MIDAS, MetFrag, Random]
6. Example: Drug Bioactivity Classification
• Data: x = Molecule, y = labeling of a graph of cell lines
• Compatibility score F(x,y): how well a subset of cell lines goes together with the drug
• Co-occurring features: molecular fingerprints vs. subsets of cell lines
• Prediction: finding the best scoring labeling of the graph
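The idea of scoring a labeling with F(x,y) and predicting the best-scoring labeling can be sketched on a toy graph. The score below combines per-node terms with pairwise edge terms and is maximized by brute-force enumeration; both choices are assumptions for illustration, feasible only for tiny graphs, and not the inference method of the cited paper. The three cell line names come from the slide, while all weights and fingerprint values are invented.

```python
import itertools
import numpy as np

def compatibility(phi_x, labeling, node_weights, edges, edge_weight):
    """F(x, y): node terms plus pairwise terms along the graph edges."""
    # Node terms: each label y_v in {-1, +1} times a linear score of the molecule.
    node_scores = node_weights @ phi_x
    score = float(np.dot(labeling, node_scores))
    # Edge terms: reward equal labels on connected cell lines.
    for u, v in edges:
        score += edge_weight * labeling[u] * labeling[v]
    return score

def predict_labeling(phi_x, node_weights, edges, edge_weight):
    """Return the labeling with the highest compatibility (exhaustive search)."""
    n_nodes = node_weights.shape[0]
    labelings = itertools.product([-1, 1], repeat=n_nodes)
    return max(
        labelings,
        key=lambda y: compatibility(phi_x, y, node_weights, edges, edge_weight),
    )

cell_lines = ["MCF7", "T47D", "SF_268"]   # nodes of the toy graph
edges = [(0, 1)]                          # hypothetical edge MCF7 - T47D
phi_x = np.array([1.0, 0.0, 1.0])         # invented molecular fingerprint
node_weights = np.array([                 # invented weights, one row per cell line
    [2.0, 0.5, 0.0],
    [-0.1, 1.0, 0.0],
    [0.0, 0.3, -2.0],
])

best_labeling = predict_labeling(phi_x, node_weights, edges, edge_weight=0.5)
best_score = compatibility(phi_x, best_labeling, node_weights, edges, 0.5)
```

In this toy setting the second cell line has a slightly negative node score on its own, but the edge term pulls it to the same label as its neighbour, which shows how the graph couples the individual predictions.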
[Figure: graph of cell lines; node labels: MCF7, MDA_MB_231, HS578T, BT_549, T47D, SF_268, SF_295, SF_539, SNB_19, SNB_75, U251, COLO205, HCC_2998, HCT_116, HCT_15, HT29, KM12, SW_620, CCRF_CEM, HL_60, K_562, MOLT_4, RPMI_8226, SR, LOXIMVI, MALME_3M, M14, SK_MEL_2, SK_MEL_28, SK_MEL_5, UACC_257, UACC_62, MDA_MB_435, MDA_N, A549, EKVX, HOP_62, HOP_92, NCI_H226, NCI_H23, NCI_H322M, NCI_H460, NCI_H522, IGROV1, OVCAR_3, OVCAR_4, OVCAR_5, OVCAR_8, SK_OV_3, NCI_ADR_RES, PC_3, DU_145, 786_0, A498, ACHN, CAKI_1, RXF_393, TK_10, UO_31]
Su, Hongyu, and Juho Rousu. "Multilabel classification through random
graph ensembles." Machine Learning 99.2 (2013): 231-256.
7. Example: Multivariate association meta-analysis
• Problem: find associations between groups of SNPs (= genotype) and groups of metabolites (= phenotype)
• Meta-analysis setting
• Several studies analyzed together
• No original data available, only summary statistics of individual studies
• Our metaCCA tool can recover multivariate associations nearly as accurately as if the original data were available
2.10.2015
Cichonska A, Rousu J, Marttinen P, Kangas AJ, Soininen P, Lehtimäki T,
Raitakari OT, Järvelin MR, Salomaa V, Ala-Korpela M, Ripatti S. metaCCA:
Summary statistics-based multivariate meta-analysis of genome-wide
association studies using canonical correlation analysis. bioRxiv. 2015 Jan
1:022665.
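Canonical correlation analysis needs only the covariance blocks of the two variable groups, not the individual-level data, which is why a summary-statistics approach is possible in principle. The sketch below is plain CCA computed from covariance matrices and is an illustration under that assumption. It is not the metaCCA implementation, and how the covariance blocks are estimated from summary statistics is not shown. The simulated data and all function names are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(S)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

def cca_from_covariances(S_xx, S_yy, S_xy):
    """Canonical correlations and weights from covariance blocks only."""
    Wx = inv_sqrt(S_xx)
    Wy = inv_sqrt(S_yy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, corr, Vt = np.linalg.svd(Wx @ S_xy @ Wy, full_matrices=False)
    a = Wx @ U       # canonical weights for X, one column per component
    b = Wy @ Vt.T    # canonical weights for Y, one column per component
    return corr, a, b

# Simulated stand-ins: X plays the role of genotypes, Y of metabolite levels.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
Y = np.column_stack([
    X[:, 0] + 0.5 * rng.normal(size=n),   # associated with the first X variable
    rng.normal(size=n),                   # unrelated noise
])

# From here on only the covariance blocks are used, not X and Y themselves.
S = np.cov(np.hstack([X, Y]), rowvar=False)
S_xx, S_yy, S_xy = S[:3, :3], S[3:, 3:], S[:3, 3:]
corr, a, b = cca_from_covariances(S_xx, S_yy, S_xy)

# Sanity check against the individual-level data.
check = float(np.corrcoef(X @ a[:, 0], Y @ b[:, 0])[0, 1])
```

The final check compares the leading canonical correlation with the empirical correlation of the projected individual-level data, which should agree up to floating-point error.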
8. Example: Biological network reconstruction
Input: genomic measurements Output: biological network
Pitkänen E, Jouhten P, Hou J, Syed MF, Blomberg P, Kludas J, Oja M, Holm L,
Penttilä M, Rousu J, Arvas M. Comparative genome-scale reconstruction of gapless
metabolic networks for present and ancestral species. PLoS Comput Biol. 2014 Feb
1;10(2):1003465.
9. Example: Synthetic biology
• Goal: energy- and carbon-efficient industrial processes through synthetic biology
• Consortium: VTT, Aalto, U. Turku
Our focus
• In silico design, modelling and optimization of synthetic biosystems
• Systems Modelling + Machine Learning
[Figure: LiF cycle: fermentation experiments, genome-wide measurements, in silico experimental design, strain/process modification]
Tekes Strategic Opening “LiF – Living Factories”, 2014-
10. Kernel Methods, Pattern Analysis & Computational Metabolomics (KEPACO)
KEPACO research group
Dr. Sandor Szedmak
Dr. Sahely Bhadra (w. S.Kaski)
Dr. Markus Heinonen (w. H. Lähdesmäki)
Dr. Celine Brouard
Dr. Elena Czeizler
Dr. Hongyu Su
MSc Huibin Shen
MSc Anna Cichonska (HIIT&FIMM)
MSc Viivi Uurtio
+ students
Funding
• Academy of Finland
• EU FP7
• Tekes
KEPACO group summer 2015