The interplay between data-driven and theory-driven methods for chemical sciences
Ichigaku Takigawa
• Medical-risk Avoidance based on iPS Cells Team,

RIKEN Center for Advanced Intelligence Project (AIP)
• Institute for Chemical Reaction Design and Discovery
(WPI-ICReDD), Hokkaido University
ichigaku.takigawa@riken.jp
Research Interests: Ichigaku Takigawa
• Machine learning and data mining technologies
• Data-intensive approaches to natural sciences
10 years Hokkaido University (Computer Sciences) (1995-2004)
• Grad School. Engineering
7 years Kyoto University (Bioinformatics, Chemoinformatics) (2005-2011)
• Bioinformatics Center, Inst. Chemical Research
• Grad School. Pharmaceutical Sciences
7 years Hokkaido University (Computer Sciences) (2012-2018)
• Grad School. Information Science and Technology
• JST PRESTO on Materials Informatics (2015-2018)
? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology)
• Center for AI Project (2019-)
Hokkaido University (Machine Learning + Chemistry)
• Inst. Chemical Reaction Design and Discovery
From AAAI-20 Oxford-Style Debate
Machine Learning with "Discrete Structures"
Q. How can we make use of "discrete structures" for prediction...?
sets, sequences, branchings or hierarchies (trees), networks (graphs), relations,
logic, rules, combinations, permutations, point clouds, algebra, languages, ...
(Examples shown as molecular graphs: organic molecules built from H, C, N, O, Cl, ...)
• The target variables can come with "discrete structures"
• The relationship between variables can have "discrete structures"
• The ML models themselves can involve "discrete structures"
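As a concrete, purely illustrative example of the first kind of discrete structure, the sketch below encodes a small molecular fragment as a labeled graph using plain Python dictionaries and derives a toy count-based feature vector from it; real pipelines use much richer fingerprints or graph neural networks, and the fragment itself is only an example.

from collections import Counter

# A minimal, illustrative sketch: a molecule as a labeled graph.
# Nodes are atoms (labeled by element); edges are bonds (labeled by bond order).
atoms = {0: "C", 1: "C", 2: "O", 3: "N"}      # heavy atoms of an acetamide-like fragment
bonds = {(0, 1): 1, (1, 2): 2, (1, 3): 1}     # (atom i, atom j) -> bond order

def labeled_edge_counts(atoms, bonds):
    """Toy 'fingerprint': count labeled edges such as ('C', 'O', 2) for a C=O bond."""
    counts = Counter()
    for (i, j), order in bonds.items():
        a, b = sorted((atoms[i], atoms[j]))
        counts[(a, b, order)] += 1
    return counts

# Such counts (or far richer graph features) become the input x of an ML model.
print(labeled_edge_counts(atoms, bonds))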
Past work: Data-intensive approaches to life science
• Transcriptional regulation of metabolism

Bioinformatics 2007, 2008a, 2008b, 2009, 2010

Nucleic Acids Res 2011, PLoS One 2012, 2013, KDD'07
• Transcription regulation by mediator complex

Nat Commun 2015, Nat Commun 2020
• Repetitive sequences in genomes

Discrete Appl Math 2013, 2016, AAAI 2020
• Polypharmacology of drug-target interaction networks

PLoS One 2011, Drug Discov Today 2013, Brief Bioinform 2014, BMC Bioinformatics 2020
• Copy number variations in neurological disorder (MSA, MS)

Mol Brain 2017
• Substrate analysis of modulator protease (Calpain family)

Mol Cell Proteom 2016, Genome Informatics 2009, http://calpain.org
• Cell competition in cancer

Cell Reports 2018, Sci Rep 2015
• Phenotype patterns of somatic mutations in breast cancer

Brief Bioinform 2014
Recent work: Data-intensive approaches to chemistry
Machine learning for heterogeneous catalysis
• Haber–Bosch Process ("fertilizer from air", industrial synthesis of ammonia / artificial nitrogen fixation): ferrous metal catalysts
• Exhaust gas purification (NOx, CO, HC → harmless N2, CO2, H2O): noble metal catalysts (Pt, Pd, Rh, ...)
• Conversion of methane (to ethane, ethylene, methanol, ...): various metallic catalysts (Li, rare earths, alkaline earths)
• ACS Catalysis 2020 (review)
• ChemCatChem 2019 (front cover)
• J Phys Chem C 2018a, 2018b, 2019 (cover)
• RSC Advances 2016 (highlighted in Chemical World)
Heterogeneous catalysis: catalysis where the phase of the catalyst differs from the phase of the reactants or products (e.g., reactants in the gas phase, catalysts in the solid phase).
Effective use of data is another key in natural sciences
[Shown: three review/viewpoint articles]
• Sanchez-Lengeling & Aspuru-Guzik, "Inverse molecular design using machine learning: Generative models for matter engineering," Science, 361, pp. 360-365 (2018)
• Butler, Davies, Cartwright, Isayev & Walsh, "Machine learning for molecular and materials science," Nature, 559, pp. 547-555 (2018)
• Mjolsness & DeCoste, "Machine Learning for Science: State of the Art and Future Prospects," Science, 293, pp. 2051-2055 (2001)
Science is changing, the tools of science are changing. And
that requires different approaches. ── Erich Bloch, 1925-2016
(alongside experiments and simulations)
And please keep in mind that unplanned data collection is too risky.
We need the right designs for data collection and the right tools to analyze it.
A bitter lesson: "low input, high throughput, no output science." (Sydney Brenner)
Today's AI has stark limitations, facing big problems
• Deep learning techniques thus far have proven to be data hungry, shallow,
brittle, and limited in their ability to generalize (Marcus, 2018)
• Current machine learning techniques are data-hungry and brittle—they can
only make sense of patterns they've seen before. (Chollet, 2020)
• A growing body of evidence shows that state-of-the-art models learn to exploit
spurious statistical patterns in datasets... instead of learning meaning in the
flexible and generalizable way that humans do. (Nie et al., 2019)
• Current machine learning methods seem weak when they are required to
generalize beyond the training distribution, which is what is often needed in
practice. (Bengio et al., 2019)
Though it's extremely powerful and useful in suitable applications!
The AI/ML community is now struggling with these...
• The Winograd Schema Challenge (Levesque et al., 2011)
A collection of 273 multiple-choice problems that require commonsense
reasoning to solve, proposed as an alternative to the Turing test (Turing, 1950).
Many methods that passed this "imitation game" style of test were
actually tricks, fairly adept at fooling humans...
Unexpectedly, it turned out that several tricks using BERT/Transformer
can crack many of these (with >90% accuracy) ...
• WinoGrande (Sakaguchi et al., 2020)
An improved collection of 44k very hard problems

(AAAI-20 Outstanding Paper Award) 

Though some top-ranking methods can have 70-80% accuracy ...
• Abstraction and Reasoning Challenge at Kaggle (Chollet, 2020)
"Kaggle’s most important AI competition" "Impossible AI Challenge"
To what extent can we currently create an AI capable of solving
reasoning tasks it has never seen before?
Examples
WSC-273
WinoGrande-1.1
(Q0) The town councillors refused to give the angry demonstrators
a permit because they feared violence. Who feared violence?
• the town councillors
• the angry demonstrators
(Q1) The trophy would not fit in the brown suitcase because it was
so small. What was so small?
• the trophy
• the brown suitcase
(due to Terry Winograd)
(Q) He never comes to my home, but I always go to his house
because the ______ is smaller.
1. home
2. house
Take home message
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. But in many cases, we cannot have enough data due to
various practical restrictions (cost, time, ethics, privacy, etc).
• Model-based learning, Neuro-symbolic or ML-GOFAI integration.

We can partly use explicit models for well-understood parts for sample
efficiency and for filling the gap between correlation and causation.
• Needs for modeling unverbalizable common sense of domain experts

We need a good strategy for building a gigantic data collection for this,

as well as self-supervised learning and/or meta learning algorithms.

"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI
systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
• Needs for novel techniques for compositionality (combinatorial
generalization), out-of-distribution prediction (extrapolation), and

their flexible transfer

We need to somehow combine and flexibly transfer partial knowledge
to generate a new thing or deal with completely new situations.
Chemistry has the first principle at the electron level
Schrödinger equation → Potential Energy Surface (PES) over geometric parameters (θ1, θ2, ...)
Outcomes we see are just discrete and combinatorial: stable state 1 → transition state → stable state 2.
They are transitions from a valley to a hill to another valley, but solving a quantum
chemical equation is needed at every point to get the energy surface.
Every substance is a combination of
only 118 elements in the periodic table.
Our end goal:
take full control over chemical reactions,
the basis for transforming substances into one another
(energy, materials, foods, colors, ...)
Recombinations of atoms and
chemical bonds are subject to
the laws of nature
Our core technology for automated reaction-path search
ADDF

Ohno & Maeda, Chem
Phys Lett, 2004
AFIR
Maeda & Morokuma, J
Chem Phys, 2010
But chemistry still remains quite empirical. Why?
Even though current chemical calculations are fairly accurate,
chemists still rely heavily on "Edisonian empiricism (trial & error)":
1. empirical knowledge (from the literature) and its flexible transfer
2. intuitions and experiences (unverbalizable common sense of experts)
Takeaway: We also need 'data-driven' bridges!
First principles are not enough for us to throw away these empirical
things; data-driven approaches (ML) play a complementary role!
Fusing multi-disciplinary methods...?
1. Theory-driven: (slow but) high-fidelity quantum-chemical simulations
2. Knowledge-driven: explicit knowns and inference
3. Data-driven: empirical prediction grounded upon multifaceted data
Indeed computational chemistry has many limitations...
Chemical reactions = recombinations of atoms and chemical bonds

subjected to the laws of nature
• Intractably large chemical space: an intractably large number of
"theoretically possible" candidates for reactions and compounds...
• Computing time and scalability issue: Simulating an Avogadro-constant
number of atoms is utterly infeasible... (we need some compromise here)
• Complexity and uncertainty of real-world systems: 

Many uncertain factors and arbitrary parameters are involved...
• Known and unknown imperfections of currently established theories:
Current theoretical calculations have many exceptions and limitations...
Yet another approach: Data-driven
[Diagram] Related factors (and their states) [Inputs] → some mechanism
(cause-and-effect relationship; governing equation?) → Outcome / Reactions [Outputs]
Theory-driven methods try to explicitly model the inner workings
of a target phenomenon (e.g. through first-principles simulations).
Data-driven methods try to precisely approximate its outer behavior
(the input-output relationship) observable as "data"
(e.g. through machine learning from a large collection of data).
The two are based on very different principles and quite complementary!
Machine Learning (ML)
Examples: generic object recognition, speech recognition (audio → "Thank you"),
machine translation ("J'aime la musique" → "I love music"),
QSAR/QSPR prediction (molecular structure → growth inhibition 0.739),
AI game players.
A new style of programming:
a technique to reproduce a transformation process (or function) whose
underlying principle is unclear and hard to model explicitly,
just by giving a lot of input-output examples.
How ML works: fitting a function to data
Inputs x → ML model f(x; θ) → Outputs y
A function f(x; θ) best fitted to a given set of example input-output pairs
(the training data): (x1, y1), (x2, y2), ..., (xn, yn).
Interpolative prediction (inside the training data) vs. extrapolative prediction (outside it).
The input x can often be ultra high-dimensional in modern ML.
"The bias-variance tradeoff": as model complexity goes from low to high,
fits to the training data move from underfitting (high bias, low variance)
to overfitting (low bias, high variance).
e.g.) Deep learning usually comes with overparametrization,
and the number of parameters can be several tens of millions
or even billions... (This is why model-complexity control or
regularization is so critical and essential in ML.)
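As a toy illustration of this fitting picture (added here for clarity; not part of the original slides), the sketch below uses NumPy to fit polynomials f(x; θ) of increasing degree to a few noisy points. The low-degree fit underfits, the high-degree fit overfits, and both become unreliable when asked to extrapolate outside the training interval.

import numpy as np

rng = np.random.default_rng(0)

# Training data: noisy samples of an unknown smooth function on [0, 1].
x_train = np.linspace(0.0, 1.0, 15)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(x_train.size)

# Fit f(x; theta) at several levels of model complexity (polynomial degree).
for degree in (1, 3, 9):
    theta = np.polyfit(x_train, y_train, degree)   # least-squares estimate of the parameters
    train_mse = np.mean((np.polyval(theta, x_train) - y_train) ** 2)

    # Interpolative vs extrapolative prediction: inside vs outside the training interval.
    y_interp = np.polyval(theta, 0.5)
    y_extrap = np.polyval(theta, 1.5)
    print(f"degree={degree}  train MSE={train_mse:.3f}  "
          f"f(0.5)={y_interp:+.2f}  f(1.5)={y_extrap:+.2f}")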
Multilevel representations of chemical reactions
SMILES (Reactants / Reagents → Products): Brc1cncc(Br)c1, C[O-], CN(C)C=O, Na+ → COc1cncc(Br)c1
Structural Formula
Steric Structures
Electronic States
As pattern languages (e.g. known facts in textbooks/databases)
As physical entities (e.g. quantum chemical calculations)
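To make these levels concrete, here is a minimal sketch (assuming the open-source RDKit toolkit is installed; not part of the original slides) that moves from the pattern-language level (SMILES) to a molecular graph and a rough 3D steric structure; the electronic-state level would then require a quantum chemistry code, which is outside this sketch.

from rdkit import Chem
from rdkit.Chem import AllChem

# Pattern-language level: parse the first reactant SMILES from the slide above.
mol = Chem.MolFromSmiles("Brc1cncc(Br)c1")
print(Chem.MolToSmiles(mol))                        # canonical SMILES
print([atom.GetSymbol() for atom in mol.GetAtoms()], mol.GetNumBonds(), "bonds")

# Steric-structure level: embed and refine a 3D conformer with a force field.
mol3d = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol3d, randomSeed=0)          # initial 3D coordinates
AllChem.MMFFOptimizeMolecule(mol3d)                 # quick force-field optimization
print(mol3d.GetNumConformers(), "conformer(s) embedded")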
Computer-assisted synthetic planning
• Corey+ 1972: E. J. Corey, R. D. Cramer III, W. J. Howe, "Computer-Assisted Synthetic Analysis for Complex Molecules. Methods and Procedures for Machine Generation of Synthetic Intermediates," J Am Chem Soc (JACS), 94(2), 440, 1972.
• Ugi+ 1993: I. Ugi et al., "Computer-Assisted Solution of Chemical Problems - The Historical Development and the Present State of the Art of a New Discipline of Chemistry," Angew Chem Int Ed Engl, 32, 202-227, 1993.
A traditional chemoinformatics topic since the 70s:

collect known chemical reactions, and search for reaction paths
e.g.) Representation, classification, exploration of chemical reactions
Knowledge-based approaches
Widely-used proprietary systems

(template-based path search and expansions)
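A tiny sketch of what "template-based" means in practice (illustrative only; proprietary systems use large curated template libraries and elaborate search strategies). Here a single hand-written SMARTS transformation, applied with RDKit, expands one molecule into candidate products; a retrosynthesis system applies such templates in the reverse direction and searches over the resulting tree. The template itself is an assumption made for this example.

from rdkit import Chem
from rdkit.Chem import AllChem

# One hand-written reaction template: replace an aromatic C-Br with C-OMe
# (written in the forward direction; retrosynthesis tools apply templates in reverse).
template = AllChem.ReactionFromSmarts("[c:1]Br>>[c:1]OC")

substrate = Chem.MolFromSmiles("Brc1cncc(Br)c1")    # reactant SMILES from the earlier slide
for product_set in template.RunReactants((substrate,)):
    for product in product_set:
        Chem.SanitizeMol(product)                   # finalize valences/aromaticity
        print(Chem.MolToSmiles(product))            # one substituted product per matching Br site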
Now towards fusing with Machine Learning...?
Purely ML-based approaches and more
ML-based chemical reaction predictions
Fermionic Neural Network
Pfau+ Ab-Initio Solution of the Many-Electron Schrödinger Equation with Deep Neural Networks. 

arXiv:1909.02487, Sep 2019.
Hamiltonian Graph Networks with ODE Integrators
Sanchez-Gonzalez+ Hamiltonian Graph Networks with ODE Integrators. 

arXiv:1909.12790, Sep 2019.
Both from DeepMind.
ML + first-principles simulations
3N-MCTS/AlphaChem

Segler+ Nature 2018
Molecular Transformer

Schwaller+ ACS Cent Sci
2019
seq2seq

Liu+ ACS Cent Sci 2017
WLDN

Jin+ NeurIPS 2017
ELECTRO

Bradshaw+ICLR 2019
WLN

Coley+ Chem Sci 2019
GPTN

Do+ KDD 2019
(Categories: Graph Neural Networks / Sequence Neural Networks / Combined or Other)
IBM RXN

Schwaller+ Chem Sci 2018
Molecule Chef

Bradshaw+ DeepGenStruct

(ICLR WS) 2019
Neural-Symbolic ML

Segler+ Chemistry 2017
Similarity-based

Coley+ ACS Cent Sci 2017
GLN

Dai+ NeurIPS2019
Transformer

Karpov+ ICANN 2019
Similar technologies go for biology!
DeepMind's AlphaFold paper
A successful example from MIT MLPDS
How to integrate data-driven and theory-driven?
• Theory-driven: first-principles simulations, logical reasoning, 

mathematical models, etc.
• Data-driven: machine learning
These are still emerging topics,
but examples of already available approaches are:
1. Wrapper: Use ML to control or plan simulations

Basically we solve the problem by simulations, but use ML models to
ask "what if?" questions to guide what simulations to run.

(Model-based optimization, Sequential design of experiments,
Reinforcement learning, Generative methods, etc)
2. Hybrid: Use ML as approximator for unsure parts of simulations

Plugging ML models into the unsure part of simulations or

calling simulations when ML predictions are less confident

(Data assimilation, domain adaptation, semi-empirical methods, etc)
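A minimal sketch of the "Hybrid" idea in point 2 (not from any of the cited systems): a Gaussian-process surrogate answers cheap queries and defers to an expensive simulation when its own predictive uncertainty exceeds a threshold. The toy objective, the kernel, and the threshold below are all illustrative assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_simulation(x):
    """Stand-in for a first-principles calculation (illustrative toy function)."""
    return float(np.sin(3.0 * x) + 0.5 * x)

# Surrogate trained on a handful of already-computed points.
X_train = np.array([[0.0], [0.5], [1.0], [2.0]])
y_train = np.array([expensive_simulation(x) for x in X_train.ravel()])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True).fit(X_train, y_train)

def predict_or_simulate(x, std_threshold=0.1):
    """Use the ML surrogate when confident; fall back to the simulation otherwise."""
    mean, std = gp.predict(np.array([[x]]), return_std=True)
    if std[0] < std_threshold:
        return mean[0], "surrogate"
    return expensive_simulation(x), "simulation"

for x in (0.25, 1.5, 3.0):
    value, source = predict_or_simulate(x)
    print(f"x={x:.2f}  value={value:+.3f}  from {source}")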
Back to our example: Heterogeneous Catalysis
Wolfgang Pauli
“God made the bulk; 

the surface was invented by the devil.”
[Surface diagram: adsorption, diffusion, desorption, dissociation, recombination;
kinks, terraces, adatoms, vacancies, steps;
gas phase (reactants) / solid phase (catalysts)]
Many hard-to-quantify intertwined factors involved.
Too complicated (impossible?) to model everything...
• multiple elementary
reaction processes
• composition, support,
surface termination,
particle size, particle
morphology, atomic
coordination environment
• reaction conditions
Notoriously complex surface reactions between different phases.
Our ML-based case studies
1. Can we predict the d-band center?
2. Can we predict the adsorption energy?
3. Can we predict the catalytic activity?
predicting DFT-calculated values by machine learning
  (Takigawa et al, RSC Advances, 2016)
predicting DFT-calculated values by machine learning
  (Toyao et al, JPCC, 2018)
predicting values from experiments reported in the 
literature by machine learning
  (Suzuki et al, ChemCatChem, 2019)
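At their core, the first two case studies are supervised regression from tabulated descriptors to a DFT-calculated target. The sketch below shows that generic workflow with scikit-learn on synthetic placeholder data; the descriptors, target, and random-forest choice are stand-ins for illustration, not the exact setup of the cited papers.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder descriptor table: rows = candidate surfaces/alloys,
# columns = tabulated descriptors (e.g. atomic radius, electronegativity, group, period, ...).
X = rng.random((200, 6))
# Placeholder target: a DFT-calculated quantity such as a d-band center or an adsorption energy.
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0, -1.0]) + 0.1 * rng.standard_normal(200)

model = RandomForestRegressor(n_estimators=300, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("cross-validated R^2:", scores.round(2))

# Once validated, the model screens new candidates far faster than running DFT for each one.
model.fit(X, y)
print("predicted value for a new candidate:", model.predict(rng.random((1, 6)))[0])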
One of the big problems we had
• Problem: very strong "selection bias" in existing datasets
Catalyst research has relied heavily on prior published data,
strongly biased toward catalyst compositions that were successful.
Example) Oxidative coupling of methane (OCM)
• 1868 catalysts in the original dataset [Zavyalova+ 2011]
• Composed of 68 different elements: 61 cations and 7 anions
(Cl, F, Br, B, S, C, and P), excluding oxygen
• Occurrences of only a few elements such as La, Ba, Sr, Cl, Mn, and F are very high.
• Widely used elements such as Li, Mg, Na, Ca, and La are also frequent in the data.
An ML model is just representative of the training data
Highly inaccurate model predictions from extrapolation (Lohninger 1999):
"Beware of the perils of extrapolation,
and understand that ML algorithms
build models that are representative of
the available training samples."
Simply focusing on targets predicted as high-performance ones
by ML built on currently available data is clearly not a good idea...
It is like excessively scrutinizing the so-so candidates
we happened to stumble upon at the very early stage of research...
In reality, the distinction between
interpolation and extrapolation
is not that clear, due to the
high dimensionality.
No guarantee for data-driven methods outside the given data
[Diagram] Related factors (and their states) [Inputs] → some mechanism
(cause-and-effect relationship) → Outcome / Reactions [Outputs]
Data-driven methods try to precisely approximate its outer behavior
(the input-output relationship) observable as "data". 

(e.g. through machine learning from a large collection of data)
Keep in mind: Given data DEFINES the data-driven prediction!
"Current machine learning techniques are data-hungry and brittle
—they can only make sense of patterns they've seen before." 

(Chollet, 2020)
Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
what we already know and get something close to what we expect
• Exploration:
something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
[Figure] An ML prediction (function fitting) of the amount we want to maximize,
with its prediction variance (uncertainty): very confident where we have data,
not confident where there is no data around. An acquisition function such as
"expected improvement" balances relying on available data near the current max (best)
against probing areas with no or scarce data.
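As a minimal illustration of how such an acquisition function combines the prediction and its uncertainty (the kernel, the toy data, and the expected-improvement form below are generic textbook choices, not the setup of the studies above):

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# A few already-evaluated points (e.g. catalysts tested so far) and their measured performance.
X_obs = np.array([[0.1], [0.4], [0.6], [0.9]])
y_obs = np.array([0.2, 0.8, 0.6, 0.3])

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=0.2),
                              alpha=1e-3, normalize_y=True).fit(X_obs, y_obs)

def expected_improvement(X_cand, y_best, xi=0.01):
    """EI(x) = E[max(f(x) - y_best - xi, 0)] under the GP posterior."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Score a grid of untried candidates and pick the next experiment/simulation to run.
X_cand = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
ei = expected_improvement(X_cand, y_best=y_obs.max())
print(f"next candidate to evaluate: x = {X_cand[np.argmax(ei)][0]:.2f}")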
Key: model-based and exploitation-exploration balance
• Model-based reinforcement learning:
AlphaGo (Nature, Jan 2016), AlphaGo Zero (Nature, Oct 2017),
AlphaZero (Science, Dec 2018), MuZero (arXiv, Nov 2019)
• AutoML (use ML for tuning ML):
algorithm configuration, hyperparameter optimization (HPO),
neural architecture search (NAS), meta learning / learning to learn
(e.g. Amazon SageMaker)
Rationalize & Accelerate Chemical Design and Discovery
Facts from experiments and calculations: in-house data + public data + knowledge bases
(and their quality control & annotations)
Hypothesis generation (machine learning, data mining) ⇄ Validation (experiments and/or simulations)
• Planning what to test in the next experiments or simulations
• Surrogates for expensive or time-consuming experiments or simulations (see the sketch below)
• Optimizing uncertain factors or conditions
• Multilevel information fusion
• Highly reproducible experiments with high accuracy and speed
• Acceleration with ML-based surrogates for time-consuming subproblems
• Simulating many 'what-if' situations
Key: effective use of data with the help of data-driven techniques
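A minimal sketch of the surrogate idea above (descriptors, data, and the model choice are illustrative placeholders): train on a small set of already-computed results, then screen a large candidate library almost instantly.

```python
# Minimal sketch: an ML surrogate replacing a time-consuming calculation for screening.
# All descriptors and targets below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_simulated = rng.uniform(size=(300, 8))         # descriptors of already-simulated candidates
y_simulated = X_simulated @ rng.normal(size=8)   # stand-in for expensive simulation outputs

surrogate = RandomForestRegressor(n_estimators=300, random_state=0)
surrogate.fit(X_simulated, y_simulated)

X_library = rng.uniform(size=(100_000, 8))       # large untested candidate library
scores = surrogate.predict(X_library)            # near-instant predictions
top = np.argsort(scores)[::-1][:10]              # shortlist for real experiments/simulations
print(top, scores[top])
```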
This trend emerged first in the life sciences (drug discovery),
next expanded to materials science, and now also to chemistry!
Manufacturing already runs highly reproducible, large-scale production lines with little human intervention; automation, monitoring with IoT, and big-data management are also key there.
This focus has now shifted to the R & D phases,
which have traditionally been very experimental and empirical.
Summary
The current "end-to-end" or "fully data-driven" strategy of ML is too data-hungry, but in many cases we cannot have enough data due to various practical restrictions (cost, time, ethics, privacy, etc.).
• Model-based learning, neuro-symbolic or ML-GOFAI integration:
we can partly use explicit models for well-understood parts, for sample efficiency and for filling the gap between correlation and causation.
• Need for modeling the unverbalizable common sense of domain experts:
we need a good strategy for building a gigantic data collection for this, as well as self-supervised learning and/or meta-learning algorithms.
"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
• Need for novel techniques for compositionality (combinatorial generalization), out-of-distribution prediction (extrapolation), and their flexible transfer:
we need to somehow combine and flexibly transfer partial knowledge to generate new things or deal with completely new situations.
Reviewer: 1

I don't usually recommend that papers should be accepted "as is", but in this case I don't see
the need for changes. This review should be accepted and published in ACS Catalysis. ... I will
certainly recommend it to my group and my students when it is published. 

Reviewer: 2

The manuscript gives an excellent overview of the field of machine learning especially with regard to
heterogeneous catalysis and I would highly recommend the article for the publication in ACS
Catalysis.

Reviewer: 3

This is one of the best reviews for catalyst informatics that the Reviewer has read. In particular,
the chapter 2 delivers a very good tutorial, which is concisely and professionally written.
Chapter 2 is the general user's guide of ML for natural sciences.
ACS Catalysis, 2020; 10: 2260-2297.

The interplay between data-driven and theory-driven methods for chemical sciences

  • 1. The interplay between data-driven and 
 theory-driven methods for chemical sciences Ichigaku Takigawa • Medical-risk Avoidance based on iPS Cells Team,
 RIKEN Center for Advanced Intelligence Project (AIP) • Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University ichigaku.takigawa@riken.jp
  • 2. Research Interests: Ichigaku Takigawa • Machine learning and data mining technologies • Data-intensive approaches to natural sciences 10 years Hokkaido University (Computer Sciences) • Grad School. Engineering 7 years Kyoto University (Bioinformatics, Chemoinformatics) • Bioinformatics Center, Inst. Chemical Research • Grad School. Pharmaceutical Sciences (1995-2004) (2005-2011) 7 years Hokkaido University (Computer Sciences) • Grad School. Information Science and Technology • JST PRESTO on Materials Informatics (2015-2018) (2012-2018) ? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology) • Center for AI Project(2019-) Hokkaido University (Machine Learning + Chemistry) • Inst. Chemical Reaction Design and Discovery From AAAI-20 Oxford-Style Debate Kyoto Hokkaido Sapporo
  • 3. Research Interests: Ichigaku Takigawa • Machine learning and data mining technologies • Data-intensive approaches to natural sciences 10 years Hokkaido University (Computer Sciences) • Grad School. Engineering 7 years Kyoto University (Bioinformatics, Chemoinformatics) • Bioinformatics Center, Inst. Chemical Research • Grad School. Pharmaceutical Sciences (1995-2004) (2005-2011) 7 years Hokkaido University (Computer Sciences) • Grad School. Information Science and Technology • JST PRESTO on Materials Informatics (2015-2018) (2012-2018) ? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology) • Center for AI Project(2019-) Hokkaido University (Machine Learning + Chemistry) • Inst. Chemical Reaction Design and Discovery From AAAI-20 Oxford-Style Debate Kyoto Hokkaido Sapporo From AAAI-20 Oxford-Style Debate
  • 4. Research Interests: Ichigaku Takigawa • Machine learning and data mining technologies • Data-intensive approaches to natural sciences 10 years Hokkaido University (Computer Sciences) • Grad School. Engineering 7 years Kyoto University (Bioinformatics, Chemoinformatics) • Bioinformatics Center, Inst. Chemical Research • Grad School. Pharmaceutical Sciences (1995-2004) (2005-2011) 7 years Hokkaido University (Computer Sciences) • Grad School. Information Science and Technology • JST PRESTO on Materials Informatics (2015-2018) (2012-2018) ? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology) • Center for AI Project(2019-) Hokkaido University (Machine Learning + Chemistry) • Inst. Chemical Reaction Design and Discovery From AAAI-20 Oxford-Style Debate Kyoto Hokkaido Sapporo
  • 5. Machine Learning with "Discrete Structures" Q. How we can make use of "discrete structures" for prediction...? sets, sequences, branchings or hiearchies (trees), networks (graphs), relations, logic, rules, combinations, permutations,, point clouds, algebra, langages, ... H H H H H H H H O N O O H H H O O H H N O O Cl ClCl • The target variables can come with "discrete structures" • The relationship between variables can have "discrete structures" • The ML models themselves can involve "discrete structures"
  • 6. Past work: Data-intensive approaches to life science • Transcriptional regulation of metabolism
 Bioinformatics 2007, 2008a, 2008b, 2009, 2010
 Nucleic Acids Res 2011, PLoS One 2012, 2013, KDD'07 • Transcription regulation by mediator complex
 Nat Commun 2015, Nat Commun 2020 • Repetitive sequences in genomes
 Discrete Appl Math 2013, 2016, AAAI 2020 • Polypharmacology of drug-target interaction networks
 PLoS One 2011, Drug Discov Today 2013, Brief Bioinform 2014, BMC Bioinformatics 2020 • Copy number variations in neurological disorder (MSA, MS)
 Mol Brain 2017 • Substrate analysis of modulator protease (Calpain family)
 Mol Cell Proteom 2016, Genome Informatics 2009, http://calpain.org • Cell competition in cancer
 Cell Reports 2018, Sci Rep 2015 • Phenotype patterns of somatic mutations in breast cancer
 Brief Bioinform 2014
  • 7. Recent work: Data-intensive approaches to chemistry Machine learning for heterogeneous catalysis Haber–Bosch Process Ferrous Metal Catalysis NOx CO HC N2 CO2 H2O Exhaust Gas Harmless gas Noble Metal Catalysis (Pt, Pd, Rh…) • Ethane • Ethylene • Methanol • : Methane Various Metallic Catalysts
 (Li, rare earthes, alkaline earths) (industrial synthesis of ammonia) Exhaust Gas Purification Conversion of Methane “Fertilizer from Air” artificial nitrogen fixation • ACS Catalysis 2020 (review) • ChemCatChem 2019 (front cover) • J Phys Chem C 2018a, 2018b, 2019 (cover) • RSC Advances 2016 (highlighted in Chemical World) catalysis where the phase of the catalyst differs from the phase of the reactants or products. Reactants in the gas phase Catalysts in
 the solid phase
  • 8. Effective use of data is another key in natural sciences REVIEW Inverse molecular design using machine learning: Generative models for matter engineering Benjamin Sanchez-Lengeling1 and Alán Aspuru-Guzik2,3,4 * The discovery of new materials can bring enormous societal and technological progress. In this context, exploring completely the large space of potential materials is computationally intractable. Here, we review methods for achieving inverse design, which aims to discover tailored materials from the starting point of a particular desired functionality. Recent advances from the rapidly growing field of artificial intelligence, mostly from the subfield of machine learning, have resulted in a fertile exchange of ideas, where approaches to inverse molecular design are being proposed and employed at a rapid pace. Among these, deep generative models have been applied to numerous classes of materials: rational design of prospective drugs, synthetic routes to organic compounds, and optimization of photovoltaics and redox flow batteries, as well as a variety of other solid-state materials. M any of the challenges of the 21st century (1), from personalized health care to energy production and storage, share a common theme: materials are part of the solution (2). In some cases, the solu- mapped 166.4 billion molecules that contain at most 17 heavy atoms. For pharmacologically rele- vant small molecules, the number of structures is estimated to be on the order of 1060 (9). Adding consideration of the hierarchy of scale from sub- act properties. In practice, approximations are used to lower computational time at the cost of accuracy. Although theory enjoys enormous progress, now routinely modeling molecules, clusters, and perfect as well as defect-laden periodic solids, the size of chemical space is still overwhelming, and smart navigation is required. For this purpose, machine learning (ML), deep learning (DL), and artificial intelligence (AI) have a potential role to play because their computational strategies automatically improve through experience (11). In the context of materials, ML techniques are often used for property prediction, seeking to learn a function that maps a molecular material to the property of choice. Deep generative models are a special class of DL methods that seek to model the underlying probability distribution of both structure and property and relate them in a nonlinear way. By exploiting patterns in massive datasets, these models can distill average and salient features that characterize molecules (12, 13). Inverse design is a component of a more complex materials discovery process. The time scale for deployment of new technologies, from discovery in a laboratory to a commercial pro- duct, historically, is 15 to 20 years (14). The pro- cess (Fig. 1) conventionally involves the following steps: (i) generate a new or improved material FRONTIERS IN COMPUTATION httpDownloadedfrom REVIEW https://doi.org/10.1038/s41586-018-0337-2 Machine learning for molecular and materials science Keith T. Butler1 , Daniel W. Davies2 , Hugh Cartwright3 , Olexandr Isayev4 * & Aron Walsh5,6 * Here we summarize recent progress in machine learning for the chemical sciences. We outline machine-learning techniques that are suitable for addressing research questions in this domain, as well as future directions for the field. We envisage a future in which the design, synthesis, characterization and application of molecules and materials is accelerated by artificial intelligence. 
T he Schrödinger equation provides a powerful structure– property relationship for molecules and materials. For a given spatial arrangement of chemical elements, the distribution of electrons and a wide range of physical responses can be described. The development of quantum mechanics provided a rigorous theoretical foundationforthechemicalbond.In1929,PaulDiracfamouslyproclaimed that the underlying physical laws for the whole of chemistry are “completely known”1 . John Pople, realizing the importance of rapidly developing generating, testing and refining scientific models. Such techniques are suitable for addressing complex problems that involve massive combi- natorial spaces or nonlinear processes, which conventional procedures either cannot solve or can tackle only at great computational cost. As the machinery for artificial intelligence and machine learning matures, important advances are being made not only by those in main- stream artificial-intelligence research, but also by experts in other fields (domain experts) who adopt these approaches for their own purposes. As DNA to be sequences into distinct pieces, parcel out the detailed work of sequencing, and then reassemble these independent ef- forts at the end. It is not quite so simple in the world of genome semantics. Despite the differences between genome se- quencing and genetic network discovery, there are clear parallels that are illustrated in Table 1. In genome sequencing, a physical map is useful to provide scaffolding for assembling the fin- ished sequence. In the case of a genetic regula- tory network, a graphical model can play the same role. A graphical model can represent a high-level view of interconnectivity and help isolate modules that can be studied indepen- dently. Like contigs in a genomic sequencing project, low-level functional models can ex- plore the detailed behavior of a module of genes in a manner that is consistent with the higher level graphical model of the system. With stan- dardized nomenclature and compatible model- ing techniques, independent functional models can be assembled into a complete model of the cell under study. To enable this process, there will need to be standardized forms for model representa- tion. At present, there are many different modeling technologies in use, and although models can be easily placed into a database, they are not useful out of the context of their specific modeling package. The need for a standardized way of communicating compu- tational descriptions of biological systems ex- tends to the literature. Entire conferences have been established to explore ways of mining the biology literature to extract se- mantic information in computational form. Going forward, as a community we need to come to consensus on how to represent what we know about biology in computa- tional form as well as in words. The key to postgenomic biology will be the computa- tional assembly of our collective knowl- edge into a cohesive picture of cellular and organism function. With such a comprehen- sive model, we will be able to explore new types of conservation between organisms and make great strides toward new thera- peutics that function on well-characterized pathways. References 1. S. K. Kim et al., Science 293 , 2087 (2001). 2. A. Hartemink et al., paper presented at the Pacific Symposium on Biocomputing 2000, Oahu, Hawaii, 4 to 9 January 2000. 3. D. 
Pe’er et al., paper presented at the 9th Conference on Intelligent Systems in Molecular Biology (ISMB), Copenhagen, Denmark, 21 to 25 July 2001. 4. H. McAdams, A. Arkin, Proc. Natl. Acad. Sci. U.S.A. 94 , 814 ( 1997 ). 5. A. J. Hartemink, thesis, Massachusetts Institute of Technology, Cambridge (2001). V I E W P O I N T Machine Learning for Science: State of the Art and Future Prospects Eric Mjolsness* and Dennis DeCoste Recent advances in machine learning methods, along with successful applications across a wide variety of fields such as planetary science and bioinformatics, promise powerful new tools for practicing scientists. This viewpoint highlights some useful characteristics of modern machine learn- ing methods and their relevance to scientific applications. We conclude with some speculations on near-term progress and promising directions. Machine learning (ML) (1) is the study of computer algorithms capable of learning to im- prove their performance of a task on the basis of their own previous experience. The field is closely related to pattern recognition and statis- tical inference. As an engineering field, ML has become steadily more mathematical and more successful in applications over the past 20 years. Learning approaches such as data clus- tering, neural network classifiers, and nonlinear regression have found surprisingly wide appli- cation in the practice of engineering, business, and science. A generalized version of the stan- dard Hidden Markov Models of ML practice have been used for ab initio prediction of gene structures in genomic DNA (2). The predictions correlate surprisingly well with subsequent gene expression analysis (3). Postgenomic bi- ology prominently features large-scale gene ex- pression data analyzed by clustering methods (4), a standard topic in unsupervised learning. Many other examples can be given of learning and pattern recognition applications in science. Where will this trend lead? We believe it will lead to appropriate, partial automation of every element of scientific method, from hypothesis generation to model construction to decisive experimentation. Thus, ML has the potential to amplify every aspect of a working scientist’s progress to understanding. It will also, for better or worse, endow intelligent computer systems with some of the general analytic power of scientific thinking. creating hypotheses, testing by decisive exper- iment or observation, and iteratively building up comprehensive testable models or theories is shared across disciplines. For each stage of this abstracted scientific process, there are relevant developments in ML, statistical inference, and pattern recognition that will lead to semiauto- matic support tools of unknown but potentially broad applicability. Increasingly, the early elements of scientific method—observation and hypothesis genera- tion—face high data volumes, high data acqui- sition rates, or requirements for objective anal- ysis that cannot be handled by human percep- tion alone. This has been the situation in exper- imental particle physics for decades. There automatic pattern recognition for significant events is well developed, including Hough transforms, which are foundational in pattern recognition. A recent example is event analysis for Cherenkov detectors (8) used in neutrino oscillation experiments. Microscope imagery in cell biology, pathology, petrology, and other fields has led to image-processing specialties. 
So has remote sensing from Earth-observingMachine Learning Systems Group, Jet Propulsion Lab- Table 1. Parallels between genome sequencing and genetic network discovery. Genome sequencing Genome semantics Physical maps Graphical model Contigs Low-level functional models Contig reassembly Module assembly Finished genome sequence Comprehensive model C O M P U T E R S A N D S C I E N C E onAugust29,2018http://science.sciencemag.org/Downloadedfrom Nature, 559
 pp. 547–555 (2018) Science, 293 pp. 2051-2055 (2001) Science, 361 pp. 360-365 (2018) Science is changing, the tools of science are changing. And that requires different approaches. ── Erich Bloch, 1925-2016 (alongside experiments and simulations) And please keep in mind that unplanned data collection is too risky. We need right designs for data collection and right tools to analyze. A bitter lesson: "low input, high throughput, no output science." (Sydney Brenner)
  • 9. Today's AI has stark limitations, facing big problems • Deep learning techniques thus far have proven to be data hungry, shallow, brittle, and limited in their ability to generalize (Marcus, 2018) • Current machine learning techniques are data-hungry and brittle—they can only make sense of patterns they've seen before. (Chollet, 2020) • A growing body of evidence shows that state-of-the-art models learn to exploit spurious statistical patterns in datasets... instead of learning meaning in the flexible and generalizable way that humans do. (Nie et al., 2019) • Current machine learning methods seem weak when they are required to generalize beyond the training distribution, which is what is often needed in practice. (Bengio et al., 2019) Though it's extremely powerful and useful in suitable applications!
  • 10. The AI/ML community is now struggling with these... • The Winograd Schema Challenge (Levesque et al., 2011) A collection of 273 multiple-choice problems that requires commonsense reasoning to solve as an alternative to Turing test (Turing, 1950). Many methods passed this test of "Imitation Games" were actually tricks being fairly adept at fooling humans... Unexpectedly, it turned out that several tricks using BERT/Transformer can crack many of these (with >90% accuracy) ... • WinoGrande (Sakaguchi et al., 2020) An improved collection of 44k very hard problems
 (AAAI-20 Outstanding Paper Award) 
 Though some top-ranking methods can have 70-80% accuracy ... • Abstraction and Reasoning Challenge at Kaggle (Chollet, 2020) "Kaggle’s most important AI competition" "Impossible AI Challenge" At the moment, to what extent we can create an AI capable of solving reasoning tasks it has never seen before?
  • 11. Examples WSC-273 WinoGrande-1.1 (Q0) The town councillors refused to give the angry demonstrators a permit because they feared violence. Who feared violence? • the town councillors • the angry demonstrators (Q1) The trophy would not fit in the brown suitcase because it was so small. What was so small? • the trophy • the brown suitcase (due to Terry Winograd) (Q) He never comes to my home, but I always go to his house because the ______ is smaller. 1. home 2. house
  • 12. Take home message The current "end-to-end" or "fully data-driven" strategy of ML is too data-hungry. But in many cases, we cannot have enough data for various practical restrictions (cost, time, ethics, privacy, etc).
  • 13. Take home message The current "end-to-end" or "fully data-driven" strategy of ML is too data-hungry. But in many cases, we cannot have enough data for various practical restrictions (cost, time, ethics, privacy, etc). • Model-based learning, Neuro-symbolic or ML-GOFAI integration.
 We can partly use explicit models for well-understood parts for sample efficiency and for filling the gap between correlation and causation.
  • 14. Take home message The current "end-to-end" or "fully data-driven" strategy of ML is too data-hungry. But in many cases, we cannot have enough data for various practical restrictions (cost, time, ethics, privacy, etc). • Model-based learning, Neuro-symbolic or ML-GOFAI integration.
 We can partly use explicit models for well-understood parts for sample efficiency and for filling the gap between correlation and causation. • Needs for modeling unverbalizable comon sense of domain experts
 We need a good strategy for building a gigantic data collection for this,
 as well as self-supervised learning and/or meta learning algorithms.
 "Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
  • 15. Take home message The current "end-to-end" or "fully data-driven" strategy of ML is too data-hungry. But in many cases, we cannot have enough data for various practical restrictions (cost, time, ethics, privacy, etc). • Model-based learning, Neuro-symbolic or ML-GOFAI integration.
 We can partly use explicit models for well-understood parts for sample efficiency and for filling the gap between correlation and causation. • Needs for modeling unverbalizable comon sense of domain experts
 We need a good strategy for building a gigantic data collection for this,
 as well as self-supervised learning and/or meta learning algorithms.
 "Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun) • Needs for novel techniques for compositionality (combinatorial generalization), out-of-distribution prediction (extrapolation), and
 their flexible transfer
 We need to somehow combine and flexibly transfer partial knowledge to generate a new thing or deal with completely new situations.
  • 16. Take home message The current "end-to-end" or "fully data-driven" strategy of ML is too data-hungry. But in many cases, we cannot have enough data for various practical restrictions (cost, time, ethics, privacy, etc). • Model-based learning, Neuro-symbolic or ML-GOFAI integration.
 We can partly use explicit models for well-understood parts for sample efficiency and for filling the gap between correlation and causation. • Needs for modeling unverbalizable comon sense of domain experts
 We need a good strategy for building a gigantic data collection for this,
 as well as self-supervised learning and/or meta learning algorithms.
 "Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun) • Needs for novel techniques for compositionality (combinatorial generalization), out-of-distribution prediction (extrapolation), and
 their flexible transfer
 We need to somehow combine and flexibly transfer partial knowledge to generate a new thing or deal with completely new situations.
  • 17. They are trantitions from a valley to a hill to another valley, but solving a quantum
 chemical equation is needed at every point to get the energy surface Chemistry has the first principle at the electron level θ1 θ2 Schrödinger equation Potential Energy Surface (PES) Outcomes we see are just
 discrete and combinatorialstable state 1 transition state stable state 2 Energy Every substance is a combination of only 118 elements in the periodic table. Our end goal: take full control over chemical reactions,
 the basis for transforming substances to another (energy, materials, foods, colors, ...) Recombinations of atoms and chemical bonds are subjected to the laws of nature
 • 18. Our core technology for automated reaction-path search
• ADDF: Ohno & Maeda, Chem Phys Lett, 2004
• AFIR: Maeda & Morokuma, J Chem Phys, 2010
 • 20. But chemistry still remains quite empirical. Why?
Even though current chemical calculations are fairly accurate, chemists still heavily rely on "Edisonian empiricism (trial & error)":
1. empirical knowledge (from the literature) and its flexible transfer
2. intuitions and experiences (unverbalizable common sense of experts)
Takeaway: We also need "data-driven" bridges! First principles are not enough for us to throw away these empirical things; data-driven approaches (ML) play a complementary role.
Fusing multi-disciplinary methods...?
1. Theory-driven: (slow but) high-fidelity quantum-chemical simulations
2. Knowledge-driven: explicit knowns and inference
3. Data-driven: empirical prediction grounded upon multifaceted data
 • 21. Indeed, computational chemistry has many limitations...
Chemical reactions = recombinations of atoms and chemical bonds, subject to the laws of nature
• Intractably large chemical space: an intractably large number of "theoretically possible" candidates for reactions and compounds...
• Computing time and scalability: simulating an Avogadro-constant number of atoms is utterly infeasible (we need some compromise here)...
• Complexity and uncertainty of real-world systems: many uncertain factors and arbitrary parameters are involved...
• Known and unknown imperfections of currently established theories: current theoretical calculations have many exceptions and limitations...
 • 22. Yet another approach: Data-driven
[Diagram: related factors (and their states) [Inputs] → some mechanism (cause-and-effect relationship; governing equation?) → reactions / outcome [Outputs]]
Theory-driven methods try to explicitly model the inner workings of a target phenomenon (e.g. through first-principles simulations).
Data-driven methods try to precisely approximate its outer behavior (the input-output relationship) observable as "data" (e.g. through machine learning from a large collection of data).
They rest on very different principles and are quite complementary!
 • 23. Machine Learning (ML)
[Examples: generic object recognition, speech recognition, machine translation ("J'aime la musique" → "I love music"), QSAR/QSPR prediction (structure → growth inhibition 0.739), AI game players]
A new style of programming: a technique to reproduce a transformation process (or function) whose underlying principle is unclear and hard to model explicitly, just by giving a lot of input-output examples.
 • 24. How ML works: fitting a function to data
An ML model is a function f(x; θ) from inputs x to outputs y, best fitted to a given set of example input-output pairs (the training data): (x1, y1), (x2, y2), ..., (xn, yn).
[Figure: a curve fitted to the training data, contrasting interpolative and extrapolative prediction; model complexity from low to high, with underfitting (high bias, low variance) at one end and overfitting (low bias, high variance) at the other: "the bias-variance tradeoff"]
 • 25. How ML works: fitting a function to data (continued)
e.g. Deep learning usually comes with overparametrization, and the number of parameters can be several tens of millions or even billions. The input space can also be ultra high-dimensional in modern ML. This is why model-complexity control, or regularization, is so critical and essential in ML.
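To make the bias-variance picture above concrete, here is a minimal illustrative sketch (not from the slides): polynomials of increasing degree are fitted to the same noisy data, and the high-degree fit drives training error down while its extrapolated predictions outside the training range become wild.

```python
# Illustrative sketch: fitting functions of different complexity to the same
# training data to see the bias-variance tradeoff and the danger of extrapolation.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-3, 3, 20))
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.size)  # noisy "data"

for degree in (1, 3, 15):  # underfit, reasonable, overfit
    coeffs = np.polyfit(x_train, y_train, degree)      # fit f(x; theta)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    # x = 5 lies outside the training range [-3, 3]: an extrapolative prediction
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  "
          f"prediction at x=5: {np.polyval(coeffs, 5):+.2f}")
```

The degree-15 model fits the 20 training points almost perfectly yet returns an extreme value outside the training range, which is exactly the extrapolation risk discussed later in the deck.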
 • 26. Multilevel representations of chemical reactions
Reactants + reagents → products, e.g. (in SMILES) Brc1cncc(Br)c1 + C[O-] with CN(C)C=O and Na+ → COc1cncc(Br)c1
Levels of representation: SMILES, structural formula, steric structures, electronic states
As pattern languages (e.g. known facts in textbooks/databases) vs. as physical entities (e.g. quantum chemical calculations)
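As a small illustration of the "pattern language" level, the SMILES strings above can be parsed and manipulated with a cheminformatics toolkit. The sketch below assumes RDKit is installed and is not part of the original slides.

```python
# Illustrative sketch: handling the SMILES-level ("pattern language") representation.
# Requires RDKit (https://www.rdkit.org); the molecules are taken from the slide.
from rdkit import Chem

reactant = Chem.MolFromSmiles("Brc1cncc(Br)c1")   # dibromopyridine reactant
reagent  = Chem.MolFromSmiles("C[O-]")            # methoxide reagent
product  = Chem.MolFromSmiles("COc1cncc(Br)c1")   # one bromide replaced by methoxy

for name, mol in [("reactant", reactant), ("reagent", reagent), ("product", product)]:
    print(name, Chem.MolToSmiles(mol), "heavy atoms:", mol.GetNumAtoms())
```

Moving from this symbolic level to steric structures or electronic states requires 3D embedding and quantum-chemical calculation, which is where the theory-driven methods discussed next come in.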
 • 27. Computer-assisted synthetic planning
A traditional chemoinformatics topic since the 70s: collect known chemical reactions and search for reaction paths, e.g. representation, classification, and exploration of chemical reactions.
[Slide shows the first pages of two papers:]
• Corey, Cramer & Howe, "Computer-Assisted Synthetic Analysis for Complex Molecules. Methods and Procedures for Machine Generation of Synthetic Intermediates", J Am Chem Soc, 94(2), 440, 1972.
• Ugi et al., "Computer-Assisted Solution of Chemical Problems: The Historical Development and the Present State of the Art of a New Discipline of Chemistry", Angew Chem Int Ed Engl, 32, 202-227, 1993.
  • 28. Knowledge-based approaches Widely-used proprietary systems
 (template-based path search and expansions) Now towards fusing with Machine Learning...?
 • 29. Purely ML-based approaches and more
ML + first-principles simulations:
• Fermionic Neural Network: Pfau+, "Ab-Initio Solution of the Many-Electron Schrödinger Equation with Deep Neural Networks", arXiv:1909.02487, Sep 2019
• Hamiltonian Graph Networks with ODE Integrators: Sanchez-Gonzalez+, arXiv:1909.12790, Sep 2019
ML-based chemical reaction prediction (graph neural networks, sequence neural networks, combined or other):
• 3N-MCTS/AlphaChem: Segler+, Nature 2018
• Molecular Transformer: Schwaller+, ACS Cent Sci 2019
• seq2seq: Liu+, ACS Cent Sci 2017
• WLDN: Jin+, NeurIPS 2017
• ELECTRO: Bradshaw+, ICLR 2019
• WLN: Coley+, Chem Sci 2019
• GTPN: Do+, KDD 2019
• IBM RXN: Schwaller+, Chem Sci 2018
• Molecule Chef: Bradshaw+, DeepGenStruct (ICLR WS) 2019
• Neural-Symbolic ML: Segler+, Chemistry 2017
• Similarity-based: Coley+, ACS Cent Sci 2017
• GLN: Dai+, NeurIPS 2019
• Transformer: Karpov+, ICANN 2019
 • 30. Similar technologies are spreading to biology! (e.g. DeepMind's AlphaFold paper; a successful example from MIT MLPDS)
 • 31. How to integrate data-driven and theory-driven?
• Theory-driven: first-principles simulations, logical reasoning, mathematical models, etc.
• Data-driven: machine learning
Still an emerging topic, but examples of already-available approaches are:
1. Wrapper: use ML to control or plan simulations. Basically we solve the problem by simulations, but use ML models to ask "what if?" questions and guide which simulations to run. (Model-based optimization, sequential design of experiments, reinforcement learning, generative methods, etc.)
2. Hybrid: use ML as an approximator for the uncertain parts of simulations. Plug ML models into the uncertain part of a simulation, or call the simulation only when ML predictions are less confident; see the sketch below. (Data assimilation, domain adaptation, semi-empirical methods, etc.)
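A minimal sketch of the second (hybrid) pattern, assuming an ensemble-based ML surrogate whose per-member spread serves as a crude uncertainty estimate; `run_simulation` is a hypothetical placeholder for an expensive first-principles code, not an existing API.

```python
# Illustrative "hybrid" pattern: answer with the cheap surrogate when it is
# confident, and fall back to the expensive simulation when it is not.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def run_simulation(x):
    # stand-in for an expensive quantum-chemical calculation (hypothetical)
    return np.sin(3 * x[0]) + 0.5 * x[1]

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(50, 2))
y_train = np.array([run_simulation(x) for x in X_train])
surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

def predict_or_simulate(x, threshold=0.1):
    # spread of the per-tree predictions as a rough uncertainty measure
    per_tree = np.array([t.predict(x.reshape(1, -1))[0] for t in surrogate.estimators_])
    if per_tree.std() < threshold:
        return per_tree.mean(), "surrogate"
    return run_simulation(x), "simulation"

print(predict_or_simulate(np.array([0.5, 0.5])))
```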
 • 32. Back to our example: Heterogeneous Catalysis
Wolfgang Pauli: "God made the bulk; the surface was invented by the devil."
Notoriously complex surface reactions between different phases: gas phase (reactants) and solid phase (catalysts).
[Figure: surface processes such as adsorption, diffusion, desorption, dissociation, and recombination over surface features such as steps, kinks, terraces, adatoms, and vacancies]
Many hard-to-quantify, intertwined factors are involved; it is too complicated (impossible?) to model everything:
• multiple elementary reaction processes
• composition, support, surface termination, particle size, particle morphology, atomic coordination environment
• reaction conditions
 • 33. Our ML-based case studies
1. Can we predict the d-band center? Predicting DFT-calculated values by machine learning (Takigawa et al., RSC Advances, 2016)
2. Can we predict the adsorption energy? Predicting DFT-calculated values by machine learning (Toyao et al., JPCC, 2018)
3. Can we predict the catalytic activity? Predicting values from experiments reported in the literature by machine learning (Suzuki et al., ChemCatChem, 2019)
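To show what "predicting DFT-calculated values by machine learning" looks like in practice, here is a generic, hedged sketch: a gradient-boosting regressor trained on simple composition descriptors against a DFT-computed target. The descriptors and synthetic data are placeholders, not those used in the cited studies.

```python
# Generic sketch of regression against DFT-calculated targets (illustrative only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# X: one row per catalyst surface; columns = simple descriptors
#    (e.g. atomic number, electronegativity, coordination number of the site)
X = rng.normal(size=(300, 3))
# y: the DFT-calculated quantity to predict (e.g. d-band center in eV); synthetic here
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] ** 2 + 0.3 * X[:, 2] + rng.normal(0, 0.1, 300)

model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("cross-validated MAE:", -scores.mean())
```

Whether such a model is useful depends entirely on how the training data cover chemical space, which is exactly the selection-bias problem discussed on the next slide.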
 • 34. One of the big problems we had
• Problem: very strong "selection bias" in existing datasets. Catalyst research has relied heavily on previously published data, which is strongly biased toward catalyst compositions that were successful.
Example: oxidative coupling of methane (OCM)
• 1868 catalysts in the original dataset [Zavyalova+ 2011]
• Composed of 68 different elements: 61 cations and 7 anions (Cl, F, Br, B, S, C, and P), excluding oxygen
• Occurrences of only a few elements such as La, Ba, Sr, Cl, Mn, and F are very high
• Widely used elements such as Li, Mg, Na, Ca, and La are also frequent in the data
 • 36. An ML model is just representative of the training data
Highly inaccurate model predictions from extrapolation (Lohninger 1999): "Beware of the perils of extrapolation, and understand that ML algorithms build models that are representative of the available training samples."
Simply focusing on targets predicted as high-performing by an ML model built on currently available data is clearly not a good idea. It is like over-scrutinizing the so-so candidates we happened to stumble upon at the very early stage of research.
In reality, the distinction between interpolation and extrapolation is not that clear, because of the high dimensionality.
 • 37. No guarantee for data-driven prediction outside the given data
[Diagram: related factors (and their states) [Inputs] → some mechanism (cause-and-effect relationship; governing equation?) → reactions / outcome [Outputs]]
Data-driven methods try to precisely approximate the outer behavior (the input-output relationship) observable as "data" (e.g. through machine learning from a large collection of data).
Keep in mind: the given data DEFINES the data-driven prediction!
"Current machine learning techniques are data-hungry and brittle —they can only make sense of patterns they've seen before." (Chollet, 2020)
 • 42. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation: rely on what we already know and get something close to what we expect
• Exploration: probe something we aren't sure about and possibly learn more
Either through experiments or simulations, we face a dilemma in choosing between options to learn new things.
[Figure: ML prediction (function fitting) of the quantity we want to maximize, with its prediction variance (uncertainty); the model is very confident where we have data and not confident where there is no data; an acquisition function, e.g. "expected improvement", balances relying on available data against probing regions with no or scarce data, relative to the current max (best) so far]
A sketch of this loop follows below.
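A minimal sketch of this exploitation-exploration loop, assuming a Gaussian-process surrogate and the expected-improvement acquisition function; the objective `f` is a toy stand-in for an experiment or simulation, not anything from the case studies.

```python
# Minimal model-based optimization loop (illustrative): Gaussian-process surrogate
# plus expected improvement, maximizing a hidden 1D objective.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                                   # hidden "experiment" to maximize
    return -(x - 0.6) ** 2 + 0.1 * np.sin(20 * x)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 4).reshape(-1, 1)     # a few initial measurements
y = f(X).ravel()
candidates = np.linspace(0, 1, 500).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-4).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = candidates[np.argmax(ei)]                      # exploit/explore tradeoff
    X = np.vstack([X, [x_next]])
    y = np.append(y, f(x_next))

print("best x found:", X[np.argmax(y)].item(), "value:", y.max())
```

The acquisition step is where the balance lives: the (mu - best) term rewards exploiting regions the model already predicts to be good, while the sigma term rewards exploring regions where the model is uncertain.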
 • 43. Key: model-based methods and the exploitation-exploration balance
• Model-based reinforcement learning: AlphaGo (Nature, Jan 2016), AlphaGo Zero (Nature, Oct 2017), AlphaZero (Science, Dec 2018), MuZero (arXiv, Nov 2019)
• AutoML (use ML for tuning ML): algorithm configuration, hyperparameter optimization (HPO), neural architecture search (NAS), meta learning / learning to learn; e.g. Amazon SageMaker
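On the AutoML/HPO side, the same model-based exploit-explore idea is available in off-the-shelf tools. Below is a small sketch assuming the Optuna library (whose default sampler is model-based), with a toy objective standing in for a real training-and-validation run.

```python
# Hyperparameter optimization sketch (illustrative), assuming the Optuna library.
# The objective is a toy stand-in for training and validating a real model.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    depth = trial.suggest_int("max_depth", 2, 10)
    # pretend validation score as a function of the hyperparameters
    return -(abs(lr - 0.01) * 50 + abs(depth - 6))

study = optuna.create_study(direction="maximize")  # default TPE sampler is model-based
study.optimize(objective, n_trials=30)
print("best params:", study.best_params)
```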
 • 44. Rationalize & Accelerate Chemical Design and Discovery
A loop between hypothesis generation (machine learning, data mining) and validation (experiments and/or simulations), grounded in facts from experiments and calculations: in-house data + public data + knowledge bases (and their quality control & annotations).
• Planning what to test in the next experiments or simulations
• Surrogates for expensive or time-consuming experiments or simulations
• Optimizing uncertain factors or conditions
• Multilevel information fusion
• Highly reproducible experiments with high accuracy and speed
• Acceleration with ML-based surrogates for time-consuming subproblems
• Simulating many "what-if" situations
Key: effective use of data, with the help of data-driven techniques
  • 45. This trend emerged first in life sciences (drug discovery)
 • 46. Next expanded to materials science
Little human intervention for highly reproducible large-scale production lines. Automation, monitoring with IoT, and big-data management are also key to manufacturing. Now this focus has shifted to the R&D phase (traditionally very experimental and empirical).
  • 48. ... also to chemistry!
 • 49. Summary
The current "end-to-end" or "fully data-driven" strategy of ML is too data-hungry, but in many cases we cannot get enough data because of various practical restrictions (cost, time, ethics, privacy, etc.).
• Model-based learning, neuro-symbolic or ML-GOFAI integration. We can partly use explicit models for well-understood parts, both for sample efficiency and for filling the gap between correlation and causation.
• Need for modeling the unverbalizable common sense of domain experts. We need a good strategy for building a gigantic data collection for this, as well as self-supervised learning and/or meta-learning algorithms. "Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
• Need for novel techniques for compositionality (combinatorial generalization), out-of-distribution prediction (extrapolation), and their flexible transfer. We need to somehow combine and flexibly transfer partial knowledge to generate new things or deal with completely new situations.
  • 50. Reviewer: 1
 I don't usually recommend that papers should be accepted "as is", but in this case I don't see the need for changes. This review should be accepted and published in ACS Catalysis. ... I will certainly recommend it to my group and my students when it is published. 
 Reviewer: 2
The manuscript gives an excellent overview of the field of machine learning, especially with regard to heterogeneous catalysis, and I would highly recommend the article for publication in ACS Catalysis.
 Reviewer: 3
 This is one of the best reviews for catalyst informatics that the Reviewer has read. In particular, the chapter 2 delivers a very good tutorial, which is concisely and professionally written. Chapter 2 is the general user's guide of ML for natural sciences. ACS Catalysis, 2020; 10: 2260-2297.