The 1st International Symposium on Human InformatiX
X-Dimensional Human Informatics and Biology
ATR, Kyoto, February 27-28, 2020
https://human-informatix.atr.jp
Hokkaido University (HU) - Seoul National University (SNU) Joint Symposium
2018 International Workshop on
New Frontiers in Convergence Science and Technology
How to use data to design and optimize reaction? A quick introduction to work...Ichigaku Takigawa
(Journal Club) ICReDD Seminar, Apr 27 2020
Institute for Chemical Reaction Design and Discovery (ICReDD)
Hokkaido University
Sapporo, JAPAN
https://www.icredd.hokudai.ac.jp
Friday, October 15th, 2021, Sapporo, Hokkaido, Japan.
Hokkaido University ICReDD - Faculty of Medicine Joint Symposium
https://www.icredd.hokudai.ac.jp/event/5993
ICReDD (Institute for Chemical Reaction Design and Discovery)
https://www.icredd.hokudai.ac.jp
Machine Learning for Chemistry: Representing and InterveningIchigaku Takigawa
Joint Symposium of Engineering & Information Science & WPI-ICReDD in Hokkaido University
Apr. 26 (Mon), 2021
https://www.icredd.hokudai.ac.jp/event/5430
Machine Learning for Molecules: Lessons and Challenges of Data-Centric ChemistryIchigaku Takigawa
Perspectives on Artificial Intelligence and Machine Learning in Materials Science
February 4, 2022. – February 6, 2022.
https://joint.imi.kyushu-u.ac.jp/post-2698/
Hokkaido University (HU) - Seoul National University (SNU) Joint Symposium
2018 International Workshop on
New Frontiers in Convergence Science and Technology
How to use data to design and optimize reaction? A quick introduction to work...Ichigaku Takigawa
(Journal Club) ICReDD Seminar, Apr 27 2020
Institute for Chemical Reaction Design and Discovery (ICReDD)
Hokkaido University
Sapporo, JAPAN
https://www.icredd.hokudai.ac.jp
Friday, October 15th, 2021, Sapporo, Hokkaido, Japan.
Hokkaido University ICReDD - Faculty of Medicine Joint Symposium
https://www.icredd.hokudai.ac.jp/event/5993
ICReDD (Institute for Chemical Reaction Design and Discovery)
https://www.icredd.hokudai.ac.jp
Machine Learning for Chemistry: Representing and InterveningIchigaku Takigawa
Joint Symposium of Engineering & Information Science & WPI-ICReDD in Hokkaido University
Apr. 26 (Mon), 2021
https://www.icredd.hokudai.ac.jp/event/5430
Machine Learning for Molecules: Lessons and Challenges of Data-Centric ChemistryIchigaku Takigawa
Perspectives on Artificial Intelligence and Machine Learning in Materials Science
February 4, 2022. – February 6, 2022.
https://joint.imi.kyushu-u.ac.jp/post-2698/
Machine Learning and Model-Based Optimization for Heterogeneous Catalyst Desi...Ichigaku Takigawa
2nd ICReDD International Symposium—Toward Interdisciplinary Research Guided by Theory and Calculation
Nov. 27 (wed) - Nov. 29 (fri), 2019
https://www.icredd.hokudai.ac.jp/event/1229
Machine Learning for Molecular Graph Representations and GeometriesIchigaku Takigawa
Dec 1, 2021, Pacifico Yokohama, Japan.
Symposium 1AS-17 "Data science and machine learning: Tackling the Noise and Heterogeneity of the Real World"
The 44th Annual Meetingn of the Molecular Biology Society of Japan
https://www2.aeplan.co.jp/mbsj2021/english/designation/index.html
Computational methods to analyze biological data. It is a way to introduce some of the many resources available for analyzing sequence data with bioinformatics software. This paper will cover the theoretical approaches to data resources and we will get knowledge about some sequential alignments with its databases. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics, and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. Databases are essential for bioinformatics research and applications. Many databases exist, covering various information types for example, DNA and protein sequences, molecular structures, phenotypes, and biodiversity. Databases may contain empirical data. Conceptualizing biology in terms of molecules and then applying informatics techniques from math, computer science, and statistics to understand and organize the information associated with these molecules on a large scale. In this materialistic world, People are studying bioinformatics in different ways. Some people are devoted to developing new computational tools, both from software and hardware viewpoints, for the better handling and processing of biological data. They develop new models and new algorithms for existing questions and propose and tackle new questions when new experimental techniques bring in new data. Other people take the study of bioinformatics as the study of biology with the viewpoint of informatics and systems. Durgesh Raghuvanshi | Vivek Solanki | Neha Arora | Faiz Hashmi "Computational of Bioinformatics" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-4 , June 2020, URL: https://www.ijtsrd.com/papers/ijtsrd30891.pdf Paper Url :https://www.ijtsrd.com/engineering/computer-engineering/30891/computational-of-bioinformatics/durgesh-raghuvanshi
Over the past decade, unprecedented progress in the development of neural networks influenced dozens of different industries, including weed recognition in the agro-industrial sector. The use of neural networks in agro-industrial activity in the task of recognizing cultivated crops is a new direction. The absence of any standards significantly complicates the understanding of the real situation of the use of the neural network in the agricultural sector. The manuscript presents the complete analysis of researches over the past 10 years on the use of neural networks for the classification and tracking of weeds due to neural networks. In particular, the analysis of the results of using various neural network algorithms for the task of classification and tracking was presented. As a result, we presented the recommendation for the use of neural networks in the tasks of recognizing a cultivated object and weeds. Using this standard can significantly improve the quality of research on this topic and simplify the analysis and understanding of any paper.
Analysis of Existing Models in Relation to the Problems of Mass Exchange betw...YogeshIJTSRD
The main recommendations of this article mainly analyzing the rate of harmful elements the period of exploitation of the automobile implements and its services to develop activity of automobile implements of the exploitation period. Shavkat Giyazov "Analysis of Existing Models in Relation to the Problems of Mass Exchange between Autotransport Complex and the Environment" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-3 , April 2021, URL: https://www.ijtsrd.com/papers/ijtsrd38681.pdf Paper URL: https://www.ijtsrd.com/engineering/automotive-engineering/38681/analysis-of-existing-models-in-relation-to-the-problems-of-mass-exchange-between-autotransport-complex-and-the-environment/shavkat-giyazov
Listing of Intellectual work of patanjali kashyap , mainly contains , name, details and reference of papers , presentations , patents , public speaking
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyNathan Frey, PhD
Machine learning and artificial intelligence have transformed our online experience, and for an increasing number of individuals, these fields are fundamentally changing the way we work. In this talk, I will discuss how machine learning is used in the physical sciences, particularly materials science and chemistry, and what transformative impacts we have seen or might expect to see in the future. This discussion will focus on the unique challenges (and opportunities) faced by materials and chemistry researchers applying machine learning in their work. I will present a brief introduction to machine learning for physical scientists and give examples related to synthesis, property prediction and engineering, and artificial intelligence that “reads” research articles. These examples will introduce some of the most prevalent and useful open-source software tools that drive modern machine learning applications. Two significant themes will be emphasized throughout: the careful evaluation of machine learning results and the central importance of data quality and quantity. Finally, I will provide some mundane, “human learned” speculation about the future of machine learning in physical science and recommended resources for further study.
Machine Learning and Model-Based Optimization for Heterogeneous Catalyst Desi...Ichigaku Takigawa
2nd ICReDD International Symposium—Toward Interdisciplinary Research Guided by Theory and Calculation
Nov. 27 (wed) - Nov. 29 (fri), 2019
https://www.icredd.hokudai.ac.jp/event/1229
Machine Learning for Molecular Graph Representations and GeometriesIchigaku Takigawa
Dec 1, 2021, Pacifico Yokohama, Japan.
Symposium 1AS-17 "Data science and machine learning: Tackling the Noise and Heterogeneity of the Real World"
The 44th Annual Meetingn of the Molecular Biology Society of Japan
https://www2.aeplan.co.jp/mbsj2021/english/designation/index.html
Computational methods to analyze biological data. It is a way to introduce some of the many resources available for analyzing sequence data with bioinformatics software. This paper will cover the theoretical approaches to data resources and we will get knowledge about some sequential alignments with its databases. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics, and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. Databases are essential for bioinformatics research and applications. Many databases exist, covering various information types for example, DNA and protein sequences, molecular structures, phenotypes, and biodiversity. Databases may contain empirical data. Conceptualizing biology in terms of molecules and then applying informatics techniques from math, computer science, and statistics to understand and organize the information associated with these molecules on a large scale. In this materialistic world, People are studying bioinformatics in different ways. Some people are devoted to developing new computational tools, both from software and hardware viewpoints, for the better handling and processing of biological data. They develop new models and new algorithms for existing questions and propose and tackle new questions when new experimental techniques bring in new data. Other people take the study of bioinformatics as the study of biology with the viewpoint of informatics and systems. Durgesh Raghuvanshi | Vivek Solanki | Neha Arora | Faiz Hashmi "Computational of Bioinformatics" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-4 , June 2020, URL: https://www.ijtsrd.com/papers/ijtsrd30891.pdf Paper Url :https://www.ijtsrd.com/engineering/computer-engineering/30891/computational-of-bioinformatics/durgesh-raghuvanshi
Over the past decade, unprecedented progress in the development of neural networks influenced dozens of different industries, including weed recognition in the agro-industrial sector. The use of neural networks in agro-industrial activity in the task of recognizing cultivated crops is a new direction. The absence of any standards significantly complicates the understanding of the real situation of the use of the neural network in the agricultural sector. The manuscript presents the complete analysis of researches over the past 10 years on the use of neural networks for the classification and tracking of weeds due to neural networks. In particular, the analysis of the results of using various neural network algorithms for the task of classification and tracking was presented. As a result, we presented the recommendation for the use of neural networks in the tasks of recognizing a cultivated object and weeds. Using this standard can significantly improve the quality of research on this topic and simplify the analysis and understanding of any paper.
Analysis of Existing Models in Relation to the Problems of Mass Exchange betw...YogeshIJTSRD
The main recommendations of this article mainly analyzing the rate of harmful elements the period of exploitation of the automobile implements and its services to develop activity of automobile implements of the exploitation period. Shavkat Giyazov "Analysis of Existing Models in Relation to the Problems of Mass Exchange between Autotransport Complex and the Environment" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-3 , April 2021, URL: https://www.ijtsrd.com/papers/ijtsrd38681.pdf Paper URL: https://www.ijtsrd.com/engineering/automotive-engineering/38681/analysis-of-existing-models-in-relation-to-the-problems-of-mass-exchange-between-autotransport-complex-and-the-environment/shavkat-giyazov
Listing of Intellectual work of patanjali kashyap , mainly contains , name, details and reference of papers , presentations , patents , public speaking
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyNathan Frey, PhD
Machine learning and artificial intelligence have transformed our online experience, and for an increasing number of individuals, these fields are fundamentally changing the way we work. In this talk, I will discuss how machine learning is used in the physical sciences, particularly materials science and chemistry, and what transformative impacts we have seen or might expect to see in the future. This discussion will focus on the unique challenges (and opportunities) faced by materials and chemistry researchers applying machine learning in their work. I will present a brief introduction to machine learning for physical scientists and give examples related to synthesis, property prediction and engineering, and artificial intelligence that “reads” research articles. These examples will introduce some of the most prevalent and useful open-source software tools that drive modern machine learning applications. Two significant themes will be emphasized throughout: the careful evaluation of machine learning results and the central importance of data quality and quantity. Finally, I will provide some mundane, “human learned” speculation about the future of machine learning in physical science and recommended resources for further study.
Future Directions in Chemical Engineering and BioengineeringIlya Klabukov
"Future Directions in Chemical Engineering and Bioengineering"
January 16-18, 2013
Austin, Texas
Chair: John G. Ekerdt, The University of Texas at Austin
Sponsored by Department of Defense,
Office of the Assistant Secretary of Defense for Research and Engineering
Chemical and biological engineers use math, physics, chemistry, and biology to develop chemical transformations and processes, creating useful products and materials that improve society. In recent years, the boundaries between chemical engineering and bioengineering have blurred as biology has become molecular science, more seamlessly connecting with the historic focus of chemical engineering on molecular interactions and transformations.
This disappearing boundary creates new opportunities for the next generation of engineered systems – hybrid systems that integrate the specificity of biology with chemical and material systems to enable novel applications in catalysis, biomaterials, electronic materials, and energy conversion materials.
Basic research for the U.S. Department of Defense covers a wide range of topics such as metamaterials and plasmonics, quantum information science, cognitive neuroscience, understanding human behavior, synthetic biology, and nanoscience and nanotechnology. Future Directions workshops such as this one identify opportunities
for continuing and future DOD investment. The intent is to create conditions for discovery and transformation, maximize the discovery potential, bring balance and coherence, and foster connections. Basic research stretches the limits of today’s technologies and discovers new phenomena and know-how that ultimately lead to future technologies and enable military and societal progress.
Skoltech at a glance. What is the new type of university? What do we do? What differs us from traditional universities?
And - what it takes to become a Skoltech student?
This paper discusses the several research methodologies that can
be used in Computer Science (CS) and Information Systems
(IS). The research methods vary according to the science
domain and project field. However a little of research
methodologies can be reasonable for Computer Science and
Information System.
Machine Learning in Material Characterizationijtsrd
Machine learning has shown great potential applications in material science. It is widely used in material design, corrosion detection, material screening, new material discovery, and other fields of materials science. The majority of ML approaches in materials science is based on artificial neural networks ANNs . The use of ML and related techniques for materials design, development, and characterization has matured to a main stream field. This paper focuses on the applications of machine learning strategies for material characterization. Matthew N. O. Sadiku | Guddi K. Suman | Sarhan M. Musa "Machine Learning in Material Characterization" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-6 , October 2021, URL: https://www.ijtsrd.com/papers/ijtsrd46392.pdf Paper URL : https://www.ijtsrd.com/engineering/electrical-engineering/46392/machine-learning-in-material-characterization/matthew-n-o-sadiku
In this deck from the HPC User Forum, Rick Stevens from Argonne presents: AI for Science.
"Artificial Intelligence (AI) is making strides in transforming how we live. From the tech industry embracing AI as the most important technology for the 21st century to governments around the world growing efforts in AI, initiatives are rapidly emerging in the space. In sync with these emerging initiatives including U.S. Department of Energy efforts, Argonne has launched an “AI for Science” initiative aimed at accelerating the development and adoption of AI approaches in scientific and engineering domains with the goal to accelerate research and development breakthroughs in energy, basic science, medicine, and national security, especially where we have significant volumes of data and relatively less developed theory. AI methods allow us to discover patterns in data that can lead to experimental hypotheses and thus link data driven methods to new experiments and new understanding."
Watch the video: https://wp.me/p3RLHQ-kQi
Learn more: https://www.anl.gov/topic/science-technology/artificial-intelligence
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Knowledge Management in the AI Driven Scintific SystemSubhasis Dasgupta
In this dynamic talk, we'll explore the transformative role of AI in scientific knowledge management. We'll delve into how AI revolutionizes data organization, analysis, and hypothesis testing, enhancing efficiency and discovery. Highlighting the seamless integration with existing research processes, we'll address the training and ethical considerations of AI adoption. Through real-world examples, we'll demonstrate AI's impact on scientific breakthroughs, emphasizing the shift towards more collaborative and innovative research landscapes. This presentation aims to inspire the scientific community to embrace AI, leveraging its potential to redefine the boundaries of knowledge and innovation.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ...Richard Zijdeman
A glimpse of how we are used to connecting datasets on our laptops and how, imho, need to move to the Web of Data, including a demo connecting various sources all from your(!) machine.
This paper discusses the several research methodologies that can
be used in Computer Science (CS) and Information Systems
(IS). The research methods vary according to the science
domain and project field. However a little of research
methodologies can be reasonable for Computer Science and
Information System.
Exploring Practices in Machine Learning and Machine Discovery for Heterogeneo...Ichigaku Takigawa
Video https://youtu.be/P4QogT8bdqY
ACS Spring 2023 Symposium on AI-Accelerated Scientific Workflow
https://acs.digitellinc.com/acs/sessions/526630/view
ACS SPRING 2023 ———— Crossroads of Chemistry
Indianapolis, IN & Hybrid, March 26-30
https://www.acs.org/meetings/acs-meetings/spring-2023.html
Slide PDF
https://itakigawa.page.link/acs2023spring
Our Paper
Accelerated discovery of multi-elemental reverse water-gas shift catalysts using extrapolative machine learning approach (2022, ChemRxiv)
https://doi.org/10.26434/chemrxiv-2022-695rj
Ichi Takigawa
https://itakigawa.github.io/
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that flls the heliosphere originates from multiple
sources in the solar corona and is highly structured. It is often described
as high-speed, relatively homogeneous, plasma streams from coronal
holes and slow-speed, highly variable, streams whose source regions are
under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify
solar wind sources and understand what drives the complexity seen in the
heliosphere. By combining magnetic feld modelling and spectroscopic
techniques with high-resolution observations and measurements, we show
that the solar wind variability detected in situ by Solar Orbiter in March
2022 is driven by spatio-temporal changes in the magnetic connectivity to
multiple sources in the solar atmosphere. The magnetic feld footpoints
connected to the spacecraft moved from the boundaries of a coronal hole
to one active region (12961) and then across to another region (12957). This
is refected in the in situ measurements, which show the transition from fast
to highly Alfvénic then to slow solar wind that is disrupted by the arrival of
a coronal mass ejection. Our results describe solar wind variability at 0.5 au
but are applicable to near-Earth observatories.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink on data can be made machine and AI-ready - the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is highly conserved process of posttranscriptional gene silencing by which double stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) is reported in a wide range of eukaryotes ranging from worms, insects, mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non- coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the
development of the worm C. elegans.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi ( non coding RNA)
MiRNA
Length (23-25 nt)
Trans acting
Binds with target MRNA in mismatch
Translation inhibition
Si RNA
Length 21 nt.
Cis acting
Bind with target Mrna in perfect complementary sequence
Piwi-RNA
Length ; 25 to 36 nt.
Expressed in Germ Cells
Regulates trnasposomes activity
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is large(>500kD) RNA multi- protein Binding complex which triggers MRNA degradation in response to MRNA
Unwinding of double stranded Si RNA by ATP independent Helicase
Active component of RISC is Ago proteins( ENDONUCLEASE) which cleave target MRNA.
DICER: endonuclease (RNase Family III)
Argonaute: Central Component of the RNA-Induced Silencing Complex (RISC)
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute
ARGONAUTE PROTEIN :
1.PAZ(PIWI/Argonaute/ Zwille)- Recognition of target MRNA
2.PIWI (p-element induced wimpy Testis)- breaks Phosphodiester bond of mRNA.)RNAse H activity.
MiRNA:
The Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression .
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for the ultra-fast high-resolution imaging of cellular processes over time and space and were studied in its natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provide insights into the progression of disease, response to treatments or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.
Richard's aventures in two entangled wonderlandsRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200pc, stellar masses of
M⋆ ∼ 107−108M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr−1
. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
The interplay between data-driven and theory-driven methods for chemical sciences
1. The interplay between data-driven and
theory-driven methods for chemical sciences
Ichigaku Takigawa
• Medical-risk Avoidance based on iPS Cells Team,
RIKEN Center for Advanced Intelligence Project (AIP)
• Institute for Chemical Reaction Design and Discovery
(WPI-ICReDD), Hokkaido University
ichigaku.takigawa@riken.jp
2. Research Interests: Ichigaku Takigawa
• Machine learning and data mining technologies
• Data-intensive approaches to natural sciences
10 years Hokkaido University (Computer Sciences)
• Grad School. Engineering
7 years Kyoto University (Bioinformatics, Chemoinformatics)
• Bioinformatics Center, Inst. Chemical Research
• Grad School. Pharmaceutical Sciences
(1995-2004)
(2005-2011)
7 years Hokkaido University (Computer Sciences)
• Grad School. Information Science and Technology
• JST PRESTO on Materials Informatics (2015-2018)
(2012-2018)
? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology)
• Center for AI Project(2019-)
Hokkaido University (Machine Learning + Chemistry)
• Inst. Chemical Reaction Design and Discovery
From AAAI-20 Oxford-Style Debate
Kyoto
Hokkaido
Sapporo
3. Research Interests: Ichigaku Takigawa
• Machine learning and data mining technologies
• Data-intensive approaches to natural sciences
10 years Hokkaido University (Computer Sciences)
• Grad School. Engineering
7 years Kyoto University (Bioinformatics, Chemoinformatics)
• Bioinformatics Center, Inst. Chemical Research
• Grad School. Pharmaceutical Sciences
(1995-2004)
(2005-2011)
7 years Hokkaido University (Computer Sciences)
• Grad School. Information Science and Technology
• JST PRESTO on Materials Informatics (2015-2018)
(2012-2018)
? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology)
• Center for AI Project(2019-)
Hokkaido University (Machine Learning + Chemistry)
• Inst. Chemical Reaction Design and Discovery
From AAAI-20 Oxford-Style Debate
Kyoto
Hokkaido
Sapporo
From AAAI-20 Oxford-Style Debate
4. Research Interests: Ichigaku Takigawa
• Machine learning and data mining technologies
• Data-intensive approaches to natural sciences
10 years Hokkaido University (Computer Sciences)
• Grad School. Engineering
7 years Kyoto University (Bioinformatics, Chemoinformatics)
• Bioinformatics Center, Inst. Chemical Research
• Grad School. Pharmaceutical Sciences
(1995-2004)
(2005-2011)
7 years Hokkaido University (Computer Sciences)
• Grad School. Information Science and Technology
• JST PRESTO on Materials Informatics (2015-2018)
(2012-2018)
? years RIKEN@Kyoto (Machine Learning + Stem Cell Biology)
• Center for AI Project(2019-)
Hokkaido University (Machine Learning + Chemistry)
• Inst. Chemical Reaction Design and Discovery
From AAAI-20 Oxford-Style Debate
Kyoto
Hokkaido
Sapporo
5. Machine Learning with "Discrete Structures"
Q. How we can make use of "discrete structures" for prediction...?
sets, sequences, branchings or hiearchies (trees), networks (graphs), relations,
logic, rules, combinations, permutations,, point clouds, algebra, langages, ...
H
H
H
H
H
H
H
H
O
N
O
O
H
H
H
O
O
H
H
N
O
O
Cl
ClCl
• The target variables can come with "discrete structures"
• The relationship between variables can have "discrete structures"
• The ML models themselves can involve "discrete structures"
6. Past work: Data-intensive approaches to life science
• Transcriptional regulation of metabolism
Bioinformatics 2007, 2008a, 2008b, 2009, 2010
Nucleic Acids Res 2011, PLoS One 2012, 2013, KDD'07
• Transcription regulation by mediator complex
Nat Commun 2015, Nat Commun 2020
• Repetitive sequences in genomes
Discrete Appl Math 2013, 2016, AAAI 2020
• Polypharmacology of drug-target interaction networks
PLoS One 2011, Drug Discov Today 2013, Brief Bioinform 2014, BMC
Bioinformatics 2020
• Copy number variations in neurological disorder (MSA, MS)
Mol Brain 2017
• Substrate analysis of modulator protease (Calpain family)
Mol Cell Proteom 2016, Genome Informatics 2009, http://calpain.org
• Cell competition in cancer
Cell Reports 2018, Sci Rep 2015
• Phenotype patterns of somatic mutations in breast cancer
Brief Bioinform 2014
7. Recent work: Data-intensive approaches to chemistry
Machine learning for heterogeneous catalysis
Haber–Bosch Process
Ferrous Metal Catalysis
NOx
CO
HC
N2
CO2
H2O
Exhaust Gas Harmless gas
Noble Metal Catalysis (Pt, Pd, Rh…)
• Ethane
• Ethylene
• Methanol
• :
Methane
Various Metallic Catalysts
(Li, rare earthes, alkaline earths)
(industrial synthesis of ammonia)
Exhaust Gas Purification Conversion of Methane
“Fertilizer from Air”
artificial nitrogen fixation
• ACS Catalysis 2020 (review)
• ChemCatChem 2019 (front cover)
• J Phys Chem C 2018a, 2018b, 2019 (cover)
• RSC Advances 2016 (highlighted in Chemical World)
catalysis where the phase of the catalyst differs from
the phase of the reactants or products.
Reactants in
the gas phase
Catalysts in
the solid phase
8. Effective use of data is another key in natural sciences
REVIEW
Inverse molecular design using
machine learning: Generative models
for matter engineering
Benjamin Sanchez-Lengeling1
and Alán Aspuru-Guzik2,3,4
*
The discovery of new materials can bring enormous societal and technological progress. In this
context, exploring completely the large space of potential materials is computationally
intractable. Here, we review methods for achieving inverse design, which aims to discover
tailored materials from the starting point of a particular desired functionality. Recent advances
from the rapidly growing field of artificial intelligence, mostly from the subfield of machine
learning, have resulted in a fertile exchange of ideas, where approaches to inverse molecular
design are being proposed and employed at a rapid pace. Among these, deep generative models
have been applied to numerous classes of materials: rational design of prospective drugs,
synthetic routes to organic compounds, and optimization of photovoltaics and redox flow
batteries, as well as a variety of other solid-state materials.
M
any of the challenges of the 21st century
(1), from personalized health care to
energy production and storage, share a
common theme: materials are part of
the solution (2). In some cases, the solu-
mapped 166.4 billion molecules that contain at
most 17 heavy atoms. For pharmacologically rele-
vant small molecules, the number of structures is
estimated to be on the order of 1060
(9). Adding
consideration of the hierarchy of scale from sub-
act properties. In practice, approximations are
used to lower computational time at the cost of
accuracy.
Although theory enjoys enormous progress,
now routinely modeling molecules, clusters, and
perfect as well as defect-laden periodic solids, the
size of chemical space is still overwhelming, and
smart navigation is required. For this purpose,
machine learning (ML), deep learning (DL), and
artificial intelligence (AI) have a potential role
to play because their computational strategies
automatically improve through experience (11).
In the context of materials, ML techniques are
often used for property prediction, seeking to
learn a function that maps a molecular material
to the property of choice. Deep generative models
are a special class of DL methods that seek to
model the underlying probability distribution of
both structure and property and relate them in a
nonlinear way. By exploiting patterns in massive
datasets, these models can distill average and
salient features that characterize molecules (12, 13).
Inverse design is a component of a more
complex materials discovery process. The time
scale for deployment of new technologies, from
discovery in a laboratory to a commercial pro-
duct, historically, is 15 to 20 years (14). The pro-
cess (Fig. 1) conventionally involves the following
steps: (i) generate a new or improved material
FRONTIERS IN COMPUTATION
httpDownloadedfrom
REVIEW https://doi.org/10.1038/s41586-018-0337-2
Machine learning for molecular and
materials science
Keith T. Butler1
, Daniel W. Davies2
, Hugh Cartwright3
, Olexandr Isayev4
* & Aron Walsh5,6
*
Here we summarize recent progress in machine learning for the chemical sciences. We outline machine-learning
techniques that are suitable for addressing research questions in this domain, as well as future directions for the field.
We envisage a future in which the design, synthesis, characterization and application of molecules and materials is
accelerated by artificial intelligence.
T
he Schrödinger equation provides a powerful structure–
property relationship for molecules and materials. For a given
spatial arrangement of chemical elements, the distribution of
electrons and a wide range of physical responses can be described. The
development of quantum mechanics provided a rigorous theoretical
foundationforthechemicalbond.In1929,PaulDiracfamouslyproclaimed
that the underlying physical laws for the whole of chemistry are “completely
known”1
. John Pople, realizing the importance of rapidly developing
generating, testing and refining scientific models. Such techniques are
suitable for addressing complex problems that involve massive combi-
natorial spaces or nonlinear processes, which conventional procedures
either cannot solve or can tackle only at great computational cost.
As the machinery for artificial intelligence and machine learning
matures, important advances are being made not only by those in main-
stream artificial-intelligence research, but also by experts in other fields
(domain experts) who adopt these approaches for their own purposes. As
DNA to be sequences into distinct pieces,
parcel out the detailed work of sequencing,
and then reassemble these independent ef-
forts at the end. It is not quite so simple in the
world of genome semantics.
Despite the differences between genome se-
quencing and genetic network discovery, there
are clear parallels that are illustrated in Table 1.
In genome sequencing, a physical map is useful
to provide scaffolding for assembling the fin-
ished sequence. In the case of a genetic regula-
tory network, a graphical model can play the
same role. A graphical model can represent a
high-level view of interconnectivity and help
isolate modules that can be studied indepen-
dently. Like contigs in a genomic sequencing
project, low-level functional models can ex-
plore the detailed behavior of a module of genes
in a manner that is consistent with the higher
level graphical model of the system. With stan-
dardized nomenclature and compatible model-
ing techniques, independent functional models
can be assembled into a complete model of the
cell under study.
To enable this process, there will need to
be standardized forms for model representa-
tion. At present, there are many different
modeling technologies in use, and although
models can be easily placed into a database,
they are not useful out of the context of their
specific modeling package. The need for a
standardized way of communicating compu-
tational descriptions of biological systems ex-
tends to the literature. Entire conferences
have been established to explore ways of
mining the biology literature to extract se-
mantic information in computational form.
Going forward, as a community we need
to come to consensus on how to represent
what we know about biology in computa-
tional form as well as in words. The key to
postgenomic biology will be the computa-
tional assembly of our collective knowl-
edge into a cohesive picture of cellular and
organism function. With such a comprehen-
sive model, we will be able to explore new
types of conservation between organisms
and make great strides toward new thera-
peutics that function on well-characterized
pathways.
References
1. S. K. Kim et al., Science 293 , 2087 (2001).
2. A. Hartemink et al., paper presented at the Pacific
Symposium on Biocomputing 2000, Oahu, Hawaii, 4
to 9 January 2000.
3. D. Pe’er et al., paper presented at the 9th Conference
on Intelligent Systems in Molecular Biology (ISMB),
Copenhagen, Denmark, 21 to 25 July 2001.
4. H. McAdams, A. Arkin, Proc. Natl. Acad. Sci. U.S.A.
94 , 814 ( 1997 ).
5. A. J. Hartemink, thesis, Massachusetts Institute of
Technology, Cambridge (2001).
V I E W P O I N T
Machine Learning for Science: State of the
Art and Future Prospects
Eric Mjolsness* and Dennis DeCoste
Recent advances in machine learning methods, along with successful
applications across a wide variety of fields such as planetary science and
bioinformatics, promise powerful new tools for practicing scientists. This
viewpoint highlights some useful characteristics of modern machine learn-
ing methods and their relevance to scientific applications. We conclude
with some speculations on near-term progress and promising directions.
Machine learning (ML) (1) is the study of
computer algorithms capable of learning to im-
prove their performance of a task on the basis of
their own previous experience. The field is
closely related to pattern recognition and statis-
tical inference. As an engineering field, ML has
become steadily more mathematical and more
successful in applications over the past 20
years. Learning approaches such as data clus-
tering, neural network classifiers, and nonlinear
regression have found surprisingly wide appli-
cation in the practice of engineering, business,
and science. A generalized version of the stan-
dard Hidden Markov Models of ML practice
have been used for ab initio prediction of gene
structures in genomic DNA (2). The predictions
correlate surprisingly well with subsequent
gene expression analysis (3). Postgenomic bi-
ology prominently features large-scale gene ex-
pression data analyzed by clustering methods
(4), a standard topic in unsupervised learning.
Many other examples can be given of learning
and pattern recognition applications in science.
Where will this trend lead? We believe it will
lead to appropriate, partial automation of every
element of scientific method, from hypothesis
generation to model construction to decisive
experimentation. Thus, ML has the potential to
amplify every aspect of a working scientist’s
progress to understanding. It will also, for better
or worse, endow intelligent computer systems
with some of the general analytic power of
scientific thinking.
creating hypotheses, testing by decisive exper-
iment or observation, and iteratively building
up comprehensive testable models or theories is
shared across disciplines. For each stage of this
abstracted scientific process, there are relevant
developments in ML, statistical inference, and
pattern recognition that will lead to semiauto-
matic support tools of unknown but potentially
broad applicability.
Increasingly, the early elements of scientific
method—observation and hypothesis genera-
tion—face high data volumes, high data acqui-
sition rates, or requirements for objective anal-
ysis that cannot be handled by human percep-
tion alone. This has been the situation in exper-
imental particle physics for decades. There
automatic pattern recognition for significant
events is well developed, including Hough
transforms, which are foundational in pattern
recognition. A recent example is event analysis
for Cherenkov detectors (8) used in neutrino
oscillation experiments. Microscope imagery in
cell biology, pathology, petrology, and other
fields has led to image-processing specialties.
So has remote sensing from Earth-observingMachine Learning Systems Group, Jet Propulsion Lab-
Table 1. Parallels between genome sequencing
and genetic network discovery.
Genome
sequencing
Genome semantics
Physical maps Graphical model
Contigs Low-level functional
models
Contig
reassembly
Module assembly
Finished genome
sequence
Comprehensive model
C O M P U T E R S A N D S C I E N C E
onAugust29,2018http://science.sciencemag.org/Downloadedfrom
Nature, 559
pp. 547–555 (2018)
Science, 293
pp. 2051-2055 (2001)
Science, 361
pp. 360-365 (2018)
Science is changing, the tools of science are changing. And
that requires different approaches. ── Erich Bloch, 1925-2016
(alongside experiments and simulations)
And please keep in mind that unplanned data collection is too risky.
We need right designs for data collection and right tools to analyze.
A bitter lesson: "low input, high throughput, no output science." (Sydney Brenner)
9. Today's AI has stark limitations, facing big problems
• Deep learning techniques thus far have proven to be data hungry, shallow,
brittle, and limited in their ability to generalize (Marcus, 2018)
• Current machine learning techniques are data-hungry and brittle—they can
only make sense of patterns they've seen before. (Chollet, 2020)
• A growing body of evidence shows that state-of-the-art models learn to exploit
spurious statistical patterns in datasets... instead of learning meaning in the
flexible and generalizable way that humans do. (Nie et al., 2019)
• Current machine learning methods seem weak when they are required to
generalize beyond the training distribution, which is what is often needed in
practice. (Bengio et al., 2019)
Though it's extremely powerful and useful in suitable applications!
10. The AI/ML community is now struggling with these...
• The Winograd Schema Challenge (Levesque et al., 2011)
A collection of 273 multiple-choice problems that requires commonsense
reasoning to solve as an alternative to Turing test (Turing, 1950).
Many methods passed this test of "Imitation Games" were
actually tricks being fairly adept at fooling humans...
Unexpectedly, it turned out that several tricks using BERT/Transformer
can crack many of these (with >90% accuracy) ...
• WinoGrande (Sakaguchi et al., 2020)
An improved collection of 44k very hard problems
(AAAI-20 Outstanding Paper Award)
Though some top-ranking methods can have 70-80% accuracy ...
• Abstraction and Reasoning Challenge at Kaggle (Chollet, 2020)
"Kaggle’s most important AI competition" "Impossible AI Challenge"
At the moment, to what extent we can create an AI capable of solving
reasoning tasks it has never seen before?
11. Examples
WSC-273
WinoGrande-1.1
(Q0) The town councillors refused to give the angry demonstrators
a permit because they feared violence. Who feared violence?
• the town councillors
• the angry demonstrators
(Q1) The trophy would not fit in the brown suitcase because it was
so small. What was so small?
• the trophy
• the brown suitcase
(due to Terry Winograd)
(Q) He never comes to my home, but I always go to his house
because the ______ is smaller.
1. home
2. house
12. Take home message
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. But in many cases, we cannot have enough data for
various practical restrictions (cost, time, ethics, privacy, etc).
13. Take home message
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. But in many cases, we cannot have enough data for
various practical restrictions (cost, time, ethics, privacy, etc).
• Model-based learning, Neuro-symbolic or ML-GOFAI integration.
We can partly use explicit models for well-understood parts for sample
efficiency and for filling the gap between correlation and causation.
14. Take home message
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. But in many cases, we cannot have enough data for
various practical restrictions (cost, time, ethics, privacy, etc).
• Model-based learning, Neuro-symbolic or ML-GOFAI integration.
We can partly use explicit models for well-understood parts for sample
efficiency and for filling the gap between correlation and causation.
• Needs for modeling unverbalizable comon sense of domain experts
We need a good strategy for building a gigantic data collection for this,
as well as self-supervised learning and/or meta learning algorithms.
"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI
systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
15. Take home message
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. But in many cases, we cannot have enough data for
various practical restrictions (cost, time, ethics, privacy, etc).
• Model-based learning, Neuro-symbolic or ML-GOFAI integration.
We can partly use explicit models for well-understood parts for sample
efficiency and for filling the gap between correlation and causation.
• Needs for modeling unverbalizable comon sense of domain experts
We need a good strategy for building a gigantic data collection for this,
as well as self-supervised learning and/or meta learning algorithms.
"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI
systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
• Needs for novel techniques for compositionality (combinatorial
generalization), out-of-distribution prediction (extrapolation), and
their flexible transfer
We need to somehow combine and flexibly transfer partial knowledge
to generate a new thing or deal with completely new situations.
16. Take home message
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. But in many cases, we cannot have enough data for
various practical restrictions (cost, time, ethics, privacy, etc).
• Model-based learning, Neuro-symbolic or ML-GOFAI integration.
We can partly use explicit models for well-understood parts for sample
efficiency and for filling the gap between correlation and causation.
• Needs for modeling unverbalizable comon sense of domain experts
We need a good strategy for building a gigantic data collection for this,
as well as self-supervised learning and/or meta learning algorithms.
"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI
systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
• Needs for novel techniques for compositionality (combinatorial
generalization), out-of-distribution prediction (extrapolation), and
their flexible transfer
We need to somehow combine and flexibly transfer partial knowledge
to generate a new thing or deal with completely new situations.
17. They are trantitions from a valley to a hill
to another valley, but solving a quantum
chemical equation is needed at every
point to get the energy surface
Chemistry has the first principle at the electron level
θ1
θ2
Schrödinger equation
Potential Energy Surface (PES)
Outcomes we see are just
discrete and combinatorialstable state 1 transition state stable state 2
Energy
Every substance is a combination of
only 118 elements in the periodic table.
Our end goal:
take full control over chemical reactions,
the basis for transforming substances to
another (energy, materials, foods, colors, ...)
Recombinations of atoms and
chemical bonds are subjected to
the laws of nature
19. But chemistry still remains quite empirical, and why?
Even though current chemical calculations are fairly accurate,
chemists would still heavily rely on "Edisonian empiricism (trial & error)."
1. empirical knowledge (from the literature) and their flexible transfer
2. intuitions and experiences (unverbalizable common sense of experts)
20. But chemistry still remains quite empirical, and why?
Even though current chemical calculations are fairly accurate,
chemists would still heavily rely on "Edisonian empiricism (trial & error)."
1. empirical knowledge (from the literature) and their flexible transfer
2. intuitions and experiences (unverbalizable common sense of experts)
Takeaway: We also need 'data-driven' bridges!
first principles are not enough for us to throw away these empirical
things; data-driven approaches (ML) play a complementary role!
Fusing multi-disciplinary methods...?
1. Theory-driven: (slow but) high-fedelity quantum-chemical simulations
2. Knowledge-driven: explicit knowns and inference
3. Data-driven: empirical prediction grounded upon multifaceted data
21. Indeed computational chemistry has many limitations...
Chemical reactions = recombinations of atoms and chemical bonds
subjected to the laws of nature
• Intractably large chemical space: A intractably large number of
"theoretically possible" candidates for reactions and compounds...
• Computing time and scalability issue: Simulating an Avogadro-constant
number of atoms is utterly infeasible... (we need some compromise here)
• Complexity and uncertainty of real-world systems:
Many uncertain factors and arbitrary parameters are involved...
• Known and unknown imperfections of currently established theories:
Current theoretical calculations have many exceptions and limitations...
22. Yet another approach: Data-driven
Cause-and-Effect
Relationship
Related factors
(and their states)
Outcome
Reactions
Some
mechanism
[Inputs] [Outputs]
Theory-driven methods try to explicitly model the inner workings
of a target phenomenon (e.g. through first-principles simulations)
Data-driven methods try to precisely approximate its outer behavior
(the input-output relationship) observable as "data".
(e.g. through machine learning from a large collection of data)
based on very different principles and quite complementary!
governing equation?
23. Machine Learning (ML)
Generic Object Recognition
Speech Recognition
Machine Translation
QSAR/QSPR Prediction
AI Game Players
“ありがとう”
J’aime la
musique I love music
CH3
N
N
H
N
H
H3C
N
Growth inhibition
0.739
A new style of programming
a technique to reproduce a transformation process (or function) where
the underlying principle is unclear and hard to be explicitly modelled
just by giving a lot of input-output examples.
24. How ML works: fitting a function to data
Inputs Outputs
ML modelx<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
A function best fitted to
a given set of example input-output pairs
(the training data).
(x1, y1), (x2, y2), . . . , (xn, yn)<latexit sha1_base64="l6cITD180CW4htD4CpRHsG+/qlU=">AAAC+3ichVE9T9xAEB1MSMCQcCENUppTTiCQTqdZJwoHFSINJV8HSBidbLMcK/wle++Uw7o/kD9AQQUoBYE2Tdo0/IEUdLRASaQ0FMz6fCCKI2PZ+/bNvNm3Hjt0RSwRL3q03hd9L1/1D+iDQ6/fDOfejqzGQT1yeMUJ3CBat62Yu8LnFSmky9fDiFue7fI1e/eLyq81eBSLwF+RzZBvelbNF9vCsSRR1dyKaYuaayb6hGl7yddWlRUVaBKYLD6QRoc0iDS3AhnrxU7O7+T8SV01i8xWNVfAEiIyxvIKsKnPSGB6umywcp6pFEUBslgIcudgwhYE4EAdPODggyTsggUxPRvAACEkbhMS4iJCIs1zaIFO2jpVcaqwiN2lb412Gxnr0171jFO1Q6e49EakzMMY/sETvMVzPMUrvOvaK0l7KC9NWu22lofV4W+jy//+q/JolbDzqHrWs4RtKKdeBXkPU0bdwmnrG3v7t8szS2PJOB7hDfk/xAv8TTfwG3+d74t86eAZPzZ56f7HVD6roBF25pTvDlaNEvtYMhY/FWbnsmH2w3v4ABM0sSmYhXlYgAqd8Asu4QqutZZ2rP3QztqlWk+meQdPQvt5D2QXt/E=</latexit>
f(x; ✓)<latexit sha1_base64="33zlDWOHXmZvZwuGF4JM5OwSths=">AAAC53ichVFNT9RQFD1UEUSUUTckbhonGEzI5HYEnJHNRDcu+RogmZlM2vpmeKFfad9MxKZ/wI07NXGFiSaGn8GGhVtM+AmEJSRuXHjbKSHGDNymfeede8995/VagSMjRXQ8ot24OXprbPz2xJ3Ju/emCvcfbER+L7RF3fYdP9yyzEg40hN1JZUjtoJQmK7liE1r51Wa3+yLMJK+t652A9Fyza4nO9I2FVPtQq0zGzezNo2wa7ViKlEWc/+BRG9abvw2SZYy0FTbQpnJ03ahSKVqZZHmy3paWinTIoMFMqpGVTdycRF5LPuFQzTxBj5s9OBCwINi7MBExE8DBggBcy3EzIWMZJYXSDDB2h5XCa4wmd3hb5d3jZz1eJ/2jDK1zac4/Ias1DFDR/SDzuiQ9umE/gztFWc9Ui+7vFoDrQjaU++n135fq3J5Vdi+VF3pWaGDSuZVsvcgY9Jb2AN9/92ns7UXqzPxE/pKp+x/j47pgG/g9c/tbyti9csVfiz2MvyPpfm8gkd4MSd9ONgol4xnpfLKfLH2Mh/mOB7hMWZ5Ys9Rw2sso84nfMdPHOGXJrUP2kft86BUG8k1D/FPaHt/AeNtrV0=</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
✓<latexit sha1_base64="lTDMb0MkD19hJd3A01US8Q9hJ/8=">AAACsHichVG9TgJBEB7PP/wFtTGxIRKMFZk7EcGKaGMJIkoChNydK5zeX+4WEiS8gL2xMNFoYmF8DBtewMJHMJaa2Fg4d5wxFuhsdnf2m/lmv91RbF1zOeLzkDA8Mjo2HpqYnJqemQ1H5ub3XavpqKyoWrrllBTZZbpmsiLXuM5KtsNkQ9HZgXKy7cUPWsxxNcvc422bVQ25bmpHmipzgsoVxehUeINxuVuLxDCRSacwKUUxgZiWMEXOOooZMRMVCfEsBoHlrEgPKnAIFqjQBAMYmMDJ10EGl0YZRECwCatChzCHPM2PM+jCJHGblMUoQyb0hNY6ncoBatLZq+n6bJVu0Wk6xIxCHJ/wHt+whw/4gp8Da3X8Gp6WNu1Kn8vsWvhssfDxL8ugnUPjh/WnZg5HkPa1aqTd9hHvFWqf3zq9eCts7sY7K3iLr6T/Bp/xkV5gtt7VuzzbvfxDj0JaBv+YFw8yqIXffYoOdvalhLiWkPLJWHYraGYIlmAZVqljG5CFHchB0f/zc7iCa0ESSkJNkPupwlDAWYBfJhx/AVqsnB0=</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
interpolative
prediction
(High-dimensional)
HighLow Model Complexity
Underfitting
(High bias, Low variance)
Overfitting
(Low bias, High variance)
"The bias-variance tradeoff"
The training data
f(x; ✓)<latexit sha1_base64="33zlDWOHXmZvZwuGF4JM5OwSths=">AAAC53ichVFNT9RQFD1UEUSUUTckbhonGEzI5HYEnJHNRDcu+RogmZlM2vpmeKFfad9MxKZ/wI07NXGFiSaGn8GGhVtM+AmEJSRuXHjbKSHGDNymfeede8995/VagSMjRXQ8ot24OXprbPz2xJ3Ju/emCvcfbER+L7RF3fYdP9yyzEg40hN1JZUjtoJQmK7liE1r51Wa3+yLMJK+t652A9Fyza4nO9I2FVPtQq0zGzezNo2wa7ViKlEWc/+BRG9abvw2SZYy0FTbQpnJ03ahSKVqZZHmy3paWinTIoMFMqpGVTdycRF5LPuFQzTxBj5s9OBCwINi7MBExE8DBggBcy3EzIWMZJYXSDDB2h5XCa4wmd3hb5d3jZz1eJ/2jDK1zac4/Ias1DFDR/SDzuiQ9umE/gztFWc9Ui+7vFoDrQjaU++n135fq3J5Vdi+VF3pWaGDSuZVsvcgY9Jb2AN9/92ns7UXqzPxE/pKp+x/j47pgG/g9c/tbyti9csVfiz2MvyPpfm8gkd4MSd9ONgol4xnpfLKfLH2Mh/mOB7hMWZ5Ys9Rw2sso84nfMdPHOGXJrUP2kft86BUG8k1D/FPaHt/AeNtrV0=</latexit>
extrapolative
prediction
25. How ML works: fitting a function to data
Inputs Outputs
ML modelx<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
A function best fitted to
a given set of example input-output pairs
(the training data).
(x1, y1), (x2, y2), . . . , (xn, yn)<latexit sha1_base64="l6cITD180CW4htD4CpRHsG+/qlU=">AAAC+3ichVE9T9xAEB1MSMCQcCENUppTTiCQTqdZJwoHFSINJV8HSBidbLMcK/wle++Uw7o/kD9AQQUoBYE2Tdo0/IEUdLRASaQ0FMz6fCCKI2PZ+/bNvNm3Hjt0RSwRL3q03hd9L1/1D+iDQ6/fDOfejqzGQT1yeMUJ3CBat62Yu8LnFSmky9fDiFue7fI1e/eLyq81eBSLwF+RzZBvelbNF9vCsSRR1dyKaYuaayb6hGl7yddWlRUVaBKYLD6QRoc0iDS3AhnrxU7O7+T8SV01i8xWNVfAEiIyxvIKsKnPSGB6umywcp6pFEUBslgIcudgwhYE4EAdPODggyTsggUxPRvAACEkbhMS4iJCIs1zaIFO2jpVcaqwiN2lb412Gxnr0171jFO1Q6e49EakzMMY/sETvMVzPMUrvOvaK0l7KC9NWu22lofV4W+jy//+q/JolbDzqHrWs4RtKKdeBXkPU0bdwmnrG3v7t8szS2PJOB7hDfk/xAv8TTfwG3+d74t86eAZPzZ56f7HVD6roBF25pTvDlaNEvtYMhY/FWbnsmH2w3v4ABM0sSmYhXlYgAqd8Asu4QqutZZ2rP3QztqlWk+meQdPQvt5D2QXt/E=</latexit>
f(x; ✓)<latexit sha1_base64="33zlDWOHXmZvZwuGF4JM5OwSths=">AAAC53ichVFNT9RQFD1UEUSUUTckbhonGEzI5HYEnJHNRDcu+RogmZlM2vpmeKFfad9MxKZ/wI07NXGFiSaGn8GGhVtM+AmEJSRuXHjbKSHGDNymfeede8995/VagSMjRXQ8ot24OXprbPz2xJ3Ju/emCvcfbER+L7RF3fYdP9yyzEg40hN1JZUjtoJQmK7liE1r51Wa3+yLMJK+t652A9Fyza4nO9I2FVPtQq0zGzezNo2wa7ViKlEWc/+BRG9abvw2SZYy0FTbQpnJ03ahSKVqZZHmy3paWinTIoMFMqpGVTdycRF5LPuFQzTxBj5s9OBCwINi7MBExE8DBggBcy3EzIWMZJYXSDDB2h5XCa4wmd3hb5d3jZz1eJ/2jDK1zac4/Ias1DFDR/SDzuiQ9umE/gztFWc9Ui+7vFoDrQjaU++n135fq3J5Vdi+VF3pWaGDSuZVsvcgY9Jb2AN9/92ns7UXqzPxE/pKp+x/j47pgG/g9c/tbyti9csVfiz2MvyPpfm8gkd4MSd9ONgol4xnpfLKfLH2Mh/mOB7hMWZ5Ys9Rw2sso84nfMdPHOGXJrUP2kft86BUG8k1D/FPaHt/AeNtrV0=</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
✓<latexit sha1_base64="lTDMb0MkD19hJd3A01US8Q9hJ/8=">AAACsHichVG9TgJBEB7PP/wFtTGxIRKMFZk7EcGKaGMJIkoChNydK5zeX+4WEiS8gL2xMNFoYmF8DBtewMJHMJaa2Fg4d5wxFuhsdnf2m/lmv91RbF1zOeLzkDA8Mjo2HpqYnJqemQ1H5ub3XavpqKyoWrrllBTZZbpmsiLXuM5KtsNkQ9HZgXKy7cUPWsxxNcvc422bVQ25bmpHmipzgsoVxehUeINxuVuLxDCRSacwKUUxgZiWMEXOOooZMRMVCfEsBoHlrEgPKnAIFqjQBAMYmMDJ10EGl0YZRECwCatChzCHPM2PM+jCJHGblMUoQyb0hNY6ncoBatLZq+n6bJVu0Wk6xIxCHJ/wHt+whw/4gp8Da3X8Gp6WNu1Kn8vsWvhssfDxL8ugnUPjh/WnZg5HkPa1aqTd9hHvFWqf3zq9eCts7sY7K3iLr6T/Bp/xkV5gtt7VuzzbvfxDj0JaBv+YFw8yqIXffYoOdvalhLiWkPLJWHYraGYIlmAZVqljG5CFHchB0f/zc7iCa0ESSkJNkPupwlDAWYBfJhx/AVqsnB0=</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
x<latexit sha1_base64="BkOuic6isW1cYY2ZNWiUOdOU/tM=">AAACq3ichVFNLwNRFD3G92eLjcRGNMRG8wYJsWrYWJbqR2jTzIxXHvOVmdcGTf+AlZ1gRWIhfoaNP2DhJ4gliY2FO9NJBFF3MvPOO/ee+86bq7um8CVjT21Ke0dnV3dPb1//wOBQLD48kvOdqmfwrOGYjlfQNZ+bwuZZKaTJC67HNUs3eV4/WA3y+Rr3fOHYm/LI5SVL27VFRRiaJCpX1K36YaMcT7AkC2PiN1AjkEAUaSf+gCJ24MBAFRY4bEjCJjT49GxDBYNLXAl14jxCIsxzNNBH2ipVcarQiD2g7y7ttiPWpn3Q0w/VBp1i0uuRcgJT7JHdslf2wO7YM/v4s1c97BF4OaJVb2q5W46djGXe/1VZtErsfalaepaoYCn0Ksi7GzLBLYymvnZ89ppZ3piqT7Nr9kL+r9gTu6cb2LU342adb1y28KOTl7//WJCPKmiE6s+B/Qa5uaQ6n5xbX0ikVqJh9mAck5ihiS0ihTWkkaUT9nGKc1wos0pG2VKKzVKlLdKM4lso/BN9vpmr</latexit>
y<latexit sha1_base64="getvLqfzl+lmP3jVELri0P4Sr2g=">AAACq3ichVG7SgNBFD1ZX/EdtRFsxBCxMUxUUKxEG0tNzAMTkd11Ekf3xe4kEEN+wMpO1ErBQvwMG3/Awk8Qywg2Ft7dLIgG9S67c+bce+6c2as5hvAkY88Rpau7p7cv2j8wODQ8MhobG895dtXVeVa3DdstaKrHDWHxrBTS4AXH5aqpGTyvHW/4+XyNu56wrR1Zd/ieqVYsURa6KonKlTSzUW/ux+IsyYKY7gSpEMQRxpYde0QJB7ChowoTHBYkYQMqPHqKSIHBIW4PDeJcQiLIczQxQNoqVXGqUIk9pm+FdsWQtWjv9/QCtU6nGPS6pJxGgj2xO9Zij+yevbCPX3s1gh6+lzqtWlvLnf3R08nM+78qk1aJwy/Vn54lylgJvAry7gSMfwu9ra+dnLcyq+lEY5bdsFfyf82e2QPdwKq96bfbPH31hx+NvPz+x/x8WEEjTP0cWCfILSRTi8mF7aX42no4zCimMIM5mtgy1rCJLWTphCOc4QKXyrySUXaVUrtUiYSaCXwLhX8CgAGZrA==</latexit>
interpolative
prediction
(High-dimensional)
HighLow Model Complexity
Underfitting
(High bias, Low variance)
Overfitting
(Low bias, High variance)
"The bias-variance tradeoff"
The training data
f(x; ✓)<latexit sha1_base64="33zlDWOHXmZvZwuGF4JM5OwSths=">AAAC53ichVFNT9RQFD1UEUSUUTckbhonGEzI5HYEnJHNRDcu+RogmZlM2vpmeKFfad9MxKZ/wI07NXGFiSaGn8GGhVtM+AmEJSRuXHjbKSHGDNymfeede8995/VagSMjRXQ8ot24OXprbPz2xJ3Ju/emCvcfbER+L7RF3fYdP9yyzEg40hN1JZUjtoJQmK7liE1r51Wa3+yLMJK+t652A9Fyza4nO9I2FVPtQq0zGzezNo2wa7ViKlEWc/+BRG9abvw2SZYy0FTbQpnJ03ahSKVqZZHmy3paWinTIoMFMqpGVTdycRF5LPuFQzTxBj5s9OBCwINi7MBExE8DBggBcy3EzIWMZJYXSDDB2h5XCa4wmd3hb5d3jZz1eJ/2jDK1zac4/Ias1DFDR/SDzuiQ9umE/gztFWc9Ui+7vFoDrQjaU++n135fq3J5Vdi+VF3pWaGDSuZVsvcgY9Jb2AN9/92ns7UXqzPxE/pKp+x/j47pgG/g9c/tbyti9csVfiz2MvyPpfm8gkd4MSd9ONgol4xnpfLKfLH2Mh/mOB7hMWZ5Ys9Rw2sso84nfMdPHOGXJrUP2kft86BUG8k1D/FPaHt/AeNtrV0=</latexit>
extrapolative
prediction
e.g.) Deep learning usually comes with overparametrization
and the number of parameters can be several tens of millions
or even billions... (This is why model-complexity control or
regularization is so critical and essential in ML)
This can be often ultra high-dimensional in modern ML
26. Multilevel representations of chemical reactions
Brc1cncc(Br)c1 C[O-] CN(C)C=O Na+ COc1cncc(Br)c1SMILES
Structural Formla
Steric Structures
Electronic States
Reactants Reagents Products
As pattern languages (e.g. known facts in textbooks/databases)
As physical entities (e.g. quantum chemical calculations)
27. Computer-assisted synthetic planning
Corey+ 1972
J Am Chem Soc (JACS), 94(2), 1972.
440
Computer-Assisted Synthetic Analysis for Complex Molecules.
Methods and Procedures for Machine Generation of
Synthetic Intermediates
E. J. Corey,* Richard D. Cramer III, and W. Jeffrey Howe
Contribution from the Department of Chemistry, Harvard University,
Cambridge, Massachusetts 02138. Received January 30, 1971
Abstract: A classification of synthetic reactions is outlined which is suitable for use in a machine program to
generate a tree of synthetic intermediates starting from a given target molecule. The generation of a particular
intermediate by the program involves the search of appropriate data tables of synthetic processes, the search being
driven by the information obtained by machine perception of the parent structure and certain basic strategies.
Procedures have been developed for the evaluation of chemical interconversions which allow the effective exclusion
of invalid or naive structures. The paper provides a view of the status of computer-assisted synthetic problem
solving as of 1970.
The
communication of chemical structural informa-
tion to and from a digital computer by graphical
methods has been discussed in detail in a foregoing
paper,1 as has the machine representation and percep-
tion of key features within structures,2 as for example,
functional groups and rings. This paper is concerned
with the ways in which the structural information made
available by the perception process can be utilized to
generate a tree of chemical structures3 which represent
possible synthetic intermediates for the construction of
a complex target molecule. More specifically, the
following topics will be treated: (1) classification of
and necessary control strategies, and also for eventual
inclusion of a fairly complete collection of families.
In the discussion which follows, the degree of imple-
mentation of each area of study will be cited.
A variety of rational schemes for creating families of
synthetic reactions already exists. However, most of
these depend on properties of the reactants,5 and as
such they are irrelevant to a computer program which
analyzes the features of a target or product molecule
in order to generate appropriate starting materials.
One very general treatment of synthetic reactions in-
volves classification on the basis of the structural
Computer-Assisted Solution of Chemical Problems-
The Historical Development and the Present State of the Art
of a New Discipline of Chemistry
By Ivar Ugi," Johannes Bauer, Klemens Bley, Alf Dengler, Andreas Dietz,
Eric Fontain, Bernhard Gruber, Rainer Herges, Michael Knauer, Klaus Reitsam,
and Natalie Stein
Dedicated to Projkssor Karl-Heinz Biicliel
The topic of this article is the development and the present state of the art of computer
chemistry, the computer-assisted solution of chemical problems. Initially the problems in
computer chemistry were confined to structure elucidation on the basis of spectroscopic data,
then programs for synthesis design based on libraries of reaction data for relatively narrow
classes of target compounds were developed, and now computer programs for the solution of
a great variety of chemical problems are available or are under development. Previously it was
an achievement when any solution of a chemical problem could be generated by computer
assistance. Today, the main task is the efficient, transparent, and non-arbitrary selection of
meaningful results from the immense set of potential solutions--that also may contain innova-
tive proposals. Chemistry has two aspects, constitutional chemistry and stereochemistry,
which are interrelated, but still require different approaches. As a result, about twenty years
ago, an algebraic model of the logical structure of chemistry was presented that consisted of
two parts: the constitution-oriented algebra of be- and r-matrices, and the theory of the
Ugi+ 1993
Angew Chem Int Ed Engl. 32, 202-227, 1993.
A traditional chemoinformtics topic since 70s:
collect known chemical reactions, and search for reaction paths
e.g.) Representation, classification, exploration of chemical reactions
29. Purely ML-based approaches and more
ML-based chemical reaction predictions
Fermionic Neural Network
Pfau+ Ab-Initio Solution of the Many-Electron Schrödinger Equation with Deep Neural Networks.
arXiv:1909.02487, Sep 2019.
Hamiltonian Graph Networks with ODE Integrators
Sanchez-Gonzalez+ Hamiltonian Graph Networks with ODE Integrators.
arXiv:1909.12790, Sep 2019.
Both from
ML + First-principle simulations
3N-MCTS/AlphaChem
Segler+ Nature 2018
Molecular Transformer
Schwaller+ ACS Cent Sci
2019
seq2seq
Liu+ ACS Cent Sci 2017
WLDN
Jin+ NeurIPS 2017
ELECTRO
Bradshaw+ICLR 2019
WLN
Coley+ Chem Sci 2019
GPTN
Do+ KDD 2019
Graph Neural Networks Sequence Neural Networks Combined or Other
IBM RXN
Schwaller+ Chem Sci 2018
Molecule Chef
Bradshaw+ DeepGenStruct
(ICLR WS) 2019
Neural-Symbolic ML
Segler+ Chemistry 2017
Similarity-based
Coley+ ACS Cent Sci 2017
GLN
Dai+ NeurIPS2019
Transformer
Karpov+ ICANN 2019
30. Similar technologies go for biology!
DeepMind's AlphaFold paper
A successful example from MIT MLPDS
31. How to integrate data-driven and theory-driven?
• Theory-driven: first-principles simulations, logical reasoning,
mathematical models, etc.
• Data-driven: machine learning
Still emerging topics,
but examples of already available approaches are
1. Wrapper: Use ML to control or plan simulations
Basically we solve the problem by simulations, but use ML models to
ask "what if?" questions to guide what simulations to run.
(Model-based optimization, Sequential design of experiments,
Reinforcement learning, Generative methods, etc)
2. Hybrid: Use ML as approximator for unsure parts of simulations
Plugging ML models into the unsure part of simulations or
calling simulations when ML predictions are less confident
(Data assimilation, domain adaptation, semi-empirical methods, etc)
32. Back to our example: Heterogeneous Catalysis
Wolfgang Pauli
“God made the bulk;
the surface was invented by the devil.”
adsorption
diffusion
desorption
dissociation
recombination
kinks
terraces
adatom
vacancysteps
Gas-Phase
(Reactants)
Sold-Phase
(Catalysts)
Many hard-to-quantify intertwined factors involved.
Too complicated (impossible?) to model everything...
• multiple elementary
reaction processes
• composition, support,
surface termination,
particle size, particle
morphology, atomic
coordination environment
• reaction conditions
Notoriously complex surface reactions between different phases.
33. Our ML-based case studies
1. Can we predict the d-band center?
2. Can we predict the adsorption energy?
3. Can we predict the catalytic activity?
predicting DFT-calculated values by machine learning
(Takigawa et al, RSC Advances, 2016)
predicting DFT-calculated values by machine learning
(Toyao et al, JPCC, 2018)
predicting values from experiments reported in the
literature by machine learning
(Suzuki et al, ChemCatChem, 2019)
34. • Problem: Very strong "selection bias" in existing datasets
Catalyst research has relied heavily on prior published data,
strongly biased toward catalyst composition that were successful
One of big problems we had
Example) Oxidative coupling of methane (OCM)
• 1868 catalysts in the original dataset [Zavyalova+ 2011]
• Composed of 68 different elements: 61 cations and 7 anions
(Cl, F, Br, B, S, C, and P) excluding oxygen
• Occurrences of only a few elements such as La, Ba, Sr, Cl,
Mn, and F are very high.
• Widely used elements such as Li, Mg, Na, Ca, and La also
frequent in the data
35. An ML model is just representative of the training data
Highly Inaccurate Model Predictions
from Extrapolation (Lohninger 1999)
"Beware of the perils of extrapolation,
and understand that ML algorithms
build models that are representative of
the available training samples."
Simply focusing on targets predicted as high-performance ones
by ML built on currently avialable data is clearly not a good idea...
Like too much scrutinizing any so-so candidates at the moment
that we stumbled upon at the very early stage of research...
36. An ML model is just representative of the training data
Highly Inaccurate Model Predictions
from Extrapolation (Lohninger 1999)
"Beware of the perils of extrapolation,
and understand that ML algorithms
build models that are representative of
the available training samples."
Simply focusing on targets predicted as high-performance ones
by ML built on currently avialable data is clearly not a good idea...
Like too much scrutinizing any so-so candidates at the moment
that we stumbled upon at the very early stage of research...
In reality, distinction between
interpolation and extrapolation
are not that clear due to the
high-dimensionality.
37. No guarantee of data-driven for the outside of given data
Cause-and-Effect
Relationship
Related factors
(and their states)
Outcome
Reactions
Some
mechanism
[Inputs] [Outputs]
Data-driven methods try to precisely approximate its outer behavior
(the input-output relationship) observable as "data".
(e.g. through machine learning from a large collection of data)
Keep in mind: Given data DEFINES the data-driven prediction!
"Current machine learning techniques are data-hungry and brittle
—they can only make sense of patterns they've seen before."
(Chollet, 2020)
38. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
what we already know and get something close to what we expect
• Exploration:
something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
ML prediction
(function fitting)
amount
we want to
maximize
max(best)
for now
39. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
what we already know and get something close to what we expect
• Exploration:
something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
ML prediction
(function fitting)
prediction variance
(uncertainty)
amount
we want to
maximize
max(best)
for now
40. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
what we already know and get something close to what we expect
• Exploration:
something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
ML prediction
(function fitting)
prediction variance
(uncertainty)
amount
we want to
maximize
max(best)
for now
very confident
(we have data)
not confident
(no data around)
41. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
what we already know and get something close to what we expect
• Exploration:
something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
e.g.
"expected
improvement"
ML prediction
(function fitting)
prediction variance
(uncertainty)
amount
we want to
maximize
max(best)
for now
42. Our Solution: Model-based optimization
Use ML to guide the balance between "exploitation" and "exploration"!
• Exploitation:
what we already know and get something close to what we expect
• Exploration:
something we aren't sure about and possibly learn more
Either through experiments or simulations,
we face a dilemma in choosing between options to learn new things.
e.g.
"expected
improvement"
ML prediction
(function fitting)
prediction variance
(uncertainty)
probing area with no-data or scarse data relying on available data
amount
we want to
maximize
max(best)
for now
43. AlphaGo
(Nature, Jan 2016)
AlphaGo Zero
(Nature, Oct 2017)
AlphaZero
(Science, Dec 2018)
• Algorithm Configuration
• Hyperparameter Optimization (HPO)
• Neural Architecture Search (NAS)
• Meta Learning / Learning to Learn Amazon
SageMaker
MuZero
(arXiv, Nov 2019)
Key: model-based and exploitation-exploration balance
Model-based reinforcement learning
AutoML (Use ML for tuning ML)
44. Facts from experiments and calculations
Rationalize & Accelerate Chemical Design and Discovery
In-House data + Public data + Knowledge base
(and their quality control & annotations)
Hypothesis generation Validation
(Experiments and/or Simulations)(Machine learning, Data mining)
• Planning what to test in the next
experiments or simulations
• Surrogates to expensive or time-
consuming experiments or simulations
• Optimize uncertain factors or conditions
• Multilevel information fusion
• Highly Reproducible experiments with
high accuracies and speeds
• Acceleration with ML-based surrogates
for time-consuming subproblems
• Simulating many 'what-if' situations
Key: Effective use of data with a help with data-driven techniques
46. Next expanded to materials science
Little human intervention for highly
reproducible large-scale production lines
Automation, monitoring with IoT, and big-data
management are also the key to manufacturing.
Now these focuses shifted to the R & D phases.
(very experimental and empirical traditionally)
47. Next expanded to materials science
Little human intervention for highly
reproducible large-scale production lines
Automation, monitoring with IoT, and big-data
management are also the key to manufacturing.
Now these focuses shifted to the R & D phases.
(very experimental and empirical traditionally)
49. Summary
• Model-based learning, Neuro-symbolic or ML-GOFAI integration.
We can partly use explicit models for well-understood parts for sample
efficiency and for filling the gap between correlation and causation
• Needs for modeling unverbalizable comon sense of domain experts
We need a good strategy for building a gigantic data collection for this,
as well as self-supervised learning and/or meta learning algorithms.
• Needs for novel techniques for compositionality (combinatorial
generalization), out-of-distribution prediction (extrapolation), and
their flexible transfer
We need to somehow combine and flexibly transfer partial knowledge
to generate a new thing or deal with completely new situations.
"Self-supervised is training a model to fill in the blanks. This is what is going to allow our AI
systems to go to the next level. Some kind of common sense will emerge." (Yann LeCun)
The current "end-to-end" or "fully data-driven" strategy of ML is too
data-hungry. But in many cases, we cannot have enough data for
various practical restrictions (cost, time, ethics, privacy, etc).
50. Reviewer: 1
I don't usually recommend that papers should be accepted "as is", but in this case I don't see
the need for changes. This review should be accepted and published in ACS Catalysis. ... I will
certainly recommend it to my group and my students when it is published.
Reviewer: 2
The manuscript gives an excellent over the field of machine learning especially with regard to
heterogeneous catalysis and I would highly recommend the article for the publication in ACS
Catalysis.
Reviewer: 3
This is one of the best reviews for catalyst informatics that the Reviewer has read. In particular,
the chapter 2 delivers a very good tutorial, which is concisely and professionally written.
Chapter 2 is the general user's guide of ML for natural sciences.
ACS Catalysis, 2020; 10: 2260-2297.