The PubChemQC Project
A big data construction by first-
principles calculations of molecules
中田真秀 (NAKATA Maho)
ACCC RIKEN
2016/2/17 15:50-16:40
Kobe workshop for material design on
strongly correlated electrons in molecules
and materials
http://www.aics.riken.jp/labs/cms/workshop/201602/index.html
Background
• All matter is composed of atoms and molecules.
• A dream of theoretical chemists: do chemistry without
experiments!
• On computers
• Chemical space is really huge!
– The number of drug candidates: 10^60
(http://onlinelibrary.wiley.com/doi/10.1002/wcms.1104/abstract)
• Cf. exa: 10^18
– A combinatorics problem
– Adding chemical reactions: 10^120
Why has 2-RDM theory been
suspended?
• Is there a shortcut for solving the Schrödinger equation?
– Density functional theory, reduced density matrix theory
• Using 2-particle reduced density matrices, we can
reduce the number of variables drastically.
– Journal of Chemical Physics, 114, 8282-8292 (2001).
Introduction of semidefinite programming
– Computational and Theoretical Chemistry Volume 1003, 1
January 2013, Pages 22-7. Application to the 2D Hubbard model
– Journal of Chemical Physics 128, 16, 164113 (2008). Various
molecules
• However, it is neither size-consistent nor size-extensive.
– Phys. Chem. Chem. Phys., 2009, 11, 5558-5560
– AIP Advances 2, 032125 (2012)
– Physical Review A 80, 042109 (2009)
Fundamental question about solving SCE…
• Can this problem be solved efficiently?
– Very likely NO!
– Example: a spin-glass Hamiltonian is very hard to
solve; this is as hard as solving the Traveling
Salesperson Problem
– Algorithms that make no
assumptions about the 2-particle
interaction are never efficient.
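To make the cost concrete, here is a minimal sketch (not from the talk) of exhaustive ground-state search for a small Ising-type spin-glass Hamiltonian, E(s) = -sum_{i<j} J_ij s_i s_j. The function name and the dictionary layout of the couplings are assumptions chosen for illustration. The loop visits all 2^n spin configurations, so it is only usable for very small n.

```python
from itertools import product


def brute_force_ground_state(couplings, n_spins):
    """Minimize E(s) = -sum_{i<j} J_ij * s_i * s_j over s_i in {-1, +1}.

    couplings: dict mapping (i, j) with i < j to the coupling J_ij
    (an illustrative layout, not taken from the talk).
    The search enumerates 2**n_spins configurations.
    """
    best_energy = None
    best_spins = None
    for spins in product((-1, 1), repeat=n_spins):
        energy = -sum(j_ij * spins[i] * spins[j]
                      for (i, j), j_ij in couplings.items())
        if best_energy is None or energy < best_energy:
            best_energy, best_spins = energy, spins
    return best_energy, best_spins


# Three spins with antiferromagnetic couplings: a frustrated triangle,
# where no configuration can satisfy all three bonds at once.
triangle = {(0, 1): -1.0, (0, 2): -1.0, (1, 2): -1.0}
energy, spins = brute_force_ground_state(triangle, 3)
print(energy, spins)
```

Each extra spin doubles the number of configurations, which is why exhaustive search stops being practical very quickly.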
Fundamental question about solving SCE…
Results from computational complexity theory
• The N-representability problem is QMA-hard
– Liu, Y.-K., Christandl, M. & Verstraete, F. Quantum computational complexity of the N-representability problem: QMA complete. Phys. Rev.
Lett. 98, 110503 (2007).
• Solving a 2-local Hamiltonian is also QMA-hard
– The Complexity of the Local Hamiltonian Problem
– SIAM J. Comput., 35(5), 1070–1097. http://epubs.siam.org/doi/abs/10.1137/S0097539704445226
• Finding the ground-state energy of the Hubbard model
in an external magnetic field is still QMA-hard
– http://www.nature.com/nphys/journal/v5/n10/abs/nphys1370.html
• Good review: Computational Complexity in Electronic
Structure
– http://arxiv.org/abs/1208.3334
Fundamental question about solving SCE…
• What I have learned
– There is no algorithm that solves a general 2-particle Hamiltonian
efficiently.
– There is (maybe) no algorithm that solves the electronic Hamiltonian
efficiently.
– Introducing additional conditions on the 2-particle interaction
is mandatory.
Heuristics are much more important than
searching for a subtle shortcut.
Current status of computational
chemistry
• Relatively good agreement with experiments.
• Can explain chemical phenomena
– Many good quantum chemistry programs are
available!
– The “DFT B3LYP 6-31G*” calculation is the gold
standard!
• We want to lead chemistry
– We usually explain what has already happened.
– We rarely predict something very exciting!
Difference between experiment and
calculation/theory
• Finding an interesting phenomenon or problem
– How do we convert CO2 to O2? N2 + H2 to NH3?
– How do we synthesize a compound from known ingredients?
• Design a key chemical reaction.
• Calculations
or
• Experiments
• Analysis of results
• Propose new experiments
Only one difference
Difference between experiment and
calculation/theory
• No difference as science
• The most important thing is chemical intuition!
• Can we implement chemical intuition on
computers?
– Yes, but there is apparently a long way to go.
– The basic strategy: collect data, feed it to computers,
and process it.
Can we implement chemical intuition
on computers?
• Collect facts by computer calculations.
– Many good implementations are available.
– Huge computer resources are required, but
– they are still growing exponentially
• Feed them to computers.
Can we implement chemical intuition
on computers?
• Feed them to computers.
• Machine Learning (ML)
– Very successful in
image/sound recognition and
natural language processing.
Organic chemistry is somewhat similar to language…
Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. and Grzybowski, B. A. (2014), Organic Chemistry as a Language and the
Implications of Chemical Linguistics for Structural and Retrosynthetic Analyses. Angew. Chem. Int. Ed., 53: 8108–8112.
doi:10.1002/anie.201403708
Recently, some research papers using ML have been published:
Big Data meets Quantum Chemistry Approximations: The Δ-Machine Learning
Approach Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von
Lilienfeld http://arxiv.org/abs/1503.04987 etc.
For better results from ML, we require a huge dataset
Can we implement chemical intuition
on computers?
The first step might be:
• Build a huge dataset with quantum chemistry
program packages!
– Results should agree with experiments.
– Improving the dataset is a task for QC researchers.
• Faster calculations for larger systems
• Better or sufficient treatment of electron correlation
• And build a search-engine database using the
results.
Googling molecules
gives you what you need
What is needed for Googling molecules?
1. Types, kinds, variety of molecules
– The number of molecules is infinite, but we can cover the important ones
2. Required properties of molecules
– Molecular structure, energy, UV excitation energy, dipole
moment
3. Getting properties of molecules by calculation?
– Accuracy of calculation, and computer resources…
4. Coding or encoding molecules
– IUPAC nomenclature is not suitable
– Do not think about graph theory
5. Fast calculation (with deep learning(?))
10^8 molecules/sec, as chemical space is huge.
Databases for lists of molecules
• PubChem: 50,000,000 molecules listed, made by NIH,
public domain, not curated (imported from catalogs,
etc.), obtainable via FTP.
• ChemSpider: 28,000,000 entries, better curated, no
FTP. Redistribution and download are restricted
• Web-GDB13: 900,000,000 entries, just generated by
combinatorics. No
• ZINC, ChEMBL, DrugBank …
• CAS: 70,000,000 molecules, proprietary
• Nikkaji: 6,000,000, proprietary
We use PubChem as our source of molecules
Ex. A molecule listed in PubChem
Database for molecular properties by
experiments
• We must do experiments to obtain
molecular properties.
– No free comprehensive database is known so far.
– Pharmaceutical companies do O(1,000,000)
experiments for high-throughput screening.
• Experiments are hugely expensive!
– Time-consuming, large facilities, costly, hazardous
We do not do experiments!
Database for molecular properties by computer
calculation
• Gold-standard method: “Density functional
theory (B3LYP functional) + 6-31G(d) basis set”
– Accuracy is quite satisfactory (1-10 kcal/mol) for
biological systems and organic chemistry.
– Good implementations are available.
– Costs less (fast, needs just a supercomputer, not hazardous)
– Calculation times keep getting shorter
• Intel Core i7 (esp. Sandy Bridge) is very fast.
• Still, we need huge resources.
We calculate by computer instead!
What is a molecule?
[Figure: ways to represent propionaldehyde (images from Wikipedia): wavefunction, 3D coordinates, structure, IUPAC nomenclature, common name. Labels: "hard to understand but rigorous" and "easy to understand but many corner cases".]
No rigorous definition for a molecule
What is a molecule?
• No rigorous definition for “what is a molecule”
• Nomenclature
– 3D coordinates of the nuclei
– Structural formula
– IUPAC nomenclature
– Higher abstraction or less abstraction?
• A better molecular encoding method?
– Easy for humans to understand
– Easy for computers to understand as well
– Can describe most cases, with few corner cases.
– A compromise between dream and reality
Encoding molecule: SMILES
SMILES is a good encoding method for molecules
IUPAC nomenclature
tert-butyl N-[(2S,3S,5S)-5-[[4-[(1-benzyltetrazol-5-yl)methoxy]phenyl]methyl]-3-hydroxy-6-[[(1S,2R)-2-hydroxy-2,3-dihydro-1H-inden-1-yl]amino]-6-oxo-1-phenylhexan-2-yl]carbamate
We can encode molecules
• SMILES
CN(C)CCOC12CCC(C3C1CCCC3)C4=CC=CC=C24
• InChI (made by IUPAC)
InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3
…
What is SMILES?
• Simplified Molecular Input Line Entry System
– A linear representation of a molecule using ASCII.
– Stereochemistry (configuration) can also be encoded
– Human-readable and also machine-readable.
– Almost one-to-one mapping between a molecule and
SMILES via Universal SMILES
• David Weininger at the USEPA Mid-Continent Ecology Division Laboratory invented SMILES
• InChI by IUPAC
– International Chemical Identifier: open standard (non-proprietary)
– N. M. O’Boyle invented “Universal SMILES” via InChI
Examples of SMILES
http://en.wikipedia.org/wiki/SMILES
Molecule | Structure | SMILES
Nitrogen molecule | N≡N | N#N
Copper sulfate | Cu2+ SO42- | [Cu+2].[O-]S(=O)(=O)[O-]
Oenanthotoxin | (image) | CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO
Vitamin B1 | (image) | OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2
Aflatoxin B1 | (image) | O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
Some corner cases
Two different SMILES for Ferrocene
• C12C3C4C5C1[Fe]23451234C5C1C2C3C45
• [CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]
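Both ferrocene strings above differ as text but contain the same heavy atoms. A minimal sketch, assuming a simplified tokenizer written only for this illustration (it is not a full SMILES parser and ignores bonds, charges, ring closures and implicit hydrogens), counts heavy atoms per element and shows that the two strings agree on composition:

```python
import re

# Bracket atoms, the two-letter organic-subset halogens, then one-letter
# organic-subset atoms (aliphatic and aromatic). Everything else is skipped.
_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside a bracket: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Return {element: count} for the non-hydrogen atoms written in a SMILES string."""
    counts = {}
    for token in _TOKEN.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_SYMBOL.match(token)
            if match is None:
                continue
            symbol = match.group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # aromatic 'c' counts as carbon 'C'
        if symbol != "H":
            counts[symbol] = counts.get(symbol, 0) + 1
    return counts


FERROCENE_A = "C12C3C4C5C1[Fe]23451234C5C1C2C3C45"
FERROCENE_B = "[CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]"
print(heavy_atom_counts(FERROCENE_A))
print(heavy_atom_counts(FERROCENE_B))
```

Matching composition does not prove two strings denote the same molecule. Deciding that requires a canonicalization step in a real cheminformatics toolkit.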
Now it's my turn
Construction of an ab initio chemical
database
• Molecular information comes from PubChem
• Properties are calculated from first principles by
computer
– Many program packages are available
– DFT (B3LYP)
– 6-31G(d) basis set and geometry optimization
– Excited-state calculation by TD-DFT with 6-31+G(d)
– Best for organic molecules and biomolecules
• Molecular encoding: SMILES / InChI
• Huge computer resources
• A dream come true
– A Google-like search engine for chemistry
The PubChemQC Project
• http://pubchemqc.riken.jp/
• AIP Conf. Proc. 1702, 090058 (2015);
http://dx.doi.org/10.1063/1.4938866
• A public domain database for molecules
• Ab initio (first-principles) calculation of molecular
properties of PubChem compounds
• 2014/1/15: 13,000 molecules
• 2014/7/29 : 155,792 molecules
• 2014/10/30 : 906,798 molecules
• 2014/12/3 : 1,137,286 molecules
• 2015/3/25 : 1,673,532 molecules
• 2015/5/27: 2,122,146 molecules
• 2016/2/10: 3,046,948 (2,660,218 with excited states)
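The snapshot counts above can be turned into a rough throughput figure. A minimal sketch, using only the dates and totals listed on this slide (the function names are illustrative):

```python
from datetime import date

# (snapshot date, total molecules calculated) as listed on the slide.
SNAPSHOTS = [
    (date(2014, 1, 15), 13_000),
    (date(2014, 7, 29), 155_792),
    (date(2014, 10, 30), 906_798),
    (date(2014, 12, 3), 1_137_286),
    (date(2015, 3, 25), 1_673_532),
    (date(2015, 5, 27), 2_122_146),
    (date(2016, 2, 10), 3_046_948),
]


def interval_rates(snapshots):
    """Molecules added per day between each pair of consecutive snapshots."""
    rates = []
    for (d0, n0), (d1, n1) in zip(snapshots, snapshots[1:]):
        rates.append((d1, (n1 - n0) / (d1 - d0).days))
    return rates


def overall_rate(snapshots):
    """Average molecules added per day from the first to the last snapshot."""
    (d0, n0), (d1, n1) = snapshots[0], snapshots[-1]
    return (n1 - n0) / (d1 - d0).days


for end_date, rate in interval_rates(SNAPSHOTS):
    print(end_date, round(rate))
print("overall:", round(overall_rate(SNAPSHOTS)))
```

The per-interval rates vary a lot, which is consistent with the shared-cluster usage described later in the talk.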
The PubChemQC project
http://pubchemqc.riken.jp/
WIP: no search engine, just data
Related works
– Raghunathan Ramakrishnan, Pavlo Dral, Matthias Rupp, O.
Anatole von Lilienfeld: Quantum Chemistry Structures and
Properties of 134 kilo Molecules, Scientific Data, 1: 140022,
Nature Publishing Group, 2014.
– NIST Web Book
• http://webbook.nist.gov/chemistry/
• Small number of molecules. Compares many methods
– Harvard Clean Energy Project
• http://cleanenergy.molecularspace.org/
• 25,000,000 (?) molecules for photo devices, generated by combinatorics
– Sugimoto et al.: 2013 CBI symposium poster
• Almost the same as our database, currently not open to the
public (now??)
Our contribution: 20 times larger
How do we do it?
• Generate the initial 3D conformation with OpenBABEL
– The SDF contains a 3D conformation, but we don’t use it.
– OpenBABEL -h (add hydrogens) --gen3d (generate 3D
coordinates)
• Ab initio calculation with GAMESS + Firefly
– Using Gaussian can lead to a political problem(?)
– PM3 optimization
– Hartree-Fock/STO-6G geometry optimization
– Firefly + GAMESS geometry optimization at B3LYP/6-31G*
– Ten excitation energies by TDDFT/6-31+G* (no geometry
optimization)
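The first step above (SMILES in, hydrogens added, 3D coordinates out) can be sketched as a call to the Open Babel command-line tool. This is an assumption about how such a step might be scripted, not the project's actual pipeline: the helper name and output file name are hypothetical, and the slide names only the `-h` and `--gen3d` options.

```python
import shlex


def obabel_gen3d_command(smiles, out_path):
    """Build an argument list for the Open Babel CLI.

    -:<smiles>  read the molecule from a SMILES string
    -O <file>   write the output file (format taken from the extension)
    -h          add hydrogens
    --gen3d     generate 3D coordinates
    """
    return ["obabel", f"-:{smiles}", "-O", out_path, "-h", "--gen3d"]


# Propionaldehyde, the example molecule used earlier in the talk.
cmd = obabel_gen3d_command("CCC=O", "propionaldehyde.mol")
print(shlex.join(cmd))

# To actually run it (requires Open Babel to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

The resulting geometry would then serve as the starting point for the PM3, Hartree-Fock and B3LYP optimizations listed on the slide.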
How do we do it?
• Heavy use of OpenBABEL
• Extracting molecular information
– Sort PubChem compounds by molecular weight
– OpenBABEL
• Encoded as SMILES
– Isomeric SMILES: stereochemical (3D) information retained
– OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
– CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO
– CC(=O)OCCC(/C)=CC[C@H](C(C)=C)CCC=C
How to convert a PubChem compound
into a quantum chemistry calculation
Aflatoxin
O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
Ab initio calculation by
OpenBABEL
Final results will be
• Uploaded to http://pubchemqc.riken.jp/
• Currently we upload
– input file (ground / excited state)
– Output file (ground / excited state)
– Final geometry in Mol file
Scaling of computation
• Embarrassingly parallel across molecules
• Very roughly speaking, the time required for a
calculation scales like N^4
– N: molecular weight
• The problems are very hard in the worst case (complexity theory)
– Hartree-Fock calculation
– DFT (B3LYP) calculation
– Geometry optimization
• In practice, many molecules can be solved
efficiently
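As a worked example of the rough N^4 rule above, the sketch below compares the expected cost of a molecule with one of molecular weight 160. The reference weight and the function name are illustrative choices, and the exponent is only the talk's rule of thumb, not a measured fit.

```python
def relative_cost(mw, reference_mw=160.0, exponent=4):
    """Estimated cost of a molecule of weight `mw` relative to `reference_mw`,
    assuming time scales like (molecular weight) ** exponent."""
    return (mw / reference_mw) ** exponent


for mw in (160, 200, 300, 500):
    print(mw, round(relative_cost(mw), 1))
```

Under this rule, doubling the molecular weight multiplies the cost by 16.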
Computer Resources
• RICC: Intel Xeon 5570 (Westmere, 2.93 GHz, 8
cores/node) x 1000
– 1000-10000 molecules/day (MW 160)
– Heavily depends on other users' load
– Time limit: 8 hours
• Quest: Intel Core2 Duo (1.6 GHz/node) x 700
– 3000-8000 molecules/day (MW 160)
– 100-1000 molecules/day (MW 200-300)
– Time limit: 20 hours
• Compounds whose calculations fail are ignored for
now.
Computer Resources
• Storage
– Approx. 500GB for 1,000,000 molecules (xz
compressed)
– Approx. 20 TB for 40,000,000 molecules (xz
compressed)
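The two storage figures above are consistent with a constant per-molecule size. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and illustrative function names:

```python
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12


def bytes_per_molecule(total_gb=500, n_molecules=1_000_000):
    """Average compressed size per molecule implied by the first figure."""
    return total_gb * BYTES_PER_GB / n_molecules


def projected_storage_tb(n_molecules, per_molecule_bytes):
    """Storage needed for `n_molecules` at a fixed per-molecule size."""
    return n_molecules * per_molecule_bytes / BYTES_PER_TB


per_molecule = bytes_per_molecule()
print(per_molecule)                                    # bytes per molecule
print(projected_storage_tb(40_000_000, per_molecule))  # TB for 40 million
```

This works out to about 0.5 MB of xz-compressed data per molecule.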
Molecular weight and Lipinski's rule
• Lipinski’s rule of five (Pfizer's rule of five): a rule of
thumb for drug discovery
• No more than 5 hydrogen bond donors
• No more than 10 hydrogen bond acceptors
• A molecular mass less than 500 daltons
• An octanol-water partition coefficient log P not greater than 5
• A molecular weight below 500 is also
very convenient for computational chemistry
– For routine calculations without experimental data
other than the molecular formula
– Above 500, secondary or higher-order structure
becomes important, e.g., in proteins
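The four criteria above are simple threshold checks. A minimal sketch, assuming the descriptor values have already been computed elsewhere (the `Descriptors` class, the function name, and the example values are hypothetical, not part of the project):

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient


def lipinski_violations(d):
    """Return the list of rule-of-five criteria that `d` violates."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if d.molecular_weight >= 500:
        violations.append("molecular mass not less than 500 daltons")
    if d.log_p > 5:
        violations.append("log P greater than 5")
    return violations


# Made-up descriptor values, for illustration only.
small = Descriptors(h_bond_donors=1, h_bond_acceptors=3,
                    molecular_weight=180.0, log_p=1.2)
large = Descriptors(h_bond_donors=7, h_bond_acceptors=12,
                    molecular_weight=812.0, log_p=5.6)
print(lipinski_violations(small))
print(lipinski_violations(large))
```

For this project only the molecular-weight criterion matters, since it is used as the cut-off for which PubChem compounds to calculate.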
Molecular weight distribution in
PubChem
[Figure: molecular weight distribution of PubChem compounds. Annotations: "We are still here"; "Lipinski limit MW=500"; "30,000,000 molecules (excluding mixtures)".]
How long will it take to finish?
• For drug design, we need to calculate all
molecules with MW < 500
• Total: 30,000,000 molecules
– This number may increase in the future
• Current (2014/12/4): 1,100,000 molecules
– Only about 4%
• 10,000 molecules/day -> 8.2 years
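The 8.2-year figure follows directly from the totals on this slide. A minimal sketch of the arithmetic (the function name is illustrative; the throughput is the slide's assumed 10,000 molecules/day):

```python
def years_to_finish(total, per_day, done=0):
    """Years needed to calculate the remaining molecules at a constant daily rate."""
    return (total - done) / per_day / 365.25


# All 30,000,000 molecules at 10,000 per day, as on the slide.
print(round(years_to_finish(30_000_000, 10_000), 1))
# Same rate, counting only the molecules not yet calculated on 2014/12/4.
print(round(years_to_finish(30_000_000, 10_000, done=1_100_000), 1))
```

The estimate is linear in the daily rate, so any speed-up in hardware or methods shortens it proportionally.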
How long will it take to finish?
• 10+ years? No, maybe far less.
• 25 years ago (1990), computers were very slow
– Even ab initio calculations were very difficult on a
486DX@25MHz or
68000@10MHz
Outlook, prospect, hope…
• Far better in silico screening
– Fewer or no experiments are necessary
• Even faster calculation using machine learning
– 10,000 molecules/second?
– Requires a huge dataset to learn from.
– Bio or organic molecules are easy to calculate.
– Already available: Raghunathan Ramakrishnan
https://scholar.google.co.jp/citations?user=jSCGozoAAAAJ&hl=ja&oi=sra
• Database for chemical reactions
– Precise calculation is required
– GRRM method + machine learning (?)
• Geometry optimization for proteins (PDB)
– Only X-ray crystal structures are available
http://pubchemqc.riken.jp/
Difficulties in this project
• Parameters needed for calculations vary from molecule
to molecule
• Properties can differ depending on the initial guess
• Computer resources
– Raspberry Pi? NVIDIA Jetson? BOINC?
• Molecular encoding never ends
– Neither SMILES nor InChI is complete
– Some corner cases may be chemically interesting.
