The PubChemQC Project 
A big data construction by first-principles 
calculations of molecules 
中田真秀(NAKATA Maho) 
ACCC RIKEN 
maho@riken.jp 
2014/12/3 10:35-11:05 
JST CREST International Symposium on Post 
Petascale System Software
Background 
• All matter is composed of atoms and molecules. 
• The dream of theoretical chemists: do chemistry 
without experiments! 
• On computers  
• We treat big data in chemistry! 
– Chemical space is really huge! 
• The number of candidates for drugs: 10^60 
(http://onlinelibrary.wiley.com/doi/10.1002/wcms.1104/abstract) 
• Cf. Exa: 10^18
Current status of computational 
chemistry 
• Relatively good agreements with experiments. 
• Can explain nature in many cases. 
– Many good quantum chemistry programs are 
available! 
– “DFT B3LYP 6-31G*” calculations rule! 
• We want to lead chemistry 
– So far, we only explain what has already happened.
Difference between experiment and 
calculation/theory 
• Finding interesting phenomena or problems 
– How do we convert CO2 to O2? N2 + H2 to NH3? 
– How to synthesize a compound? 
• Design a key chemical reaction. 
• Calculations 
• Experiments 
– Analyze 
• Analysis of results 
• Propose new experiments 
Only One Difference
Difference between experiment and 
calculation/theory 
• No difference as science 
• Most important thing is curiosity! 
New insights from 
big data and 
my sensitivity! 
Unfortunately, there is not much easy-to-use 
big data for chemistry
Googling molecules 
+ 
Recommends molecules to you!
What is needed for Googling molecules? 
1. Types, kinds, variety of molecules 
– The number of molecules is infinite, but we can cover the important ones 
2. Required properties of molecules 
– Molecular structure, energy, UV excitation energy, 
dipole moment 
3. Getting properties of molecules by calculation? 
– Accuracy of calculation, and computer resources… 
4. Coding or encoding molecules 
– IUPAC nomenclature is not suitable 
– Do not think about graph theory
Databases for lists of molecules 
• PubChem: 50,000,000 molecules listed, made by NIH, 
public domain, not curated (imported from catalogs, 
etc.), available via FTP. 
• ChemSpider: 28,000,000 entries, better curated, no 
FTP. Redistribution and download are restricted. 
• Web-GDB13: 900,000,000 entries, just generated by 
combinatorics. No 
• ZINC, ChEMBL, DrugBank … 
• CAS: 70,000,000 molecules, proprietary 
• Nikkaji: 6,000,000, proprietary 
We use PubChem as our source of molecules
The PubChem
Ex. A molecule listed in PubChem
Database for molecular properties by 
experiments 
• We must do experiments to obtain 
molecular properties. 
– No free comprehensive database is known so far. 
– Pharmaceutical companies do O(1,000,000) 
experiments for high-throughput screening. 
• Experiments are hugely expensive! 
– Time-consuming, need large facilities, costly, hazardous 
We do not do experiments!
Database for molecular properties by computer 
calculation 
• Gold-standard method: “Density functional 
theory (B3LYP functional) + 6-31G(d) basis set” 
– Accuracy is quite satisfactory (1-10 kcal/mol) for 
biological systems and organic chemistry. 
– Good implementations are available. 
– Costs less (fast, needs only a supercomputer, no hazards) 
– Calculation times keep getting shorter 
• Intel Core i7 (esp. Sandy Bridge) is very fast. 
• Still, we need huge resources. 
We calculate by computer instead!
What is a molecule? 
No rigorous definition for a molecule 
[Figure: ways to describe propionaldehyde: wavefunction, 3D coordinates, 
structure, IUPAC nomenclature, common name. They range from "hard to 
understand but rigorous" to "easy to understand but many corner cases". 
From Wikipedia.]
What is a molecule? 
• No rigorous definition for “what is a molecule” 
• nomenclature 
– 3D coordinates of the nuclei 
– Structural formula 
– IUPAC nomenclature 
– Higher abstraction or less abstraction? 
• Better molecular encoding method? 
– Easy for humans to understand 
– Easy for computers to understand as well 
– Can describe most cases, with few corner cases. 
– Compromise between dream and reality
Encoding molecule : SMILES 
Encoding molecule 
IUPAC nomenclature 
tert-butyl N-[(2S,3S,5S)-5-[[4-[(1-benzyltetrazol-5-yl)methoxy]phenyl]methyl]-3-hydroxy-6-[[(1S,2R)-2-hydroxy-2,3-dihydro-1H-inden-1-yl]amino]-6-oxo-1-phenylhexan-2-yl]carbamate 
We can encode molecules 
• SMILES 
CN(C)CCOC12CCC(C3C1CCCC3)C4=CC=CC=C24 
• InChI (made by IUPAC) 
InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3 
… 
SMILES is a good encoding method for molecules
What is SMILES? 
• Simplified Molecular Input Line Entry System 
– A linear representation of a molecule using ASCII. 
– Stereochemistry (configuration) is also encoded 
– Human readable, and also machine readable. 
– Almost one-to-one mapping between a molecule and 
SMILES via universal SMILES 
• David Weininger at USEPA Mid-Continent Ecology Division Laboratory invented SMILES 
• InChI by IUPAC 
– International Chemical Identifier : open standard (non proprietary) 
– NM O’Boyle invented “Universal SMILES” via InChI
Example by SMILES 
http://en.wikipedia.org/wiki/SMILES 
Molecule (structure): SMILES 
• Nitrogen molecule (N≡N): N#N 
• Copper sulfate (Cu^2+ SO4^2-): [Cu+2].[O-]S(=O)(=O)[O-] 
• Oenanthotoxin: CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO 
• Vitamin B1: OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2 
• Aflatoxin B1: O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
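To make the "machine readable" claim concrete, here is a minimal Python sketch that splits a SMILES string into tokens (atoms, bonds, ring-closure labels, branches) with one regular expression. It is an illustration only: the pattern covers a simplified subset of SMILES, not the full grammar, and the names `SMILES_TOKEN`, `tokenize_smiles` and `count_atoms` are made up for this example and are not part of PubChemQC or OpenBABEL.

```python
import re
from typing import List

# Simplified token pattern (an assumption for illustration, NOT the full SMILES grammar).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [Cu+2] or [C@@H]
    r"|Br|Cl"          # two-letter atoms that may appear without brackets
    r"|[BCNOPSFI]"     # one-letter atoms that may appear without brackets
    r"|[bcnops]"       # aromatic atoms
    r"|%\d{2}|\d"      # ring-closure labels
    r"|[-=#$:/\\]"     # bond symbols
    r"|[().]"          # branch parentheses and the disconnection dot
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens; raise if any character is not recognized."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def count_atoms(smiles: str) -> int:
    """Count atom tokens (explicitly written atoms only; implicit hydrogens are not counted)."""
    return sum(1 for t in tokenize_smiles(smiles) if t[0] == "[" or t[0].isalpha())


if __name__ == "__main__":
    for s in ["N#N", "[Cu+2].[O-]S(=O)(=O)[O-]"]:
        print(s, tokenize_smiles(s), count_atoms(s))
```

A real project would rely on a cheminformatics toolkit for parsing; the point here is only that a SMILES string is plain ASCII that a few lines of code can already take apart.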
Some corner cases 
Two different SMILES for Ferrocene 
• C12C3C4C5C1[Fe]23451234C5C1C2C3C45 
• [CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]
Now it's my turn
Construction of ab initio chemical 
database 
• Molecular information is from PubChem 
• Properties are calculated from first principles on 
computers 
– Many program packages are available 
– DFT (B3LYP) 
– 6-31G(d) basis set and geometry optimization 
– Excited-state calculation by TD-DFT/6-31+G(d) 
– Best for organic molecules or bio molecules 
• Molecular encoding : SMILES / InChI 
• Huge computer resources 
• Dream come true 
– Google like search engine for chemistry
The PubChemQC Project 
• http://pubchemqc.riken.jp/ 
• An open database for molecules 
– Public domain 
• Ab initio (first-principles) calculation of 
molecular properties of the molecules in PubChem 
• 2014/1/15: 13,000 molecules 
• 2014/7/29 : 155,792 molecules 
• 2014/10/30 : 906,798 molecules 
• 2014/12/3 : 1,137,286 molecules
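The milestones above imply an average throughput between consecutive dates. The following Python sketch computes it from the four (date, count) pairs on this slide; the function name `average_rates` is an assumption made for this example, and the result is only an average over each interval, not a measured daily rate.

```python
from datetime import date

# Milestones exactly as listed on the slide: (date, molecules calculated).
MILESTONES = [
    (date(2014, 1, 15), 13_000),
    (date(2014, 7, 29), 155_792),
    (date(2014, 10, 30), 906_798),
    (date(2014, 12, 3), 1_137_286),
]


def average_rates(milestones):
    """Return (days, molecules_per_day) for each pair of consecutive milestones."""
    rates = []
    for (d0, n0), (d1, n1) in zip(milestones, milestones[1:]):
        days = (d1 - d0).days
        rates.append((days, (n1 - n0) / days))
    return rates


if __name__ == "__main__":
    for days, rate in average_rates(MILESTONES):
        print(f"{days:4d} days: about {rate:,.0f} molecules/day")
```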
The PubChemQC project 
http://pubchemqc.riken.jp/ 
WIP: no search engine, just data
PubChemQC 
http://pubchemqc.riken.jp/
Related works 
• Related works 
– NIST Web Book 
• http://webbook.nist.gov/chemistry/ 
• Small number of molecules; compares many methods 
– Harvard Clean Energy Project 
• http://cleanenergy.molecularspace.org/ 
• 25,000,000 (?) molecules for photo devices, generated by 
combinatorics 
– Sugimoto et al.: 2013 CBI symposium poster 
• Almost the same as our database; currently not open to the 
public (now??)
How do we do it? 
• Generate the initial 3D conformation with OpenBABEL 
– The SDF contains a 3D conformation, but we do not use it. 
– OpenBABEL -h (add hydrogens) --gen3d (generate 3D 
coordinates) 
• Ab initio calculation with GAMESS + Firefly 
– Using Gaussian can lead to a political problem(?) 
– PM3 optimization 
– Hartree-Fock/STO-6G geometry optimization 
– Firefly + GAMESS geometry optimization at B3LYP/6-31G* 
– Ten excitation energies by TDDFT/6-31+G* (no geometry 
optimization)
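The steps above can be sketched in Python as follows. The slides do not give the exact command line, so this is an assumption: it builds (but does not run) an `obabel` invocation using the `-h` and `--gen3d` options named on the slide, plus the standard `-:` (inline SMILES), `-o` (output format) and `-O` (output file) options. The helper name `obabel_gen3d_command` and the `STAGES` list are hypothetical; no GAMESS or Firefly input is generated.

```python
import shlex

# Calculation stages in the order listed on the slide (labels only, nothing is executed).
STAGES = [
    "PM3 optimization",
    "Hartree-Fock/STO-6G geometry optimization",
    "B3LYP/6-31G* geometry optimization",
    "TDDFT/6-31+G* excitation energies (ten states, no geometry optimization)",
]


def obabel_gen3d_command(smiles, out_path):
    """Build an Open Babel command that would write a 3D MOL file for a SMILES string."""
    return [
        "obabel",
        f"-:{smiles}",     # read the molecule from an inline SMILES string
        "-omol",           # write MDL MOL format
        f"-O{out_path}",   # output file name
        "-h",              # add hydrogens
        "--gen3d",         # generate 3D coordinates
    ]


if __name__ == "__main__":
    cmd = obabel_gen3d_command("N#N", "n2.mol")
    print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
    for number, stage in enumerate(STAGES, start=1):
        print(f"stage {number}: {stage}")
```

Returning the command as a list keeps the SMILES string safe from shell interpretation, which matters because SMILES uses characters such as `(`, `#` and `\`.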
How do we do it? 
• Heavily using OpenBABEL 
• Extracting molecular information 
– Sort PubChem compounds by molecular weight 
– OpenBABEL 
• Encoded by SMILES 
– Isomeric SMILES: stereochemistry is retained 
– OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1 
– CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO 
– CC(=O)OCCC(/C)=CC[C@H](C(C)=C)CCC=C
Our route from a PubChem compound to a 
quantum chemistry calculation 
aflatoxin 
O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5 
Ab initio calculation by 
OpenBABEL
Final results will be 
• Uploaded to http://pubchemqc.riken.jp/ 
• Currently we upload 
– input file (ground / excited state) 
– Output file (ground / excited state) 
– Final geometry in Mol file
Scaling of computation 
• Embarrassingly parallel for each molecule 
• Very roughly speaking, required time for 
calculation scales like N^4 
– N : molecular weight 
• Problems are very hard (complexity theory) 
– Hartree-Fock calculation 
– DFT (b3lyp) calculation 
– geometry optimization 
• In practice, many molecules can be solved 
efficiently
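As a rough illustration of the N^4 rule of thumb above, the Python sketch below estimates how much more expensive a molecule is relative to a reference molecular weight. The reference value of 160 is borrowed from the throughput figures on the Computer Resources slide, and the function name `relative_cost` is an assumption for this example; real timings depend on the molecule, the basis set and the convergence behaviour.

```python
def relative_cost(mw, ref_mw=160.0, exponent=4.0):
    """Estimated cost of one molecule relative to a molecule of weight ref_mw."""
    return (mw / ref_mw) ** exponent


if __name__ == "__main__":
    for mw in (160, 250, 320, 500):
        print(f"MW {mw}: about {relative_cost(mw):.1f}x the cost at MW 160")
```

Under this model, doubling the molecular weight multiplies the cost by 16, and a molecule at MW 500 costs roughly 95 times as much as one at MW 160.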
Computer Resources 
• RICC: (Intel Xeon 5570 Westmere, 2.93 GHz, 8 
cores/node) x 1000 
– 1000-10000 molecules/day (MW 160) 
– Depends heavily on the load from other users 
– Time limit: 8 hours 
• Quest: (Intel Core2 Duo, 1.6 GHz/node) x 700 
– 3000-8000 molecules/day (MW 160) 
– 100-1000 molecules/day (MW 200-300) 
– Time limit: 20 hours 
• Compounds whose calculations fail are skipped for 
now.
Computer Resources 
• Storage 
– Approx. 500GB for 1,000,000 molecules (xz 
compressed) 
– Approx. 20 TB for 40,000,000 molecules (xz 
compressed)
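The two storage figures are consistent with a simple linear extrapolation, as this small Python sketch shows. It assumes decimal units (1 TB = 1000 GB) and a constant average size per molecule; the names `GB_PER_MILLION` and `storage_tb` are made up for this example.

```python
GB_PER_MILLION = 500  # from the slide: about 500 GB per 1,000,000 molecules (xz compressed)


def storage_tb(n_molecules):
    """Linear estimate of compressed storage in TB, assuming 1 TB = 1000 GB."""
    return n_molecules / 1_000_000 * GB_PER_MILLION / 1000


if __name__ == "__main__":
    print(storage_tb(1_000_000), "TB for 1,000,000 molecules")
    print(storage_tb(40_000_000), "TB for 40,000,000 molecules")
```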
Molecular weight and Lipinski Rule 
• Lipinski's rule of five (Pfizer's rule of five): rule of 
thumb for drug discovery 
• No more than 5 hydrogen bond donors 
• Not more than 10 hydrogen bond acceptors 
• A molecular mass less than 500 daltons 
• An octanol-water partition coefficient log P not greater than 5 
• A molecular weight below 500 also suits 
computational chemistry very well 
– Routine calculations need no experimental data 
other than the molecular formula 
– Above 500, secondary or higher-order structure 
becomes important, e.g., in proteins
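The four criteria listed above translate directly into a small check. The Python sketch below takes precomputed descriptors and reports which criteria are violated; the `Descriptors` class, the function name `lipinski_violations` and the example values are hypothetical, and computing the descriptors themselves (for example log P) would need a cheminformatics toolkit that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_mass: float  # daltons
    log_p: float           # octanol-water partition coefficient


def lipinski_violations(d):
    """Return the list of rule-of-five criteria (as worded on the slide) that d violates."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if d.molecular_mass >= 500:
        violations.append("molecular mass not less than 500 daltons")
    if d.log_p > 5:
        violations.append("log P greater than 5")
    return violations


if __name__ == "__main__":
    small = Descriptors(h_bond_donors=1, h_bond_acceptors=4, molecular_mass=180.2, log_p=1.2)
    large = Descriptors(h_bond_donors=6, h_bond_acceptors=12, molecular_mass=650.0, log_p=5.5)
    print(lipinski_violations(small))
    print(lipinski_violations(large))
```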
Molecular Weight distribution at 
PubChem 
Lipinski limit MW=500 
We are still here 
30,000,000 molecules 
(excluding mixtures)
How long will it take to finish? 
• For drug design, we need to calculate all 
molecules with MW < 500 
• Total 30,000,000 molecules 
– This number may increase in the future 
• Current (2014/12/4): 1,100,000 molecules 
– Only about 4% 
• 10,000 molecules/day -> 8.2 years
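The 8.2-year figure follows from dividing the total by the daily throughput. The Python sketch below reproduces it and also shows the effect of subtracting the molecules already done; it assumes a constant rate and 365.25 days per year, and the function name `years_to_finish` is made up for this example.

```python
def years_to_finish(total, per_day, done=0):
    """Years needed to calculate the remaining molecules at a constant daily rate."""
    return (total - done) / per_day / 365.25


if __name__ == "__main__":
    print(round(years_to_finish(30_000_000, 10_000), 1))                  # whole set
    print(round(years_to_finish(30_000_000, 10_000, done=1_100_000), 1))  # remaining set
```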
How long will it take to finish? 
• 10+ years? No, maybe far less. 
• 25 years ago (1990), computers were very slow 
– Even ab initio calculations were very difficult on a 
486DX@25MHz or 
68000@10MHz
Outlook, prospect, hope… 
• Far better in silico screening 
– Fewer or no experiments are necessary 
• Even faster calculation using machine learning 
– 10,000 molecules / second ? 
– Using our data as the training set. 
– Not difficult for bio or organic molecules 
– Far better initial guess 
• Database for chemical reaction 
– Precise calculation is required 
– GRRM method + machine learning (?) 
• Geometry optimization for proteins (PDB) 
– Only X-ray crystal structures are available 
http://pubchemqc.riken.jp/
Difficulties in this project 
• Parameters needed for calculations vary by 
molecule 
• Properties can differ depending on the initial guess 
• Computer Resources 
– Raspberry Pi? NVIDIA Jetson? BOINC? 
• Molecular encoding never ends 
– Neither SMILES nor InChI is complete 
– Some corner cases may be chemically interesting.
