MS2PROP
(biosortia2prop)
QED properties and Lipinski
https://github.com/patrickchirdon/biosortia
Key finding: ms2prop had an R^2 of 0.73 on the independent test set across all the QED properties;
we have an R^2 of 0.88 (we beat envedabio)
Additions to biosortia2prop in progress:
An open-source calculator would bring people to the Biosortia web site and would require people to cite you if they used it.
We could also build additional ChEMBL models on request.
Next goal: finish the calculator, screen your compounds using the PASS program, and find targets!
Methods

 Lasso regression: least absolute shrinkage and selection operator regression is a regularized version of linear regression. It adds a regularization term to the cost function using the L1 norm of the weight vector. An important characteristic of lasso regression is that it tends to completely eliminate the weights of the least important features (set them to 0). Lasso regression therefore automatically performs feature selection and outputs a sparse model.
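To illustrate why the L1 penalty produces exact zeros, here is a minimal sketch of soft-thresholding, the shrinkage step that the lasso solution reduces to when the features are orthonormal. The weights and the penalty value are hypothetical and are not taken from the biosortia2prop models.

```python
def soft_threshold(weight: float, penalty: float) -> float:
    """Shrink a weight toward zero; weights no larger than the penalty become exactly 0."""
    if weight > penalty:
        return weight - penalty
    if weight < -penalty:
        return weight + penalty
    return 0.0


# Hypothetical unregularized weights for four features (illustration only).
unregularized_weights = [2.5, -0.3, 0.05, -1.2]

# Apply the L1 shrinkage with a hypothetical penalty of 0.5.
sparse_weights = [soft_threshold(w, penalty=0.5) for w in unregularized_weights]

# The two small weights are removed entirely; the large ones are only shrunk.
print(sparse_weights)
```

The two weakest features drop out of the model, which is the feature selection behaviour described above.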

 Elastic net: elastic net is a middle ground between ridge regression and lasso regression. The regularization term is a simple mix of both ridge and lasso's regularization terms, and you can control the mix ratio r. When r = 0, elastic net is equivalent to ridge regression, and when r = 1, it is equivalent to lasso regression. Ridge regression is a good default, but if you suspect that only a few features are actually useful, you should prefer lasso or elastic net, since they tend to reduce the useless features' weights to zero. In general, elastic net is preferred over lasso, since lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
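The mix ratio r can be made concrete with a small sketch of the elastic net regularization term. The factor of 1/2 on the ridge part is a common convention and is an assumption here, as are the strength parameter `alpha` and the example weights.

```python
def elastic_net_penalty(weights, alpha: float, r: float) -> float:
    """Regularization term: r * alpha * L1 + (1 - r) / 2 * alpha * (squared L2)."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return r * alpha * l1_term + (1.0 - r) / 2.0 * alpha * l2_term


# Hypothetical weight vector (illustration only).
weights = [1.0, -2.0, 0.0]

ridge_only = elastic_net_penalty(weights, alpha=1.0, r=0.0)  # pure ridge term
lasso_only = elastic_net_penalty(weights, alpha=1.0, r=1.0)  # pure lasso term
mixed = elastic_net_penalty(weights, alpha=1.0, r=0.5)       # half of each

print(ridge_only, lasso_only, mixed)
```

With r = 0 only the squared weights contribute, with r = 1 only the absolute weights contribute, and intermediate values blend the two.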

 How good is a regression?

 Statisticians have come up with a tool that is easy to understand: R^2, the share of the variance in the target that the model explains. R^2 is typically read as a percentage from 0% to 100%, although on held-out data it can fall below 0% when the model predicts worse than the mean of the data. The higher it is, the greater the explanatory power of the regression model (the smaller the share of unexplained squares, the better the model).

 https://medium.com/wwblog/evaluating-regression-models-using-rmse-and-r%C2%B2-42f77400efee
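As a minimal sketch of the definition, R^2 can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The target values below are hypothetical and are not results from the deck.

```python
def r_squared(y_true, y_pred) -> float:
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_residual / ss_total


# Hypothetical measured values (illustration only).
y_true = [1.0, 2.0, 3.0, 4.0]

perfect_fit = r_squared(y_true, y_true)                  # predictions match exactly
mean_baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_fit = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])   # worse than the mean

print(perfect_fit, mean_baseline, reversed_fit)
```

A perfect model scores 1, a model that always predicts the mean scores 0, and a model that does worse than the mean scores below 0.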
Scikit-learn models of QED test data
 Solubility: lasso LARS had an R^2 of 1, lasso an R^2 of 1, and elastic net an R^2 of 1 on test data.
 Bioavailability: lasso LARS had an R^2 of 1 on test data, lasso an R^2 of 1, and elastic net an R^2 of 1.
 Fraction of sp3-hybridized carbons (a measure of selectivity of binding): lasso LARS had an R^2 of 1 on training data (the first model for which lasso LARS worked), lasso an R^2 of 1, and elastic net an R^2 of 1.
 mGluR5 (autism target), hsp90a (neuroinflammatory target, Charcot-Marie-Tooth disease), calpain 1 (COVID-19 target), and the aphid mortality model: R^2 of 1 with the elastic net, lasso, and lasso LARS models.
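Lipinski druglikeness
 Sensitivity 93.7%, specificity 60%, and accuracy 87% for the Lipinski classification.
 QED builds on the Lipinski properties (molecular weight, logP, hydrogen-bond donors and acceptors) and adds polar surface area, rotatable bond count, aromatic ring count, and med chem rules (SMARTS structural alerts).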
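The sketch below shows how a Lipinski classification and its sensitivity, specificity, and accuracy could be computed. It assumes the standard rule-of-five thresholds with at most one violation allowed, takes precomputed descriptor values as input, and uses a small hypothetical labeled set; it is not the deck's dataset and does not reproduce the figures above.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors) -> int:
    """Count rule-of-five violations: MW > 500, logP > 5, donors > 5, acceptors > 10."""
    return sum([mol_weight > 500, logp > 5, h_donors > 5, h_acceptors > 10])


def is_druglike(mol_weight, logp, h_donors, h_acceptors, max_violations=1) -> bool:
    """Assumption: a compound passes if it has at most one violation."""
    return lipinski_violations(mol_weight, logp, h_donors, h_acceptors) <= max_violations


def classification_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy); assumes both classes are present."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)


# Hypothetical compounds: (MW, logP, H-bond donors, H-bond acceptors, known drug?)
compounds = [
    (180.2, 1.2, 1, 4, True),
    (350.4, 3.1, 2, 5, True),
    (620.7, 6.3, 6, 12, False),
    (540.0, 4.0, 3, 8, False),
    (720.9, 5.8, 2, 11, True),
    (410.5, 2.2, 1, 6, True),
]

labels = [c[4] for c in compounds]
predictions = [is_druglike(*c[:4]) for c in compounds]
sensitivity, specificity, accuracy = classification_metrics(labels, predictions)
print(sensitivity, specificity, accuracy)
```

Sensitivity is the fraction of true drugs that pass the rule, and specificity is the fraction of non-drugs that fail it.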
We trained a Keras neural net on the GNPS natural products mass spec database.
Test data were excluded from the training data.
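The slides do not say how the test compounds were kept out of the training set. One simple way, sketched here under the assumption that every record carries a unique compound identifier (the `id` field and the records are hypothetical), is to drop any training record whose identifier also appears in the test set.

```python
def exclude_test_compounds(training_records, test_records, key="id"):
    """Return the training records whose identifier does not occur in the test set."""
    test_keys = {record[key] for record in test_records}
    return [record for record in training_records if record[key] not in test_keys]


# Hypothetical records (illustration only).
training = [
    {"id": "cmpd-1", "qed": 0.71},
    {"id": "cmpd-2", "qed": 0.35},
    {"id": "cmpd-3", "qed": 0.52},
]
test = [
    {"id": "cmpd-2", "qed": 0.35},
    {"id": "cmpd-9", "qed": 0.64},
]

clean_training = exclude_test_compounds(training, test)
print([record["id"] for record in clean_training])
```

Filtering on a structure-based identifier rather than a name would also catch the same compound entered under different names.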
For CNS compounds, moderately polar (PSA < 79 Å²) and relatively lipophilic (log P from +0.4 to +6.0) molecules have a high probability of accessing the CNS.
FDA Test Set (independent dataset)
The Egan rule considers bioavailability good for compounds with 0 ≤ tPSA ≤ 132 Å² and −1 ≤ logP ≤ 6 [15].
For GI absorption, PSA should be lower than 142 Å² and log P between −2.3 and +6.8.
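The three threshold rules quoted on these slides can be written as simple range checks. This is a minimal sketch that assumes PSA and logP have already been computed by another tool; the function names and the example values are hypothetical.

```python
def passes_egan(tpsa: float, logp: float) -> bool:
    """Egan rule as quoted: 0 <= tPSA <= 132 Å² and -1 <= logP <= 6."""
    return 0 <= tpsa <= 132 and -1 <= logp <= 6


def likely_gi_absorbed(psa: float, logp: float) -> bool:
    """GI absorption as quoted: PSA < 142 Å² and logP between -2.3 and +6.8."""
    return psa < 142 and -2.3 <= logp <= 6.8


def likely_cns_penetrant(psa: float, logp: float) -> bool:
    """CNS access as quoted: PSA < 79 Å² and logP between +0.4 and +6.0."""
    return psa < 79 and 0.4 <= logp <= 6.0


# Hypothetical (PSA in Å², logP) pairs, not measured values.
small_lipophilic = (45.0, 2.8)
polar = (120.0, -0.5)
very_polar = (160.0, -3.0)

for name, (psa, logp) in [("small_lipophilic", small_lipophilic),
                          ("polar", polar),
                          ("very_polar", very_polar)]:
    print(name, passes_egan(psa, logp), likely_gi_absorbed(psa, logp),
          likely_cns_penetrant(psa, logp))
```

The CNS window sits inside the GI absorption window, so a compound predicted to reach the CNS is also predicted to be absorbed.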
Pesticides
Egan’s rule holds for pesticides.
EPA environmental toxicity calculator
https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
Egan's rule can differentiate CNS compounds from non-CNS compounds
For CNS compounds, moderately polar (PSA < 79 Å²) and relatively lipophilic (log P from +0.4 to +6.0) molecules have a high probability of accessing the CNS.
Molecular Descriptors for Chemoinformatics
 Todeschini and Consonni
Properties of pesticides:
https://www.uky.edu/Ag/Entomology/PSEP/6environment.html
When trying to find a target, all you need is a pharmacophore. Since we know 140 structures, we can find their pharmacophores.
https://academic.oup.com/nar/article/45/W1/W356/3791213
https://hmdb.ca/metabolites?utf8=%E2%9C%93&quantified=1&blood=1&urine=1&saliva=1&csf=1&feces=1&sweat=1&breast_milk=1&bile=1&amniotic_fluid=1&other_fluids=1&microbial=1&filter=true
140 structures from the link above
Malaria predictor
http://chembl.blogspot.com/2020/05/malaria-inhibitor-prediction-platform.html
Antiviral predictor
http://crdd.osdd.net/servers/avcpred/
Molecular Descriptors for Chemoinformatics
Gustafson, DI (1989). Groundwater ubiquity score: a simple method for assessing pesticide leachability. Environ. Toxicol. Chem., 339-357.
Papa, E, Castiglioni, S, Gramatica, P, Nikolayenko, V, Kayumov, O and Calamari, D (2004). Screening the leaching tendency of pesticides applied in the Amu Darya Basin (Uzbekistan). Water Res., 38, 3485-3494.
Laskowski, DA, Goring, CAI, McCall, PJ and Swann, RL (1982). Terrestrial environment, in Environmental Analysis for Chemicals (ed. RA Conway), Van Nostrand Reinhold Company, New York, pp. 198-240.
Gramatica, P and Di Guardo, A (2002). Screening of pesticides for environmental partitioning tendency. Chemosphere, 47, 947-956.
Winget, P, Cramer, CJ and Truhlar, DG (2000). Prediction of soil sorption coefficients using a universal solvation model. Environ. Sci. Technol., 34, 4733-4740.
Andrews, PR, Craik, DJ and Martin, JL (1984). Functional group contributions to drug-receptor interactions. J. Med. Chem., 27, 1648-1657.
Muegge, I (2002). Pharmacophore features of potential drugs. Chem. Eur. J., 8, 1977-1981.
Muegge, I (2003). Selection criteria for drug-like compounds. Med. Res. Rev., 23, 302-321.
Muegge, I, Heald, SL and Brittelli (2001). Simple selection criteria for drug-like chemical matter. J. Med. Chem., 44, 1841-1846.
 Compounds of interest:
 Ulvan: https://www.jpost.com/health-and-wellness/could-seaweed-save-humanity-from-covid-19-687775
https://onlinelibrary.wiley.com/doi/epdf/10.1002/adma.202206367 Ulvan is a plastic similar to PEI. It is useful against COVID-19 and against agricultural viruses.
https://www.newscientist.com/article/2341170-battery-made-using-seaweed-still-works-after-charging-1000-times/?utm_medium=social&utm_campaign=echobox&utm_source=Facebook#Echobox=1665398953
https://www.sciencedirect.com/science/article/pii/S2211926418308373
 Alginate nanoparticles coated with antibodies for tumors-- partner with freenome?
https://www.science.org/doi/10.1126/science.abq6990?utm_campaign=SciMag&utm_medium=ownedSocial&utm_source=LinkedIn&cookieSet=1
https://www.freenome.com/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6432598/
 Grätzel cells (dye-sensitized solar cells): https://www.researchgate.net/publication/256549423_Brown_seaweed_pigment_as_a_dye_source_for_photoelectrochemical_solar_cells
 omega 3
 sunscreen
anti-inflammatories
 https://www.newscientist.com/article/2355262-vagus-nerve-receptors-may-be-key-to-controlling-inflammation/
 https://uantwerpen.vib.be/group/VincentTimmerman
 We made an hsp90a model with an R^2 of 1.
 Heat shock protein dysregulation is a key part of neuroinflammation in Charcot-Marie-Tooth disease.
 We also have enough data to build a c-Myc inhibitor model; c-Myc is a cancer target and an anti-aging target (see old mouse become young mouse).
 https://molecular-cancer.biomedcentral.com/articles/10.1186/s12943-020-01291-6

 https://rupress.org/jcb/article/220/8/e202103090/212429/The-long-journey-to-bring-a-Myc-inhibitor-to-the

 https://time.com/6246864/reverse-aging-scientists-discover-milestone/

 https://www.nationalgeographic.com/magazine/science/article/zombie-cells-could-hold-the-secret-to-alzheimers-cure
 potential partnerships--
 https://www.calicolabs.com/ (anti-aging)
 https://insilico.com/ (phase 0 clinical trial in less than a month)
 https://www.envedabio.com/ (our algorithm beats envedabio’s ms2prop algorithm)
 https://www.harringtondiscovery.org/ (local, Cleveland University Hospitals, rare and orphan disease drug discovery)
 https://www.energy.gov/osdbu/small-business-toolbox
 https://www.trialspark.com/ (CRO for matching labs to clinical trials)
 Timmerman lab in Belgium? (Charcot-Marie-Tooth disease lab)
https://oig.hhs.gov/oei/reports/oei-09-00-00380.pdf
RESOURCES
https://www.rosettacommons.org/
https://openmolecules.org/datawarrior/
https://pubchem.ncbi.nlm.nih.gov//edit3/index.html
Databases--
GNPS library-- https://ccms-ucsd.github.io/GNPSDocumentation/gnpslibraries/
https://ec.europa.eu/food/plant/pesticides/eu-pesticides-database/start/screen/active-substances
https://cbirt.net/meta-ai-releases-esm-metagenomic-atlas-a-repository-of-over-600-million-predicted-protein-structures/
https://comptox.epa.gov/genra/
https://www.metaboanalyst.ca/MetaboAnalyst/ModuleView.xhtml
https://ipb-halle.github.io/MetFrag/projects/metfragweb/
https://hmdb.ca/spectra/ms_ms/search
https://pubchem.ncbi.nlm.nih.gov/
https://www.ebi.ac.uk/chembl/
https://ochem.eu/home/show.do
https://alphafold.ebi.ac.uk/
https://www.rcsb.org/
https://cfmid.wishartlab.com/predict
https://massbank.eu/MassBank/Search
http://www.swissadme.ch
books
 https://books.google.com/books/about/Why_Digital_Transformations_Fail.html?id=L_T1uwEACAAJ
 https://www.linkedin.com/feed/update/urn:li:activity:7006165405937373184/?utm_source=share&utm_medium=member_desktop
 https://mml-book.github.io/book/mml-book.pdf
 https://www.google.com/books/edition/Hands_On_Machine_Learning_with_Scikit_Le/HnetDwAAQBAJ?hl=en&gbpv=1&dq=hands+on+machine+learning&printsec=frontcover
chemoinformatics:
 https://www.google.com/books/edition/Molecular_Descriptors_for_Chemoinformati/6Zp7Yrqzv8AC?hl=en&gbpv=1&dq=molecular+descriptors+for+chemoinformatics&printsec=frontcover
Thank you
tasty treats
https://www.submariner-network.eu/recipe-for-a-veggie-burger-with-algae
