SlideShare a Scribd company logo
Representing molecules with minimalism:
A solution to the entropy of informatics
Alex M. Clark
⚡
COLLABORATIVE DRUG DISCOVERY
Introduction
◈ InChI is great if...
▷ applied within its domain
▷ used for the problems it is designed to solve
▷ supplied with appropriate data
2
J. Cheminf.
7:9 (2015)
Encoding Types
◈ 1D
◈ 2D
◈ 3D
3
identifier (InChI, SMILES)
conformation (QC, MM, crystals)
diagram (ChemDraw etc.)
representation (Molfile)
optimised for presentation
optimised for information
Sketch vs. Cheminformatics
◈ Most cheminformatics data comes in through
sketches:
▷ chemists use software designed to match
conventions dating back to 19th century
◈ Software is given a choice:
1. capture the stylistic choices (diagram primacy)
2. capture the fundamental meaning (data purity)
◈ We need to do both...
4
Minimalism
◈ Want to describe the intention of the scientist who has
made a discovery
▷ capture the fact without destroying information
▷ display it faithfully and recognisably
▷ compare to other molecules
▷ manipulate easily with many software tools
◈ Start with extremely minimalistic primitives...
◈ ... extend that with semi-optional metadata
5
Primitives
◈ Connection table style closely aligned with editor tools
◈ Atoms
▷ Symbol
▷ Isotope
▷ Charge (integral)
▷ Virtual hydrogens

(explicit vs. computed)
▷ X,Y coordinates
◈ Anything else is a nice-to-have...
6
◈ Bonds
▷ From, To
▷ Order (0, 1, 2, 3, 4, ...)
▷ Wedge: quasi-3D
(stereochemistry in situ)
Practical Solution
◈ An actual format is necessary...
◈ ... MDL Molfile format is the ad hoc standard
▷ originally poorly specified, though now much better
▷ but too late: literally thousands of scripts/products
▷ doesn't really matter what the spec says (cf. V3000)
◈ Effectively the lowest common denominator feature
set of software in current use...
◈ ... need to expand this subset, while deprecating.
7
OpenMolecule Idea
◈ Determine a level of nonstandard features necessary
▷ level 1.0: features that every MDL Molfile reader &
writer can handle with no information loss
▷ level 1.x: increasing likelihood of confusion
◈ Lowest level possible
▷ mark internally
▷ or quarantine
◈ Keep using Molfiles compatibly until replacement...
8
OpenMolecule Idea
◈ Determine a level of nonstandard features necessary
▷ level 1.0: features that every MDL Molfile reader &
writer can handle with no information loss
▷ level 1.x: increasing likelihood of confusion
◈ Lowest level possible
▷ mark internally
▷ or quarantine
◈ Keep using Molfiles compatibly until replacement...
8
Level 1.0: Minimal
◈ Lowest common denominator of all readers/writers
▷ the drug-like subset of organic chemistry
◈ To guarantee round-trip survival:
▷ Atoms are atomic symbols
▷ Bonds are 1, 2, 3
▷ Chirality by wedges (not parity)
▷ E/Z always specified by coordinates
▷ Implied hydrogens never in doubt
9
Level 1.1: Bigger
◈ 1000 atoms / bonds is a Molfile V2000 limit
◈ V3000 fixes this
◈ Limit rarely exceeded for most use cases
◈ V3000 is ugly, adoption slow
◈ Compliance means reading/writing V3000
10
Level 1.2: H-control, zero bonds
◈ Need to override virtual hydrogens and define a strict
rule for implicit hydrogen counting
◈ Need a bond type that doesn't affect valence
counting: allow 0-order
◈ Minimal changes - can describe a lot of inorganic
chemistry - see later for beautification
11
(Ⅱ)
(Ⅳ)
J. Chem. Inf. Model.
52, 3149-3157 (2011)
Inorganics
12
Inorganics
12
Inorganics
12
Inorganics
12
Molecule Metadata
◈ Metadata = optional information
▷ can beautify the structure, or assist interpretation
▷ but core representation is still meaningful &
renderable
◈ Can help interpretation and/or aesthetics, e.g.
13
resonance & symmetry
preferred visual style
Informatics Metadata
14
3c2e
Informatics Metadata
14
3c2e
dative
Informatics Metadata
14
3c2e
dative
+
resonance
Informatics Metadata
14
3c2e
dative
+
resonance
tautomer
tautomer
Stereochemistry
◈ Best to use 2D coordinates + wedges to derive...
◈ ... label/parity is metadata
15
S
R
S
R
S
S
R
R
cis trans
(R)-2,2′-Bis(bromomethyl)-1,1′-binaphthyl
Abbreviations
◈ Using abbreviations instead of all
heavy atoms must be accompanied
by a definition...
◈ ... everyone has their own
◈ Popular but nice-to-have in drug
discovery; essential elsewhere
◈ Support it via S-group in Molfile, inline
abbreviations in SketchEl
◈ Must be able to expand to get all atom
16
Enumeration Rules
◈ Formulae for multiple structures are not safe for naïve
interpretation (exception to minimalist approach)
17
Mixtures
xylenes
Enumeration Rules
◈ Formulae for multiple structures are not safe for naïve
interpretation (exception to minimalist approach)
17
Mixtures
xylenes
Markush
R1 = H, CH3
R2 = Me, Et, Pr, Bu
R3 = Ph, tolyl
X = N, CH
Y = C, N+
Z = Cl, CH2Cl
( )
n
Enumeration Rules
◈ Formulae for multiple structures are not safe for naïve
interpretation (exception to minimalist approach)
17
*
htn htn*
Cl
ran
Polymers
Mixtures
xylenes
Markush
R1 = H, CH3
R2 = Me, Et, Pr, Bu
R3 = Ph, tolyl
X = N, CH
Y = C, N+
Z = Cl, CH2Cl
( )
n
Future Work
◈ Slumming it with molfiles:
▷ agree on valid extensions
▷ self-awareness of limitations
◈ Lossless conversions between different formats
◈ Chart a course away from Molfiles...
18

More Related Content

Similar to Representing molecules with minimalism: A solution to the entropy of informatics

Flow-based programming with Elixir
Flow-based programming with ElixirFlow-based programming with Elixir
Flow-based programming with Elixir
Anton Mishchuk
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
elliando dias
 
DEF CON 27 - workshop - EIGENTOURIST - hacking with monads
DEF CON 27 - workshop - EIGENTOURIST - hacking with monadsDEF CON 27 - workshop - EIGENTOURIST - hacking with monads
DEF CON 27 - workshop - EIGENTOURIST - hacking with monads
Felipe Prado
 
Dna computing
Dna computingDna computing
Dna computing
sathish3
 
Deep MIML Network
Deep MIML NetworkDeep MIML Network
Deep MIML Network
Saad Elbeleidy
 
Machine learning for document analysis and understanding
Machine learning for document analysis and understandingMachine learning for document analysis and understanding
Machine learning for document analysis and understanding
Seiichi Uchida
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
NextMove Software
 
PF_MAO2010 Souma
PF_MAO2010 SoumaPF_MAO2010 Souma
PF_MAO2010 Souma
Souma Chowdhury
 
Generic Programming Galore Using D
Generic Programming Galore Using DGeneric Programming Galore Using D
Generic Programming Galore Using D
Andrei Alexandrescu
 
Some "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontSome "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data front
Greg Landrum
 
Legacy is Good
Legacy is GoodLegacy is Good
Legacy is Good
Uberto Barbini
 
Alex WANG - What is the most effective cryptosystem for public-key encryption?
Alex WANG - What is the most effective cryptosystem for public-key encryption?Alex WANG - What is the most effective cryptosystem for public-key encryption?
Alex WANG - What is the most effective cryptosystem for public-key encryption?
AlexWang212277
 
Verilog HDL
Verilog HDL Verilog HDL
Verilog HDL
HasmukhPKoringa
 
Large Partially-connected Erlang Clusters
 Large Partially-connected Erlang Clusters Large Partially-connected Erlang Clusters
Large Partially-connected Erlang Clusters
Motiejus Jakštys
 
Week 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnWeek 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-On
SARCCOM
 
Dna computing
Dna computingDna computing
Dna computing
Shashwat Shriparv
 
Bio_Computing
Bio_ComputingBio_Computing
Big Decimal: Avoid Rounding Errors on Decimals in JavaScript
Big Decimal: Avoid Rounding Errors on Decimals in JavaScriptBig Decimal: Avoid Rounding Errors on Decimals in JavaScript
Big Decimal: Avoid Rounding Errors on Decimals in JavaScript
Igalia
 
Document Analysis with Deep Learning
Document Analysis with Deep LearningDocument Analysis with Deep Learning
Document Analysis with Deep Learning
aiaioo
 
4 1 tree world
4 1 tree world4 1 tree world
4 1 tree world
Leonardo Auslender
 

Similar to Representing molecules with minimalism: A solution to the entropy of informatics (20)

Flow-based programming with Elixir
Flow-based programming with ElixirFlow-based programming with Elixir
Flow-based programming with Elixir
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
DEF CON 27 - workshop - EIGENTOURIST - hacking with monads
DEF CON 27 - workshop - EIGENTOURIST - hacking with monadsDEF CON 27 - workshop - EIGENTOURIST - hacking with monads
DEF CON 27 - workshop - EIGENTOURIST - hacking with monads
 
Dna computing
Dna computingDna computing
Dna computing
 
Deep MIML Network
Deep MIML NetworkDeep MIML Network
Deep MIML Network
 
Machine learning for document analysis and understanding
Machine learning for document analysis and understandingMachine learning for document analysis and understanding
Machine learning for document analysis and understanding
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
PF_MAO2010 Souma
PF_MAO2010 SoumaPF_MAO2010 Souma
PF_MAO2010 Souma
 
Generic Programming Galore Using D
Generic Programming Galore Using DGeneric Programming Galore Using D
Generic Programming Galore Using D
 
Some "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data frontSome "challenges" on the open-source/open-data front
Some "challenges" on the open-source/open-data front
 
Legacy is Good
Legacy is GoodLegacy is Good
Legacy is Good
 
Alex WANG - What is the most effective cryptosystem for public-key encryption?
Alex WANG - What is the most effective cryptosystem for public-key encryption?Alex WANG - What is the most effective cryptosystem for public-key encryption?
Alex WANG - What is the most effective cryptosystem for public-key encryption?
 
Verilog HDL
Verilog HDL Verilog HDL
Verilog HDL
 
Large Partially-connected Erlang Clusters
 Large Partially-connected Erlang Clusters Large Partially-connected Erlang Clusters
Large Partially-connected Erlang Clusters
 
Week 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-OnWeek 3 Deep Learning And POS Tagging Hands-On
Week 3 Deep Learning And POS Tagging Hands-On
 
Dna computing
Dna computingDna computing
Dna computing
 
Bio_Computing
Bio_ComputingBio_Computing
Bio_Computing
 
Big Decimal: Avoid Rounding Errors on Decimals in JavaScript
Big Decimal: Avoid Rounding Errors on Decimals in JavaScriptBig Decimal: Avoid Rounding Errors on Decimals in JavaScript
Big Decimal: Avoid Rounding Errors on Decimals in JavaScript
 
Document Analysis with Deep Learning
Document Analysis with Deep LearningDocument Analysis with Deep Learning
Document Analysis with Deep Learning
 
4 1 tree world
4 1 tree world4 1 tree world
4 1 tree world
 

More from Alex Clark

Mixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicalsMixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicals
Alex Clark
 
Mixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsMixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream products
Alex Clark
 
Mixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informaticsMixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informatics
Alex Clark
 
Mixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsMixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer products
Alex Clark
 
Coordination InChI (2019)
Coordination InChI (2019)Coordination InChI (2019)
Coordination InChI (2019)
Alex Clark
 
Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...
Alex Clark
 
Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...
Alex Clark
 
ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)
Alex Clark
 
Autonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAutonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocols
Alex Clark
 
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
Alex Clark
 
BioAssay Express
BioAssay ExpressBioAssay Express
BioAssay Express
Alex Clark
 
SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?
Alex Clark
 
The anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsThe anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithms
Alex Clark
 
Compact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsCompact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile apps
Alex Clark
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by design
Alex Clark
 
ICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab Notebook
Alex Clark
 
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Alex Clark
 
Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)
Alex Clark
 
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Alex Clark
 
Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013
Alex Clark
 

More from Alex Clark (20)

Mixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicalsMixtures QSAR: modelling collections of chemicals
Mixtures QSAR: modelling collections of chemicals
 
Mixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream productsMixtures InChI: a story of how standards drive upstream products
Mixtures InChI: a story of how standards drive upstream products
 
Mixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informaticsMixtures as first class citizens in the realm of informatics
Mixtures as first class citizens in the realm of informatics
 
Mixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer productsMixtures: informatics for formulations and consumer products
Mixtures: informatics for formulations and consumer products
 
Coordination InChI (2019)
Coordination InChI (2019)Coordination InChI (2019)
Coordination InChI (2019)
 
Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...Chemical mixtures: File format, open source tools, example data, and mixtures...
Chemical mixtures: File format, open source tools, example data, and mixtures...
 
Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...Bringing bioassay protocols to the world of informatics, using semantic annot...
Bringing bioassay protocols to the world of informatics, using semantic annot...
 
ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)ACS CINF Luncheon talk (Boston 2018)
ACS CINF Luncheon talk (Boston 2018)
 
Autonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocolsAutonomous model building with a preponderance of well annotated assay protocols
Autonomous model building with a preponderance of well annotated assay protocols
 
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
CDD BioAssay Express: Expanding the target dimension: How to visualize a lot ...
 
BioAssay Express
BioAssay ExpressBioAssay Express
BioAssay Express
 
SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?SLAS2016: Why have one model when you could have thousands?
SLAS2016: Why have one model when you could have thousands?
 
The anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithmsThe anatomy of a chemical reaction: Dissection by machine learning algorithms
The anatomy of a chemical reaction: Dissection by machine learning algorithms
 
Compact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile appsCompact models for compact devices: Visualisation of SAR using mobile apps
Compact models for compact devices: Visualisation of SAR using mobile apps
 
Green chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by designGreen chemistry in chemical reactions: informatics by design
Green chemistry in chemical reactions: informatics by design
 
ICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab NotebookICCE 2014: The Green Lab Notebook
ICCE 2014: The Green Lab Notebook
 
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
Cloud hosted APIs for cheminformatics on mobile devices (ACS Dallas 2014)
 
Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)Building a mobile reaction lab notebook (ACS Dallas 2014)
Building a mobile reaction lab notebook (ACS Dallas 2014)
 
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
Reaction Lab Notebooks for Mobile Devices - Alex M. Clark - GDCh 2013
 
Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013Alex Clark : NETTAB 2013
Alex Clark : NETTAB 2013
 

Recently uploaded

8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
PirithiRaju
 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
PsychoTech Services
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
PirithiRaju
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
Areesha Ahmad
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
Frédéric Baudron
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Selcen Ozturkcan
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
RDhivya6
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
Sciences of Europe
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
Advanced-Concepts-Team
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
Scintica Instrumentation
 
Modelo de slide quimica para powerpoint
Modelo  de slide quimica para powerpointModelo  de slide quimica para powerpoint
Modelo de slide quimica para powerpoint
Karen593256
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 

Recently uploaded (20)

8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
 
Modelo de slide quimica para powerpoint
Modelo  de slide quimica para powerpointModelo  de slide quimica para powerpoint
Modelo de slide quimica para powerpoint
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 

Representing molecules with minimalism: A solution to the entropy of informatics

  • 1. Representing molecules with minimalism: A solution to the entropy of informatics Alex M. Clark ⚡ COLLABORATIVE DRUG DISCOVERY
  • 2. Introduction ◈ InChI is great if... ▷ applied within its domain ▷ used for the problems it is designed to solve ▷ supplied with appropriate data 2 J. Cheminf. 7:9 (2015)
  • 3. Encoding Types ◈ 1D ◈ 2D ◈ 3D 3 identifier (InChI, SMILES) conformation (QC, MM, crystals) diagram (ChemDraw etc.) representation (Molfile) optimised for presentation optimised for information
  • 4. Sketch vs. Cheminformatics ◈ Most cheminformatics data comes in through sketches: ▷ chemists use software designed to match conventions dating back to 19th century ◈ Software is given a choice: 1. capture the stylistic choices (diagram primacy) 2. capture the fundamental meaning (data purity) ◈ We need to do both... 4
  • 5. Minimalism ◈ Want to describe the intention of the scientist who has made a discovery ▷ capture the fact without destroying information ▷ display it faithfully and recognisably ▷ compare to other molecules ▷ manipulate easily with many software tools ◈ Start with extremely minimalistic primitives... ◈ ... extend that with semi-optional metadata 5
  • 6. Primitives ◈ Connection table style closely aligned with editor tools ◈ Atoms ▷ Symbol ▷ Isotope ▷ Charge (integral) ▷ Virtual hydrogens
 (explicit vs. computed) ▷ X,Y coordinates ◈ Anything else is a nice-to-have... 6 ◈ Bonds ▷ From, To ▷ Order (0, 1, 2, 3, 4, ...) ▷ Wedge: quasi-3D (stereochemistry in situ)
  • 7. Practical Solution ◈ An actual format is necessary... ◈ ... MDL Molfile format is the ad hoc standard ▷ originally poorly specified, though now much better ▷ but too late: literally thousands of scripts/products ▷ doesn't really matter what the spec says (cf. V3000) ◈ Effectively the lowest common denominator feature set of software in current use... ◈ ... need to expand this subset, while deprecating. 7
  • 8. OpenMolecule Idea ◈ Determine a level of nonstandard features necessary ▷ level 1.0: features that every MDL Molfile reader & writer can handle with no information loss ▷ level 1.x: increasing likelihood of confusion ◈ Lowest level possible ▷ mark internally ▷ or quarantine ◈ Keep using Molfiles compatibly until replacement... 8
  • 9. OpenMolecule Idea ◈ Determine a level of nonstandard features necessary ▷ level 1.0: features that every MDL Molfile reader & writer can handle with no information loss ▷ level 1.x: increasing likelihood of confusion ◈ Lowest level possible ▷ mark internally ▷ or quarantine ◈ Keep using Molfiles compatibly until replacement... 8
  • 10. Level 1.0: Minimal ◈ Lowest common denominator of all readers/writers ▷ the drug-like subset of organic chemistry ◈ To guarantee round-trip survival: ▷ Atoms are atomic symbols ▷ Bonds are 1, 2, 3 ▷ Chirality by wedges (not parity) ▷ E/Z always specified by coordinates ▷ Implied hydrogens never in doubt 9
  • 11. Level 1.1: Bigger ◈ 1000 atoms / bonds is a Molfile V2000 limit ◈ V3000 fixes this ◈ Limit rarely exceeded for most use cases ◈ V3000 is ugly, adoption slow ◈ Compliance means reading/writing V3000 10
  • 12. Level 1.2: H-control, zero bonds ◈ Need to override virtual hydrogens and define a strict rule for implicit hydrogen counting ◈ Need a bond type that doesn't affect valence counting: allow 0-order ◈ Minimal changes - can describe a lot of inorganic chemistry - see later for beautification 11 (Ⅱ) (Ⅳ) J. Chem. Inf. Model. 52, 3149-3157 (2011)
  • 17. Molecule Metadata ◈ Metadata = optional information ▷ can beautify the structure, or assist interpretation ▷ but core representation is still meaningful & renderable ◈ Can help interpretation and/or aesthetics, e.g. 13 resonance & symmetry preferred visual style
  • 22. Stereochemistry ◈ Best to use 2D coordinates + wedges to derive... ◈ ... label/parity is metadata 15 S R S R S S R R cis trans (R)-2,2′-Bis(bromomethyl)-1,1′-binaphthyl
  • 23. Abbreviations ◈ Using abbreviations instead of all heavy atoms must be accompanied by a definition... ◈ ... everyone has their own ◈ Popular but nice-to-have in drug discovery; essential elsewhere ◈ Support it via S-group in Molfile, inline abbreviations in SketchEl ◈ Must be able to expand to get all atom 16
  • 24. Enumeration Rules ◈ Formulae for multiple structures are not safe for naïve interpretation (exception to minimalist approach) 17 Mixtures xylenes
  • 25. Enumeration Rules ◈ Formulae for multiple structures are not safe for naïve interpretation (exception to minimalist approach) 17 Mixtures xylenes Markush R1 = H, CH3 R2 = Me, Et, Pr, Bu R3 = Ph, tolyl X = N, CH Y = C, N+ Z = Cl, CH2Cl ( ) n
  • 26. Enumeration Rules ◈ Formulae for multiple structures are not safe for naïve interpretation (exception to minimalist approach) 17 * htn htn* Cl ran Polymers Mixtures xylenes Markush R1 = H, CH3 R2 = Me, Et, Pr, Bu R3 = Ph, tolyl X = N, CH Y = C, N+ Z = Cl, CH2Cl ( ) n
  • 27. Future Work ◈ Slumming it with molfiles: ▷ agree on valid extensions ▷ self-awareness of limitations ◈ Lossless conversions between different formats ◈ Chart a course away from Molfiles... 18