So I have an SD File …
What do I do next?
Rajarshi Guha & Noel O’Boyle
NCATS & NextMove Software
ACS National Meeting, Boston 2015
What do you want to do?
What is the core issue?
• What you see on a
screen isn’t necessarily
what you get in a file
• Need to be aware of
how certain chemical
concepts are handled in
software
Tasks to be considered
• Searching for structures
• Managing inventory
• Linking / merging
structure data to other
data
• Predicting properties or
analysis of bioactivity
data
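Linking structure data to other data usually starts by pulling the data fields out of each SD file record. The following is a minimal sketch in plain Python, assuming the common SD layout: a molblock ending in `M  END`, data items introduced by `> <FIELD>`, and records terminated by `$$$$`. The function name `read_sd_records`, the field names and the `other_data` table are all illustrative assumptions; a toolkit reader will cope with many more edge cases.

```python
def read_sd_records(text):
    """Yield (molblock, fields) for each record in the text of an SD file."""
    molblock, fields = [], {}
    current = None        # name of the data item currently being read
    in_molblock = True    # each record starts with its molblock
    for line in text.splitlines():
        if line.strip() == "$$$$":                  # record terminator
            yield "\n".join(molblock), {k: "\n".join(v) for k, v in fields.items()}
            molblock, fields, current, in_molblock = [], {}, None, True
        elif in_molblock:
            molblock.append(line)
            if line.startswith("M  END"):
                in_molblock = False
        elif line.startswith(">"):                  # data header such as "> <ID>"
            start, end = line.find("<"), line.rfind(">")
            current = line[start + 1:end] if 0 <= start < end else None
            if current is not None:
                fields[current] = []
        elif current is not None:
            if line.strip():
                fields[current].append(line)
            else:                                   # blank line ends the data item
                current = None


# Made-up two-record example; "ID" and "Batch" are illustrative field names.
SAMPLE_SD = "\n".join([
    "methane", "  example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <ID>", "CPD-1", "",
    "> <Batch>", "B-001", "",
    "$$$$",
    "water", "  example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <ID>", "CPD-2", "",
    "$$$$",
])

records = list(read_sd_records(SAMPLE_SD))

# Hypothetical external table keyed on the same identifier.
other_data = {"CPD-1": {"plate": "P7"}, "CPD-2": {"plate": "P9"}}
merged = {f["ID"]: {**f, **other_data.get(f["ID"], {})} for _, f in records}
```

Keying the merge on an identifier field only works if that identifier survives every tool in the workflow unchanged.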
Which file format for data storage?
● The answer to this question is never XYZ or PDB
○ Don’t use a file format that throws away parts of
your chemical structure (connectivity, bond orders
or formal charges)
○ Software has to guess the missing information
● And probably not InChI
○ Without the ‘AuxInfo’, the chemical structure
obtained from an InChI is not necessarily the same
as the original (e.g. amides to imidic acids)
● SMILES and MOL are your go-to formats
○ Widely supported (i.e. portable), can recreate the
original structure
The question of identity
● A file format is not the same as an identifier
○ The same molecule can be represented in different
ways, even in the same format
● A “canonical” representation is required
○ To check identity, find or avoid duplicates, find overlap
of two databases or check that a structure remains
unchanged (e.g. after some transformation)
● Only InChI (and IUPAC names) are canonical by
definition, but canonical versions of other
formats can be generated
Ethanol can be represented in SMILES format as CCO or OCC (among others)
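To make the idea of a canonical representation concrete, here is a toy sketch that tries every atom ordering of a tiny molecular graph and keeps the smallest resulting description. This brute-force approach is an assumption chosen for clarity; it is not how real toolkits canonicalize (they use far more efficient algorithms), and it ignores hydrogens, charges and stereochemistry. The function name `canonical_form` is illustrative.

```python
from itertools import permutations


def canonical_form(elements, bonds):
    """Brute-force canonical labelling; only practical for very small graphs.

    elements: list of element symbols, one per heavy atom
    bonds:    list of (atom_index, atom_index, bond_order)
    """
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new index given to atom `old`
        new_elements = [None] * n
        for old, new in enumerate(perm):
            new_elements[new] = elements[old]
        new_bonds = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), order)
            for a, b, order in bonds
        )
        candidate = (tuple(new_elements), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol entered in two atom orders (as in CCO and OCC), plus dimethyl ether.
ethanol_cco = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_occ = (["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
dimethyl_ether = (["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

same = canonical_form(*ethanol_cco) == canonical_form(*ethanol_occ)
different = canonical_form(*ethanol_cco) != canonical_form(*dimethyl_ether)
```

Both orderings of ethanol give the same canonical form, while dimethyl ether (same atoms, different connectivity) does not.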
Canonical SMILES
● Atom order is the same whatever the input
● BUT, every toolkit has its own canonicalization
algorithm (which may change over time)
○ Consistent within the toolkit, not necessarily
outside
● Don’t assume that a given SMILES is in a
canonical form
○ If necessary, canonicalize them yourself
Ethanol as CCO, OCC, C(O)C all converted to CCO (by Toolkit#1)
Ethanol as CCO, OCC, C(O)C all converted to OCC (by Toolkit#2)
Depictions vs computers
● Are your structures drawn for humans or computers?
○ There are 2D depictions of stereochemistry that are instantly
interpretable by a human but which are commonly
misinterpreted by software
● Chirality of (a) is opposite to (c)
○ But what is the chirality of (b)?
● Possibilities:
○ Undefined (according to InChI, if close to 180°)
○ Same as (a) or (c) depending on which side of 180°
Rings with ‘implicit’ 3D
[Figure: what you drew / what you meant / what you may get]
Tetrahedral stereo gotchas
● R/S in IUPAC names, @/@@ in SMILES, 1/2 in
MOL files, +/- in InChIs
● None of these directly correspond to another
○ SMILES and Mol files describe stereo in terms of atom
order, but differ in where implicit hydrogens are
located
○ InChI and IUPAC names both use a complex algorithm
to determine the symbol
● Only two of these formats may always be used to
compare two structures:
○ R/S and /m layer (InChI)
○ Also @/@@, but only if canonical
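Because SMILES describes tetrahedral stereo in terms of atom order, the same centre can carry either @ or @@ depending on the order in which its neighbours are written. A small sketch of that bookkeeping follows, assuming every neighbour has a unique label and that the neighbour lists already include any implicit hydrogen in its proper position. The helper names `permutation_parity` and `relabel` are illustrative.

```python
def permutation_parity(old_order, new_order):
    """Return 0 if new_order is an even permutation of old_order, else 1."""
    if sorted(old_order) != sorted(new_order) or len(set(old_order)) != len(old_order):
        raise ValueError("orders must contain the same unique labels")
    position = {label: i for i, label in enumerate(old_order)}
    perm = [position[label] for label in new_order]
    seen = [False] * len(perm)
    swaps = 0
    for i in range(len(perm)):
        cycle_length = 0
        j = i
        while not seen[j]:          # walk one cycle of the permutation
            seen[j] = True
            j = perm[j]
            cycle_length += 1
        if cycle_length:
            swaps += cycle_length - 1
    return swaps % 2


def relabel(symbol, old_order, new_order):
    """Return the @/@@ symbol describing the same centre in the new neighbour order."""
    if symbol not in ("@", "@@"):
        raise ValueError("symbol must be '@' or '@@'")
    if permutation_parity(old_order, new_order) == 0:
        return symbol
    return "@@" if symbol == "@" else "@"


# N[C@](Br)(O)C rewritten with the neighbours in two other orders.
as_br_first = relabel("@", ["N", "Br", "O", "C"], ["Br", "O", "N", "C"])   # even permutation
as_c_first = relabel("@", ["N", "Br", "O", "C"], ["C", "Br", "O", "N"])    # one swap
```

This is why comparing raw @/@@ symbols between two non-canonical SMILES strings says nothing about whether the stereocentres match.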
Illuminating the black box
● Important to know what operations are being done
implicitly and what needs to be done explicitly
○ Are the error rates acceptable?
● Parse structure
○ Read list of atoms and bonds (incl. charges and isotopes)
○ [Mol, Mol2, Smi] Apply valence model
● Perceive aromaticity (or preserve from input)
● Perceive stereochemistry (or preserve from input)
● Optional: recognize atom / bond types, partial charges,
generate coordinates
c1ccccc1C(=O)Cl
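The "apply valence model" step is one of the implicit operations worth understanding. The sketch below fills in implicit hydrogens using the normal valences of the SMILES organic subset as listed in the OpenSMILES specification. It assumes a Kekulé form of the benzoyl chloride example above with explicit bond orders and neutral atoms; aromatic input would need kekulizing first, and the table and function names are illustrative.

```python
# Normal valences of the SMILES "organic subset" (OpenSMILES specification).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(element, bond_order_sum):
    """Hydrogens needed to reach the lowest normal valence >= the bond order sum."""
    for valence in NORMAL_VALENCES[element]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0   # already above every normal valence


# Benzoyl chloride as a Kekulé graph: C1=CC=CC=C1C(=O)Cl
elements = ["C", "C", "C", "C", "C", "C", "C", "O", "Cl"]
bonds = [
    (0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1),  # ring
    (5, 6, 1), (6, 7, 2), (6, 8, 1),                                   # C(=O)Cl
]

bond_order_sums = [0] * len(elements)
for a, b, order in bonds:
    bond_order_sums[a] += order
    bond_order_sums[b] += order

h_counts = [implicit_hydrogens(e, s) for e, s in zip(elements, bond_order_sums)]
total_h = sum(h_counts)
```

Different toolkits may apply different valence models, especially for charged atoms and elements outside the organic subset, which is one reason the same file can yield different hydrogen counts.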
Aromaticity
● Cheminformatics aromaticity not quite the
same as chemical aromaticity
○ Mainly a convenience for handling the fact that
the single/double bonds in Kekulé systems
may be set differently
● Usually a good idea to export structures in
Kekulé form
○ More portable - tools may reject some SMILES in
aromatic form if they cannot kekulize them
○ Allows tools to apply their own aromaticity model
○ Faster if detection of aromaticity can be avoided
2D or 3D?
[Figure panels: No Geometry, No Geometry, 2D Geometry, 3D Geometry]
CN1C2=C(C(C3=CC=CC=C3)=NCC1=O)C=C(Cl)C=C2
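Rather than assuming what geometry a file holds, it can be checked. The following sketch inspects the fixed-width coordinate columns of a V2000 molblock and reports whether the coordinates look like 0D, 2D or 3D. It assumes well-formed V2000 input; `make_molblock` exists only to build test input and `coordinate_dimension` is an illustrative name. A genuinely planar 3D molecule lying in the z = 0 plane would be reported as 2D.

```python
def make_molblock(atoms):
    """Build a minimal V2000 molblock (atoms only) from (symbol, x, y, z) tuples."""
    lines = ["example", "  sketch", ""]
    lines.append(f"{len(atoms):3d}{0:3d}  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y, z in atoms:
        lines.append(
            f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11
        )
    lines.append("M  END")
    return "\n".join(lines)


def coordinate_dimension(molblock):
    """Return 0, 2 or 3 depending on which coordinates are non-zero."""
    lines = molblock.splitlines()
    atom_count = int(lines[3][0:3])                 # counts line, first field
    coords = [
        (float(line[0:10]), float(line[10:20]), float(line[20:30]))
        for line in lines[4:4 + atom_count]
    ]
    if all(x == 0 and y == 0 and z == 0 for x, y, z in coords):
        return 0
    if all(z == 0 for _, _, z in coords):
        return 2
    return 3


no_geometry = make_molblock([("C", 0, 0, 0), ("O", 0, 0, 0)])
flat = make_molblock([("C", 0.0, 0.0, 0.0), ("O", 1.2, 0.5, 0.0)])
spatial = make_molblock([("C", 0.0, 0.0, 0.0), ("O", 1.1, 0.4, -0.3)])

dimensions = [coordinate_dimension(m) for m in (no_geometry, flat, spatial)]
```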
Going from 2D to 3D
● Key point - easy to get a 3D structure, but is it
the 3D structure you want (or need)?
○ Do you need a single ‘reasonable’ structure or a
large number of conformations?
● Many tools to generate an acceptable 3D
structure from a 2D format
○ Usually a low energy conformation obtained via
molecular mechanics
● Conformer generators
○ Important to think about appropriate energy
and/or RMSD cutoffs
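The effect of energy and RMSD cutoffs can be sketched as a simple pruning step. The example assumes conformers share the same atom ordering and are already superimposed (no alignment is done here), and the units (kcal/mol, Å) and function names are illustrative assumptions rather than any particular tool's behaviour.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-aligned coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def prune_conformers(conformers, energy_window, rmsd_cutoff):
    """Keep low-energy conformers that differ from those already kept.

    conformers: list of (energy, coords) pairs
    """
    ordered = sorted(conformers, key=lambda c: c[0])
    kept = []
    for energy, coords in ordered:
        if energy - ordered[0][0] > energy_window:
            break                                   # everything after is higher still
        if all(rmsd(coords, other) >= rmsd_cutoff for _, other in kept):
            kept.append((energy, coords))
    return kept


# Made-up two-atom "conformers" purely to exercise the logic.
conformers = [
    (0.0, [(0, 0, 0), (1, 0, 0)]),
    (0.5, [(0, 0, 0), (1, 0.01, 0)]),    # near-duplicate of the first
    (1.0, [(0, 0, 0), (0, 1, 0)]),       # geometrically distinct
    (10.0, [(0, 0, 0), (0, 0, 1)]),      # outside the energy window
]
kept = prune_conformers(conformers, energy_window=3.0, rmsd_cutoff=0.5)
kept_energies = [energy for energy, _ in kept]
```

Tightening the RMSD cutoff or widening the energy window increases the number of conformers kept, so both choices should follow from what the conformers are needed for.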
Moving from files to a database
● If you’re going beyond hundreds of molecules, consider
using a chemically aware database
○ Instant JChem
○ MolEditor
● Not too difficult to roll your own using Open Source
but requires programming skills
● Don’t use Excel (even with ChemDraw)
○ Missing data is not handled consistently
○ Can mangle identifiers (parse them as dates)
○ Complicates workflows
○ Formatting can hinder efficient data analyses
○ Difficult to have multiple users
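If data has to pass through a spreadsheet anyway, identifiers can at least be screened beforehand. The sketch below flags identifiers whose shape resembles a date. The patterns are a rough guess and an assumption of this example, not the rules any spreadsheet program actually applies, so they will both over-flag and miss cases.

```python
import re

# Assumed month spellings; deliberately broad.
MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec|march|april|june|july"

DATE_LIKE = [
    # month name followed by a number, e.g. "SEPT2" or "mar-01"
    re.compile(rf"^(?:{MONTHS})[-/ ]?\d{{1,4}}$", re.IGNORECASE),
    # two or three numeric groups separated by "-" or "/", e.g. "1-2" or "10/11/12"
    re.compile(r"^\d{1,4}[-/]\d{1,2}(?:[-/]\d{1,4})?$"),
]


def looks_date_like(identifier):
    """Return True if the identifier has a shape that might be read as a date."""
    text = identifier.strip()
    return any(pattern.search(text) for pattern in DATE_LIKE)


identifiers = ["SEPT2", "MARCH1", "1-2", "50-00-0", "CHEMBL25", "CPD-0001"]
at_risk = [name for name in identifiers if looks_date_like(name)]
```

Anything flagged is worth checking by round-tripping it through the spreadsheet and comparing with the original value.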
Verifying data quality
● This is all good if it’s your own compounds
● What about structures from someone else?
○ Need to check (& try to fix) nonsensical chemistry
● Check for
○ invalid valences, nonsense stereo, fragments
○ weird/invalid atoms, multiple radical centers
● Consider http://cvsp.chemspider.com/
Karapetyan et al, J. Cheminf, 2015
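Two of the checks listed above, invalid valences and multiple fragments, can be sketched on a plain atom/bond graph. The maximum-valence table is a simplified assumption for neutral atoms only (charged atoms such as oxonium would be flagged wrongly), and `check_structure` is an illustrative name, not part of any validation service.

```python
from collections import defaultdict

# Assumed upper limits for neutral atoms; deliberately simplified.
MAX_VALENCE = {"H": 1, "C": 4, "N": 5, "O": 2, "F": 1, "P": 5,
               "S": 6, "Cl": 1, "Br": 1, "I": 1}


def check_structure(elements, bonds):
    """Return (problems, fragment_count) for a molecular graph."""
    bond_sum = defaultdict(int)
    neighbours = defaultdict(set)
    for a, b, order in bonds:
        bond_sum[a] += order
        bond_sum[b] += order
        neighbours[a].add(b)
        neighbours[b].add(a)

    problems = []
    for index, element in enumerate(elements):
        if element not in MAX_VALENCE:
            problems.append(f"atom {index}: unrecognised element {element!r}")
        elif bond_sum[index] > MAX_VALENCE[element]:
            problems.append(f"atom {index}: {element} has valence {bond_sum[index]}")

    # Count connected components with a depth-first search.
    seen, fragments = set(), 0
    for start in range(len(elements)):
        if start in seen:
            continue
        fragments += 1
        stack = [start]
        while stack:
            atom = stack.pop()
            if atom not in seen:
                seen.add(atom)
                stack.extend(neighbours[atom] - seen)
    return problems, fragments


# A five-coordinate carbon, and ethanol stored together with a lone oxygen.
bad_carbon = (["C", "H", "H", "H", "H", "H"], [(0, i, 1) for i in range(1, 6)])
two_parts = (["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 1)])

carbon_report = check_structure(*bad_carbon)
parts_report = check_structure(*two_parts)
```

Whether a flagged record should be fixed, split or discarded is a decision for a chemist; the code only points at candidates.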
Structures are good. Are they useful?
● At this point you likely have a set of
correct (valid) structures
○ Are the structures useful for your purpose?
● A collection may have compounds with
problematic structures
○ Reactive groups, fluorophores, ADMET liabilities, …
● Consider rules & filters such as REOS, PAINS, Lilly
MedChem Rules
○ Implemented in commercial & OSS tools
○ Don’t use them blindly!
● Normalization?
○ E.g. -N(=O)=O or -[N+]([O-])=O (or doesn’t it matter?)
What are you really looking for?
● Similarity searches are a common task
● What you get depends on
○ How the structure was entered
○ Normalization of structures
● But also on what you’re looking for
○ Connectivity
○ Atom & bond type
○ Shape or pharmacophore features …
● You may be surprised by false negatives
○ Test your query on structures it should find
(but may not)
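A similarity search and its false negatives can be illustrated with the Tanimoto coefficient on sets of "on" fingerprint bits. The fingerprints below are made-up integer sets, not the output of any real fingerprinting method, and the function names and threshold are assumptions of this sketch.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_search(query_fp, library, threshold):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


query = {1, 2, 3, 4, 5}
library = {
    "close_analogue": {1, 2, 3, 4},      # score 4/5
    "expected_hit": {1, 2, 9, 10},       # score 2/7, e.g. entered or normalized differently
}

hits = similarity_search(query, library, threshold=0.7)
hit_names = [name for name, _ in hits]
missed = [name for name in library if name not in hit_names]
```

Running the query against a handful of structures that should be found, as the slide suggests, is the quickest way to see whether the representation or the threshold is hiding them.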
Because we love statistics & M/L
Alexander et al (2015)
Cherkasov et al (2014)
Huang & Fan (2013)
Chirico & Gramatica (2011)
Tropsha (2010)
Jain & Nicholls (2008)
Nicholls (2008)
Hawkins (2004)
Cronin & Schultz (2003)
• Look at your data, plot
your data
• Read up on statistics
• Linear models are a
good start
• Most of this is not
about cheminformatics
• But the notion of
chemical space plays a
key role in this area
Summary
Do
1. Choose appropriate file
formats
2. Check data quality
3. Get involved in the
cheminformatics
community
4. Trust but verify
Don’t
1. Treat chemical software as
a black box
2. Assume geometry
3. Use M/L blindly
4. Did we mention Excel
already?
Acknowledgements
● John May (NextMove Software)
● Adam Yasgar, Madhu Lal-Nag (NCATS)

 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Hauz Khas Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2Evidences of Evolution General Biology 2
Evidences of Evolution General Biology 2
 
Heredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsHeredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of Traits
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)
 
insect anatomy and insect body wall and their physiology
insect anatomy and insect body wall and their  physiologyinsect anatomy and insect body wall and their  physiology
insect anatomy and insect body wall and their physiology
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Cytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptxCytokinin, mechanism and its application.pptx
Cytokinin, mechanism and its application.pptx
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 

So I have an SD File... What do I do next?

  • 1. So I have an SD File … What do I do next? Rajarshi Guha & Noel O’Boyle NCATS & NextMove Software ACS National Meeting, Boston 2015
  • 2. What do you want to do? What is the core issue? • What you see on a screen isn’t necessarily what you get in a file • Need to be aware of how certain chemical concepts are handled in software Tasks to be considered • Searching for structures • Managing inventory • Linking / merging structure data to other data • Predicting properties or analysis of bioactivity data
  • 3. Which file format for data storage? ● The answer to this question is never XYZ or PDB o Don’t use a file format that throws away parts of your chemical structure (connectivity, bond orders or formal charges) o Software has to guess the missing information ● And probably not InChI o Without the ‘AuxInfo’, the chemical structure obtained from an InChI is not necessarily the same as the original (e.g. amides to imidic acids) ● SMILES and MOL are your go-to formats ● Widely supported (i.e. portable), can recreate the original structure
  • 4. The question of identity ● A file format is not the same as an identifier o The same molecule can be represented in different ways, even in the same format ● A “canonical” representation is required ○ To check identity, find or avoid duplicates, find overlap of two databases or check that a structure remains unchanged (e.g. after some transformation) ● Only InChI (and IUPAC names) are canonical by definition, but canonical versions of other formats can be generated Ethanol can be represented in SMILES format as CCO or OCC (among others)
  • 5. Canonical SMILES ● Atom order is the same whatever the input ● BUT, every toolkit has its own canonicalization algorithm (which may change over time) ○ Consistent within the toolkit, not necessarily outside ● Don’t assume that a given SMILES is in a canonical form ○ If necessary, canonicalize them yourself Ethanol as CCO, OCC, C(O)C all converted to CCO (by Toolkit#1) Ethanol as CCO, OCC, C(O)C all converted to OCC (by Toolkit#2)
  • 6. Depictions vs computers ● Are your structures drawn for humans or computers? ○ There are 2D depictions of stereochemistry that are instantly interpretable by a human but which are commonly misinterpreted by software ● Chirality of (a) is opposite to (c) ○ But what is the chirality of (b)? ● Possibilities: ○ Undefined (according to InChI, if close to 180°) ○ Same as (a) or (c) depending on which side of 180°
  • 7. Rings with ‘implicit’ 3D You drew You meant You may get
  • 8. Tetrahedral stereo gotchas ● R/S in IUPAC names, @/@@ in SMILES, 1/2 in MOL files, +/- in InChIs ● None of these directly correspond to another ○ SMILES and Mol files describe stereo in terms of atom order, but differ in where implicit hydrogens are located ○ InChI and IUPAC names both use a complex algorithm to determine the symbol ● Only two of these formats may always be used to compare two structures: ○ R/S and /m layer (InChI) ○ Also @/@@, but only if canonical
  • 9. Illuminating the black box ● Important to know what operations are being done implicitly and what needs to be done explicitly ○ Are the error rates acceptable? ● Parse structure ○ Read list of atoms and bonds (incl. charges and isotopes) ○ [Mol, Mol2, Smi] Apply valence model ● Perceive aromaticity (or preserve from input) ● Perceive stereochemistry (or preserve from input) ● Optional: recognize atom / bond types, partial charges, generate coordinates c1ccccc1C(=O)Cl
  • 10. Aromaticity ● Cheminformatics aromaticity not quite the same as chemical aromaticity ○ Mainly a convenience for handling the fact that the single/double bonds in Kekulé systems may be set differently ● Usually a good idea to export structures in Kekulé form ○ More portable - tools may reject some SMILES in aromatic form if they cannot kekulize them ○ Allows tools to apply their own aromaticity model ○ Faster if detection of aromaticity can be avoided
  • 11. 2D or 3D? No Geometry No Geometry 2D Geometry 3D Geometry CN1C2=C(C(C3=CC=CC=C3)=NCC1=O)C=C(Cl)C=C2
  • 12. Going from 2D to 3D ● Key point - easy to get a 3D structure, but is it the 3D structure you want (or need)? ○ Do you need a single ‘reasonable’ structure or a large number of conformations? ● Many tools to generate an acceptable 3D structure from a 2D format ○ Usually a low energy conformation obtained via molecular mechanics ● Conformer generators ○ Important to think about appropriate energy and/or RMSD cutoffs
  • 13. Moving from files to a database ● If you’re going beyond hundreds of molecules, consider using a chemically-aware database ○ Instant JChem ○ MolEditor ● Not too difficult to roll your own using open-source tools, but requires programming skills ● Don’t use Excel (even with ChemDraw) ○ Missing data is not handled consistently ○ Can mangle identifiers (parse them as dates) ○ Complicates workflows ○ Formatting can hinder efficient data analyses ○ Difficult to have multiple users
  • 14. Verifying data quality ● This is all good if it’s your own compounds ● What about structures from someone else? ○ Need to check (& try to fix) nonsensical chemistry ● Check for ○ invalid valences, nonsense stereo, fragments ○ weird/invalid atoms, multiple radical centers ● Consider http://cvsp.chemspider.com/ Karapetyan et al, J. Cheminf, 2015
  • 15. Structures are good. Are they useful? ● At this point you likely have a set of correct (valid) structures ○ Are the structures useful for your purpose? ● A collection may have compounds with problematic structures ○ Reactive groups, fluorophores, ADMET liabilities, … ● Consider rules & filters such as REOS, PAINS, Lilly MedChem Rules ○ Implemented in commercial & OSS tools ○ Don’t use them blindly! ● Normalisation? ○ E.g. -N(=O)=O or -[N+](=O)[O-] (or doesn’t matter?)
  • 16. What are you really looking for? ● Similarity searches are a common task ● What you get depends on ○ How the structure was entered ○ Normalization of structures ● But also on what you’re looking for ○ Connectivity ○ Atom & bond type ○ Shape or pharmacophore features … ● May be surprised by false negatives ○ Test your query on structures it should find (but may not)
  • 17. Because we love statistics & M/L Alexander et al (2015) Cherkasov et al (2014) Huang & Fan (2013) Chirico & Gramatica (2011) Tropsha (2010) Jain & Nicholls (2008) Nicholls (2008) Hawkins (2004) Cronin & Schultz (2003) • Look at your data, plot your data • Read up on statistics • Linear models are a good start • Most of this is not about cheminformatics • But the notion of chemical space plays a key role in this area
  • 18. Summary Do 1. Choose appropriate file formats 2. Check data quality 3. Get involved in the cheminformatics community 4. Trust but verify Don’t 1. Treat chemical software as a black box 2. Assume geometry 3. Use M/L blindly 4. Did we mention Excel already?
  • 19. Acknowledgements ● John May (NextMove Software) ● Adam Yasgar, Madhu Lal-Nag (NCATS)
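
As a rough illustration of the "parse structure" step on slide 9 and the fragment check on slide 14, the sketch below reads a V2000 SD file using only the Python standard library. It is a minimal sketch, not how any particular toolkit does it: the function names (`read_sdf_records`, `parse_record`, `count_fragments`, `make_record`) and the demo records are hypothetical, it assumes well-formed V2000 records with `> <NAME>` style data headers, and it ignores charges, isotopes, stereo and the valence model entirely. A real workflow would normally hand this job to a cheminformatics toolkit.

```python
def read_sdf_records(text):
    """Yield each SD file record as a list of lines; records end with '$$$$'."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def parse_record(lines):
    """Parse one V2000 record into title, atom symbols, bonds and data fields."""
    # Counts line (4th line): atom count in columns 1-3, bond count in 4-6.
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    # Atom lines: x, y, z take 10 columns each, then a space, then the symbol.
    atoms = [line[31:34].strip() for line in lines[4:4 + n_atoms]]
    bond_lines = lines[4 + n_atoms:4 + n_atoms + n_bonds]
    # (begin atom, end atom, bond order); atom indices are 1-based in the file.
    bonds = [(int(b[0:3]), int(b[3:6]), int(b[6:9])) for b in bond_lines]
    data, key = {}, None
    for line in lines[4 + n_atoms + n_bonds:]:
        if line.startswith(">"):
            key = line[line.index("<") + 1:line.rindex(">")]
            data[key] = ""
        elif key and line.strip():
            # Values are kept as text, so an ID like "0001" is not mangled.
            data[key] = (data[key] + "\n" + line).strip()
        else:
            key = None  # blank line (or 'M  END') closes the current field
    return {"title": lines[0].strip(), "atoms": atoms, "bonds": bonds, "data": data}


def count_fragments(n_atoms, bonds):
    """Count disconnected fragments with a small union-find over the bonds."""
    parent = list(range(n_atoms + 1))  # index 0 unused (atoms are 1-based)

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, _ in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(1, n_atoms + 1)})


def make_record(title, symbols, bonds, data):
    """Build a minimal V2000 record (all coordinates zero) for the demo only."""
    lines = [title, "  sketch", "",
             f"{len(symbols):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000"]
    lines += [f"{0:10.4f}{0:10.4f}{0:10.4f} {s:<3s} 0  0" for s in symbols]
    lines += [f"{a:3d}{b:3d}{o:3d}  0" for a, b, o in bonds]
    lines += ["M  END"]
    for k, v in data.items():
        lines += [f"> <{k}>", v, ""]
    lines += ["$$$$"]
    return "\n".join(lines)


# Hypothetical two-record file: ethanol, and NaCl drawn as two unbonded atoms.
sdf = "\n".join([
    make_record("ethanol", ["C", "C", "O"], [(1, 2, 1), (2, 3, 1)], {"ID": "0001"}),
    make_record("sodium chloride", ["Na", "Cl"], [], {"ID": "0002"}),
]) + "\n"

mols = [parse_record(r) for r in read_sdf_records(sdf)]
for m in mols:
    n_frag = count_fragments(len(m["atoms"]), m["bonds"])
    print(m["title"], m["atoms"], m["data"], "fragments:", n_frag)
```

A record with more than one fragment is not necessarily wrong (salts and mixtures are legitimate), but it is the kind of structure worth flagging for a closer look before searching or modelling.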

Editor's Notes

  1. Docking software adjusts dihedral angles to generate conformations but leaves bond angles unchanged. Molecular descriptor software may compute values assuming a ‘flat’ 3D structure.
  2. Applies to inventory maintenance, integrating data from multiple sources
  3. This is more oriented towards biologists than chemists
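
The point about identity (slide 4) and integrating data from multiple sources can be sketched with plain dictionaries. The example below assumes each record already carries a standard InChI computed elsewhere by a toolkit; the two sources, their field names (`stock_g`, `supplier`) and the values are hypothetical and only serve to show that CCO and OCC land on the same key even though the SMILES strings differ.

```python
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"

# Hypothetical inventory and supplier lists; SMILES are written differently
# in each source, so comparing the SMILES strings directly would miss ethanol.
source_a = [
    {"smiles": "CCO", "inchi": ETHANOL, "stock_g": 25.0},
    {"smiles": "CO", "inchi": METHANOL, "stock_g": 10.0},
]
source_b = [
    {"smiles": "OCC", "inchi": ETHANOL, "supplier": "Vendor X"},
    {"smiles": "CC(O)=O", "inchi": ACETIC_ACID, "supplier": "Vendor Y"},
]


def index_by_identifier(records, key="inchi"):
    """Index records by a canonical identifier, refusing silent duplicates."""
    index = {}
    for rec in records:
        if rec[key] in index:
            raise ValueError(f"duplicate structure within one source: {rec[key]}")
        index[rec[key]] = rec
    return index


a, b = index_by_identifier(source_a), index_by_identifier(source_b)
overlap = set(a) & set(b)  # structures present in both sources

# Merge field by field; the identifier, not the SMILES text, decides identity.
merged = {}
for inchi in set(a) | set(b):
    merged[inchi] = {**a.get(inchi, {}), **b.get(inchi, {})}

print("overlap:", sorted(overlap))
print("merged records:", len(merged))
```

Where both sources hold the same structure, the later source's `smiles` field overwrites the earlier one in this sketch; a real merge would need an explicit policy for conflicting fields.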