How the InChI identifier is used to
underpin our online chemistry
databases at RSC
Antony Williams, Valery Tkachenko
and Ken Karapetyan
ACS San Francisco
August 2014
What can I say that I haven’t said?
YouTube InChIKey Collision Movie
InChIs are for machines but do
have a human aspect…
Many Names, One Structure
Structure Identifiers
OPSIN (chemical name to structure) http://opsin.ch.cam.ac.uk/
• InChI support systems…
InChI mapping helps a lot!
• We wanted to map together chemical data on
the web
• We knew that chemical name mapping was
difficult but dictionaries were useful
• It is InChI that became the foundation
technology for our database…
• We accepted all the limitations of InChI
• We lived with the “Useful but not ideal”
• And so….
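The mapping idea above can be sketched in a few lines: records from different sources carry different names, but if each one has an InChIKey, they collapse onto a single compound entry. This is only an illustrative sketch, not the ChemSpider implementation; the source labels and the record layout are made up for the example.

```python
from collections import defaultdict

# Hypothetical records from three sources: (source, name as deposited, InChIKey).
records = [
    ("vendor_a", "aspirin", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("journal_b", "acetylsalicylic acid", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("database_c", "2-acetoxybenzoic acid", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("vendor_a", "caffeine", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"),
]


def merge_by_inchikey(rows):
    """Group source records that share an InChIKey into one compound entry."""
    merged = defaultdict(lambda: {"names": set(), "sources": set()})
    for source, name, key in rows:
        merged[key]["names"].add(name)
        merged[key]["sources"].add(source)
    return dict(merged)


compounds = merge_by_inchikey(records)
for key, entry in compounds.items():
    print(key, sorted(entry["names"]), sorted(entry["sources"]))
```

Three differently named deposits end up under one key, which is the "many names, one structure" behaviour the dictionary relies on.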
• ~32 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
ChemSpider
So where can we travel???
InChI String Search via Google
So give me InChIKeys…
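Why InChIKeys for web searching? A key is a fixed-length, 27-character string of uppercase letters and two hyphens, so it behaves as a single search token. A rough sketch of checking that a string is InChIKey-shaped and turning it into a quoted web query follows; the helper names are hypothetical, and the shape check does not prove that the key belongs to a real structure.

```python
import re
from urllib.parse import quote_plus

# Shape of a standard InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey_shaped(text):
    """Return True if the text has the 27-character InChIKey layout."""
    return INCHIKEY_RE.fullmatch(text) is not None


def web_search_url(inchikey):
    """Build a quoted-phrase search URL for one InChIKey."""
    if not is_inchikey_shaped(inchikey):
        raise ValueError("not an InChIKey-shaped string: " + repr(inchikey))
    return "https://www.google.com/search?q=" + quote_plus('"' + inchikey + '"')


print(web_search_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```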
And where can we travel???
NEW 15th Edition
*The name THE MERCK INDEX is owned by Merck Sharp & Dohme Corp., a subsidiary of Merck & Co.,
Inc., Whitehouse Station, N.J., U.S.A., and is licensed to The Royal Society of Chemistry for use in the
U.S.A. and Canada.
Where else is RSC using InChIs?
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea
prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml)
were charged into a glass reaction vessel equipped with a mechanical
stirrer, thermometer and reflux condenser.
The reaction mixture was heated at reflux with stirring, for a
period of about one-half hour.
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product
N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea
as a solid residue.
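One simple piece of text mining a passage like this is dictionary lookup: spot known chemical names and attach a structure to each. The sketch below is only an illustration with a two-entry, hand-made dictionary mapping names to SMILES; it is not the pipeline used on the archive, and a real system would also need name-to-structure conversion (for example OPSIN) for systematic names that are in no dictionary.

```python
import re

# Hypothetical mini-dictionary: chemical name -> SMILES.
NAME_TO_SMILES = {
    "thionyl chloride": "O=S(Cl)Cl",
    "benzene": "c1ccccc1",
}


def tag_names(text, dictionary):
    """Return (offset, name, SMILES) for every dictionary name found in text."""
    # Longest names first, so a long name wins over a shorter name inside it.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        name = match.group(1).lower()
        hits.append((match.start(), name, dictionary[name]))
    return hits


sentence = ("thionyl chloride (5 ml) and benzene (50 ml) were charged "
            "into a glass reaction vessel")
hits = tag_names(sentence, NAME_TO_SMILES)
for offset, name, smiles in hits:
    print(offset, name, smiles)
```

Each tagged structure could then be converted to an InChI with a cheminformatics toolkit, so that the mention links to the same compound record as every other name for it.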
SO MANY reactions!
Extracting our Archive
• What could we get from our archive?
• Find chemical names and generate structures
• Find chemical images and generate structures
• Find reactions
• Find data (MP, BP, LogP) and deposit
• Find figures and database them
• Find spectra (and link to structures)
• And of course InChIfy the entire collection
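As a rough illustration of the "find data" step, a melting point can often be pulled out of experimental text with a pattern match. The regular expression and the sample sentence below are invented for this sketch; real articles use many more phrasings, and the actual extraction tools are not shown in these slides.

```python
import re

# Matches e.g. "mp 120-122 °C", "m.p. 118 °C", "melting point: 95.5 °C".
MP_RE = re.compile(
    r"\b(?:mp|m\.p\.|melting point)\s*:?\s*"
    r"(\d+(?:\.\d+)?)"                   # single value or lower bound
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"    # optional upper bound of a range
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return a list of (low, high) melting points in degrees Celsius."""
    results = []
    for low, high in MP_RE.findall(text):
        low_value = float(low)
        high_value = float(high) if high else low_value
        results.append((low_value, high_value))
    return results


sample = "White solid, mp 120-122 °C. A second batch gave m.p. 118 °C."
melting_points = extract_melting_points(sample)
print(melting_points)
```

Once a value is attached to a structure, it can be deposited against that compound's InChI alongside data from other sources.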
After we mine the Archive
Models published from data
Text-mining Data to compare
Progress to date
• We have text-mined all 21st century articles:
more than 100k articles from 2000-2013
• Marked up with XML and published onto the
HTML forms of the articles
• Required multiple iterations of dictionaries,
markup and text mining
• New visualization tools in development – not
just chemical names. Chemical and biomedical
terms are marked up too!
MedChemComm markup
InChIs under our “repository”
• Scientific publications are a summary of work
• Is all work reported?
• How much science is lost to pruning?
• What of value sits in notebooks and is lost?
• Publications offering access to “real data”?
• How much data is lost?
• How many compounds never reported?
• How many syntheses fail or succeed?
• How many characterization measurements?
New Repository Architecture
doi: 10.1007/s10822-014-9784-5
What are we building?
• We are building the “RSC Data Repository”
• Containers for compounds, reactions, analytical
data, tabular data
• Algorithms for data validation and standardization
• Flexible indexing and search technologies
• A platform for modeling data and hosting existing
models and predictive algorithms
New Repository Architecture
• Data tier: Compounds, Reactions, Spectra, Materials, Documents
• Data access tier: Compounds API, Reactions API, Spectra API,
Materials API, Documents API
• User interface components tier: Compounds Widgets, Reactions Widgets,
Spectra Widgets, Materials Widgets, Documents Widgets
• User interface tier (examples): Analytical Laboratory application,
Electronic Laboratory Notebook, Chemical Inventory application,
paid 3rd party integrations (various platforms – SharePoint, Google, etc.)
Deposition of Data
Compounds
Reactions
Analytical data
Crystallography data
InChIs under the repository
• All compound-based data handling will of
course connect with InChIs
• Compounds
• Reactions
• Compound-spectra matching
• Etc. etc. etc…
For Deposition of Data
• Developing systems that provide
feedback to users regarding data quality
• Validate/standardize chemical compounds
• Check for balanced reactions
• Check spectral data
• EXAMPLE Future work
• Properties – compare experimental to predicted
• Automated structure verification - NMR
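One of the checks listed above, the balanced-reaction test, can be sketched as a comparison of atom counts on each side. This is a minimal sketch that assumes reactions arrive as (coefficient, molecular formula) pairs and that formulae contain no brackets or charges; the function names are hypothetical and the repository's own validation rules are not described here.

```python
import re
from collections import Counter

ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets or charges)."""
    counts = Counter()
    for element, number in ELEMENT_RE.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum atom counts over one side, given (coefficient, formula) pairs."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in count_atoms(formula).items():
            total[element] += coefficient * n
    return total


def is_balanced(reactants, products):
    """Return True if both sides contain the same atoms in the same numbers."""
    return side_total(reactants) == side_total(products)


# Methane combustion: CH4 + 2 O2 -> CO2 + 2 H2O
print(is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")]))
# The same reaction with a wrong coefficient on O2
print(is_balanced([(1, "CH4"), (1, "O2")], [(1, "CO2"), (2, "H2O")]))
```

A depositor whose reaction fails the check could be shown which elements differ, using the two `side_total` results.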
RSC Cheminformatics Projects
• RSC as a provider of support for grant-based
projects
• Utilizing ChemSpider initially as a platform
• Developing Chemical Registry Service
• Utilizing core architecture and widgets to
serve the projects
The PharmaSea Website
• ChemSpider IDs and InChIs/InChIKeys
made open and available for linking
• Exposed via the Open PHACTS RDF export
• A structure ID standard to enable further
linking across the semantic web of science
InChIs and DDP
Electronic Notebook Data
• Development work integrating chemistry
into the Southampton Labtrove notebook
• Stoichiometry table development
• Analytical data integration
• “ChemTrove” includes chemistry widgets
and InChI as an important data field
Side Effects of InChI Usage
SMILES by comparison…
Standardization Issues
Depiction based on molfile
Standardize
• Use the SRS as guidance for standardization
• Adjust as necessary to our needs
Nitro groups
Salt and Ionic Bonds
What needs to happen?
• If we could validate
• Catch errors in databases (and clean)
• Proactively catch errors in publications/patents
• Reduce junk in the ether – improve QUALITY!
• If we standardized
• Interlinking should improve
Validate and Standardize
CVSP Filtering
CVSP Filtering of DrugBank
DrugBank (ca. 6000 records)
• 38 records with InChI not matching the
structure, e.g. DB08521, DB08187
• 24 records where names (IUPAC_NAME) did
not match the structure, e.g. DB08346
• 38 records with SMILES not matching the
structure, e.g. DB08293
• 53 records with unusual valence, e.g. DB01983
with boron(V)
ChEMBL (1.3 million records)
• 11,020 records with 4 bonds and zero charge,
e.g. CHEMBL501101 or CHEMBL501973
• 271 records with hypervalent oxygen (e.g.
CHEMBL2219679), carbon (e.g. 1005895),
boron, chlorine, iodine or phosphine
• 6,177 records where direction of bond makes
no sense, e.g. CHEMBL12760 and
CHEMBL34704
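A valence filter of the kind that produces counts like these can be sketched as a rule table plus a loop over atoms. The table of limits and the atom record layout below are assumptions made for illustration only; they are not the CVSP rule set.

```python
# Assumed maximum bond-order sums for neutral atoms; an illustrative table only.
MAX_NEUTRAL_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2}


def unusual_valence(atoms):
    """Return (index, element, bond_order_sum) for atoms over the assumed limit.

    Each atom is a dict with 'element', 'bond_order_sum' and 'charge'.
    Charged atoms and elements missing from the table are skipped.
    """
    flagged = []
    for index, atom in enumerate(atoms):
        limit = MAX_NEUTRAL_VALENCE.get(atom["element"])
        if limit is None or atom["charge"] != 0:
            continue
        if atom["bond_order_sum"] > limit:
            flagged.append((index, atom["element"], atom["bond_order_sum"]))
    return flagged


# Hypothetical atoms taken from one record.
atoms = [
    {"element": "B", "bond_order_sum": 5, "charge": 0},  # boron(V): flagged
    {"element": "N", "bond_order_sum": 4, "charge": 1},  # charged: skipped
    {"element": "C", "bond_order_sum": 4, "charge": 0},  # normal carbon
    {"element": "O", "bond_order_sum": 3, "charge": 0},  # neutral, 3 bonds: flagged
]
flagged = unusual_valence(atoms)
print(flagged)
```

Run over every record of a database, the flagged list gives per-rule counts of the sort reported for DrugBank and ChEMBL.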
ChemSpider Standardization
• Entire ChemSpider database will be
standardized using modified FDA rule set
• Original Molfiles will be standardized and all
properties (predicted properties, SMILES,
InChIs, Names) will be regenerated
• CLEAN’ed database to compounds repository
• Standardization procedures automatically
applied to all future depositions
Recent Data (last week)
Internet Data
Data Repositories and InChI
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
If InChI had not been developed…
• Database linking would suffer dramatically
• The web would not be “structure searchable”
• Cheminformatics tools would likely not be
linking to public domain databases in the
same way
• We wouldn’t be here discussing….
• And ChemSpider would not have been built
Acknowledgments
• The InChI team
• The entire RSC cheminformatics team…
• Daniel Lowe for the text mining work
• Igor Tetko for OCHEM modeling
Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

  • 35. InChIs under our “repository” • Scientific publications are a summary of work • Is all work reported? • How much science is lost to pruning? • What of value sits in notebooks and is lost? • Publications offering access to “real data”? • How much data is lost? • How many compounds never reported? • How many syntheses fail or succeed? • How many characterization measurements?
  • 36. New Repository Architecture doi: 10.1007/s10822-014-9784-5
  • 37. What are we building? • We are building the “RSC Data Repository” • Containers for compounds, reactions, analytical data, tabular data • Algorithms for data validation and standardization • Flexible indexing and search technologies • A platform for modeling data and hosting existing models and predictive algorithms
  • 38. New Repository Architecture • Data tier: Compounds, Reactions, Spectra, Materials, Documents • Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API • User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets • User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, Paid 3rd party integrations (various platforms – SharePoint, Google, etc.)
  • 44. InChIs under the repository • All compound-based data handling will of course connect with InChIs • Compounds • Reactions • Compound-spectra matching • Etc. etc. etc…
  • 45. For Deposition of Data • Developing systems that provide feedback to users regarding data quality • Validate/standardize chemical compounds • Check for balanced reactions • Check spectral data • EXAMPLE Future work • Properties – compare experimental to predicted • Automated structure verification - NMR
  • 46. RSC Cheminformatics Projects • RSC as a provider of support for grant-based projects • Utilizing ChemSpider initially as a platform • Developing Chemical Registry Service • Utilizing core architecture and widgets to serve the projects
  • 47.
  • 48.
  • 49.
  • 51. • ChemSpider IDs and InChIs/InChIKeys made open and available for linking • Exposed via the Open PHACTS RDF export • A structure ID standard to enable further linking across the semantic web of science
  • 53. Electronic Notebook Data • Development work integrating chemistry into the Southampton Labtrove notebook • Stoichiometry table development • Analytical data integration • “ChemTrove” includes chemistry widgets and InChI as an important data field
  • 54.
  • 55. Side Effects of InChI Usage
  • 57. Side Effects of InChI Usage
  • 59. Standardize • Use the SRS as guidance for standardization • Adjust as necessary to our needs
  • 61. Salt and Ionic Bonds
  • 62. What needs to happen? • If we could validate • Catch errors in databases (and clean) • Proactively catch errors in publications/patents • Reduce junk in the ether – improve QUALITY! • If we standardized • Interlinking should improve
  • 65. CVSP Filtering of DrugBank
  • 66. DrugBank (ca. 6000 records) • 38 records with InChI not matching the structure, e.g. DB08521, DB08187 • 24 records where names (IUPAC_NAME) did not match the structure, e.g. DB08346 • 38 records with SMILES not matching the structure, e.g. DB08293 • 53 records with unusual valence, e.g. DB01983 with boron(V)
  • 67. ChEMBL (1.3 million records) • 11,020 records with 4 bonds and zero charge, e.g. CHEMBL501101 or CHEMBL501973 • 271 records with hypervalent oxygen (e.g. CHEMBL2219679), carbon (e.g. 1005895), boron, chlorine, iodine or phosphine • 6,177 records where direction of bond makes no sense, e.g. CHEMBL12760 and CHEMBL34704
  • 68. ChemSpider Standardization • Entire ChemSpider database will be standardized using modified FDA rule set • Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will all be regenerated • CLEAN’ed database to compounds repository • Standardization procedures automatically applied to all future depositions
  • 70. Data Repositories and InChI [diagram; labels: Internet Data, Commercial Software, Pre-competitive Data, Open Science, Open Data, Publishers, Educators, Open Databases, Chemical Vendors; material types: Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals]
  • 71. If InChI was not developed… • Database linking would suffer dramatically • The web would not be “structure searchable” • Cheminformatics tools would likely not be linking to public domain databases in the same way • We wouldn’t be here discussing…. • And ChemSpider would not have been built
  • 72. Acknowledgments • The InChI team • The entire RSC cheminformatics team… • Daniel Lowe for the text mining work • Igor Tetko for OCHEM modeling
  • 73. Thank you Email: williamsa@rsc.org ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
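
Slides 15 and 51 rely on InChIKeys as searchable, linkable identifiers. As a rough illustration of why a key works well as a link target, the sketch below splits a key into its published blocks: a 14-letter connectivity hash, an 8-letter hash of the remaining layers, a standard/non-standard flag, a version letter and a protonation letter. The function names split_inchikey and same_skeleton are hypothetical and are not part of ChemSpider or any RSC API; the code only checks the layout of a key and cannot confirm that the key matches a real structure.

```python
import re

# InChIKey layout: 14 letters, hyphen, 8 letters + flag + version, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Return the blocks of an InChIKey as a dict, or None if the layout is wrong."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        return None
    connectivity, other_layers, flag, version, protonation = match.groups()
    return {
        "connectivity": connectivity,   # hash of the molecular skeleton
        "other_layers": other_layers,   # hash of stereo, isotope and other layers
        "standard": flag == "S",        # "S" = standard InChI, "N" = non-standard
        "version": version,             # "A" for InChI version 1
        "protonation": protonation,     # "N" = neutral
    }


def same_skeleton(key_a, key_b):
    """True when both keys are well formed and share the first (connectivity) block."""
    a, b = split_inchikey(key_a), split_inchikey(key_b)
    return a is not None and b is not None and a["connectivity"] == b["connectivity"]
```

For example, split_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N") (the standard InChIKey of aspirin) reports the connectivity block BSYNRYMUTXBXSQ and the standard flag set. Comparing only the first block, as same_skeleton does, is one simple way to group records that differ in stereochemistry or isotopes; because the key is a hash, a match is strong evidence of identity rather than proof.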

Editor's Notes

  1. The content of the 15th Edition was produced by the Editorial team and Merck, and the book was published by us earlier this year. It is available for purchase by both individuals and libraries. In addition to the book, we have produced an online version of The Merck Index, which is available solely through the Royal Society of Chemistry. This is available as a subscription, with a one-year free trial for individual purchasers of the book.
  2. Compound list (rather than articles)