Chemicals, Chemical Identifiers and Navigating Through Databases
Antony Williams
UNC Chapel Hill, October 2010
Chemistry on the Internet
- Where do you source chemistry information?
- What can you trust online?
- How can you recognize potential issues?
- Cross-referencing and curating data
What is the Structure of Vitamin K?
MeSH
- A lipid cofactor that is required for normal blood
  clotting. Several forms of vitamin K have been
  identified: VITAMIN K 1 (phytomenadione) derived
  from plants, VITAMIN K 2 (menaquinone) from
  bacteria, and synthetic naphthoquinone provitamins,
  VITAMIN K 3 (menadione). Vitamin K 3 provitamins,
  after being alkylated in vivo, exhibit the
  antifibrinolytic activity of vitamin K. Green leafy
  vegetables, liver, cheese, butter, and egg yolk are
  good sources of vitamin K.
What is the Structure of Vitamin K1?
Wikipedia
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
PubChem
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
- Variants of systematic names on PubChem
  - 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
  - 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
  - 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
  - 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
  - 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
  - 2-methyl-3-[(E)-3,7,11,15-tetramethyl
  - 2-methyl-3-(3,7,11,15-tetramethyl
  - 2-methyl-3-[(E)-3,7,11,15-tetramethyl
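The name variants above differ only in how much stereochemistry they declare. A minimal sketch of counting the declared descriptors is given here, using the truncated fragments exactly as they appear on the slide. The function name and the expected count of three (one E/Z double bond plus the centres at positions 7 and 11 of the side chain) are assumptions made for illustration, not part of the original slides.

```python
import re

# Name fragments exactly as they appear (truncated) on the slide.
NAME_FRAGMENTS = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# A descriptor is an optional locant followed by R, S, E or Z (e.g. "7R", "E").
DESCRIPTOR = re.compile(r"^\d*[RSEZ]$")

# Assumption for this sketch: one double bond plus two stereocentres.
EXPECTED_DESCRIPTORS = 3


def stereo_descriptors(name):
    """Return the stereo descriptors found in parenthesised groups of a name."""
    found = []
    for group in re.findall(r"\(([^()]*)\)", name):
        tokens = [token.strip() for token in group.split(",")]
        # Keep the group only if every token looks like a stereo descriptor,
        # so locant lists such as "(3,7,11,15-...)" are ignored.
        if all(DESCRIPTOR.match(token) for token in tokens):
            found.extend(tokens)
    return found


counts = [len(stereo_descriptors(name)) for name in NAME_FRAGMENTS]
fully_specified = [count == EXPECTED_DESCRIPTORS for count in counts]
print(counts)
```

Only the first four variants declare all three descriptors; the rest leave at least one stereo element undefined. Descriptors written with primed locants (for example 2'E) are not recognised by this simple pattern.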
Bioassay Data are Associated…
Lack of Stereochemistry
ChEBI – Manual Curation
Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)
Molfiles
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M  END
Molfiles
- Molfiles are the primary exchange format between
  structure drawing packages
- Can differ between drawing packages
- Most commonly carry X,Y coordinates for layout
- Can support polymers, organometallics, etc.
- Can carry 3D coordinates
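To make the connection table on the earlier Molfiles slide concrete, here is a minimal sketch that reads its counts line, atom block and bond block. It assumes whitespace-separated fields, which is how the table survives in these slides. A real V2000 molfile uses fixed-width columns and three header lines, so a production reader should use a cheminformatics toolkit instead. The function name `parse_ctab` is hypothetical.

```python
from collections import Counter

# Connection table as shown on the slide (the three header lines are not shown there).
CTAB = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M  END
"""


def parse_ctab(text):
    """Parse a whitespace-separated V2000 connection table (sketch only)."""
    rows = [line.split() for line in text.strip().splitlines()]
    # Counts line: the first two fields are the atom and bond counts.
    n_atoms, n_bonds = int(rows[0][0]), int(rows[0][1])
    # Atom block: x, y, z, element symbol (remaining fields ignored here).
    atoms = [(float(r[0]), float(r[1]), float(r[2]), r[3])
             for r in rows[1:1 + n_atoms]]
    # Bond block: first atom, second atom, bond order, stereo flag.
    bonds = [tuple(int(v) for v in r[:4])
             for r in rows[1 + n_atoms:1 + n_atoms + n_bonds]]
    return atoms, bonds


atoms, bonds = parse_ctab(CTAB)
elements = Counter(symbol for *_, symbol in atoms)
wedged = [bond for bond in bonds if bond[3] != 0]   # stereo flag 1 = up, 6 = down
is_flat = all(z == 0.0 for _, _, z, _ in atoms)      # 2D layout: every z is zero
print(dict(elements), wedged, is_flat)
```

The record holds five carbons, two nitrogens and three oxygens laid out in 2D, and its only stereochemical information is the single wedged bond between atoms 4 and 10. If a drawing package drops that flag, the stereocentre is silently lost.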
SMILES (http://en.wikipedia.org/wiki/SMILES)
- SMILES is a common format
- Can support polymers, organometallics, etc.
- Does NOT carry X,Y or Z coordinates for layout, so
  it requires layout algorithms, which can be problematic!
- Generally different between drawing packages
Stereo
Tautomers
SMILES
- ACD/Labs
  - CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
- OpenEye
  - CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
- ChEMBL
  - CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C
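The three strings above can be compared with a few lines of code. The sketch below is a rough illustration, not a SMILES parser: it tokenises atoms with a regular expression, counts heavy atoms, and records whether double-bond geometry marks (`/` or `\`) are present. The helper names are hypothetical. Deciding whether two SMILES really describe the same structure needs canonicalisation in a cheminformatics toolkit, which is outside this sketch.

```python
import re
from collections import Counter

# SMILES strings from the slide, with the line wraps joined.
SMILES = {
    "ACD/Labs": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C",
}

# Bracket atoms, two-letter halogens, then single-letter organic-subset atoms.
ATOM = re.compile(r"\[([^\]]+)\]|(Cl|Br|[BCNOPSFI]|[bcnops])")


def heavy_atoms(smiles):
    """Count non-hydrogen atoms per element (rough tokeniser, sketch only)."""
    counts = Counter()
    for bracket, plain in ATOM.findall(smiles):
        if bracket:
            # Skip an optional isotope number, then read the element symbol.
            symbol = re.match(r"\d*([A-Z][a-z]?|[a-z])", bracket).group(1)
        else:
            symbol = plain
        if symbol != "H":
            counts[symbol.capitalize()] += 1
    return counts


def has_bond_geometry(smiles):
    """True if the string carries / or \\ double-bond direction marks."""
    return "/" in smiles or "\\" in smiles


composition = {source: heavy_atoms(s) for source, s in SMILES.items()}
geometry = {source: has_bond_geometry(s) for source, s in SMILES.items()}
print(composition, geometry)
```

All three strings contain the same heavy atoms, yet no two are textually identical, and only the OpenEye string encodes the double-bond geometry. A plain text comparison would therefore treat one compound as three.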
The InChI Identifier
InChI
- SINGLE code base managed by IUPAC and integrated
  into drawing packages, so there is no variability
  between packages as there is with SMILES
- InChI strings can be converted back to structures,
  but as with SMILES they carry no layout
- Well adopted by the community (databases,
  publishers, blogs, Wikipedia), so good for searching
  the internet
Multiple Layers
Tautomers – “Mobile H Perception”
Double Bond Orientation
Stereo
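The layer slides above (mobile H, double-bond orientation, stereo) correspond to prefixed segments of the InChI string. The following sketch splits a string on `/` and labels each segment by its prefix letter. The two example strings (ethanol and L-alanine) are not from the slides and are included only as illustrations. The function names are hypothetical.

```python
# Example standard InChI strings (illustrative, not taken from the slides).
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"


def split_inchi(inchi):
    """Split an InChI string into (layer, value) pairs, in order of appearance."""
    prefix, *segments = inchi.split("/")
    if not prefix.startswith("InChI="):
        raise ValueError("not an InChI string")
    layers = [("version", prefix[len("InChI="):])]
    for segment in segments:
        if segment and segment[0].islower():
            # Prefixed layer: c = connections, h = hydrogens, b = double-bond
            # stereo, t/m/s = tetrahedral stereo, i = isotopes, f = fixed H, ...
            layers.append((segment[0], segment[1:]))
        else:
            layers.append(("formula", segment))
    return layers


def is_standard(inchi):
    """Standard InChI carries an 'S' after the version number."""
    return dict(split_inchi(inchi))["version"].endswith("S")


def has_mobile_h(inchi):
    """Mobile hydrogens appear as '(H,...)' groups in the h layer."""
    return any(key == "h" and "(H" in value for key, value in split_inchi(inchi))


def has_stereo(inchi):
    """True if a double-bond (b) or tetrahedral (t) stereo layer is present."""
    return any(key in ("b", "t") for key, _ in split_inchi(inchi))


for name, inchi in (("ethanol", ETHANOL), ("L-alanine", L_ALANINE)):
    print(name, is_standard(inchi), has_mobile_h(inchi), has_stereo(inchi))
```

A missing `t` or `b` layer is a quick signal that a record carries no stereochemistry at all, which is the situation the vitamin K1 examples illustrate.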
Checking for Stereochemistry
Use your drawing package!
InChI Strings Hash to InChIKeys
PubChem InChIKeys
- MBWXNTAXLNYFJB-NKFFZRIASA-N
- MBWXNTAXLNYFJB-LKUDQCMESA-N
- MBWXNTAXLNYFJB-UHFFFAOYSA-N
- MBWXNTAXLNYFJB-FAKCLFGASA-N
- MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
- MBWXNTAXLNYFJB-ODDKJFTJSA-N
- MBWXNTAXLNYFJB-KSVLJPARSA-N
- MBWXNTAXLNYFJB-UDCSOKOMSA-N
- MBWXNTAXLNYFJB-JHBCSKSVSA-N
- MBWXNTAXLNYFJB-JXAKDHTRSA-N
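The keys listed above share their first block and differ only in the second. A short sketch of taking an InChIKey apart follows. The class and function names are hypothetical, and the field descriptions summarise the standard InChIKey layout (14 letters, then 8 letters plus two flag characters, then one protonation character).

```python
from dataclasses import dataclass

# InChIKeys copied from the slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

# Second-block hash seen when there are no stereo or isotope layers to hash.
NO_EXTRA_LAYERS = "UHFFFAOY"


@dataclass(frozen=True)
class InChIKey:
    skeleton: str      # 14 letters: hash of the connectivity layers
    detail: str        # 8 letters: hash of the remaining layers (stereo, isotopes)
    standard: bool     # flag character 'S' means standard InChI
    version: str       # InChI version character ('A' for version 1)
    protonation: str   # 'N' means neutral


def parse_inchikey(key):
    """Split an InChIKey into its documented blocks (no checksum validation)."""
    first, second, third = key.split("-")
    if (len(first), len(second), len(third)) != (14, 10, 1):
        raise ValueError(f"unexpected InChIKey layout: {key}")
    return InChIKey(first, second[:8], second[8] == "S", second[9], third)


parsed = [parse_inchikey(key) for key in KEYS]
skeletons = {key.skeleton for key in parsed}
without_stereo = [k for k, p in zip(KEYS, parsed) if p.detail == NO_EXTRA_LAYERS]
print(skeletons, without_stereo)
```

All ten keys hash to the same skeleton, so they are ten records for one connectivity; only the key ending in UHFFFAOYSA-N carries no stereo or isotope information.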
Databases and Standardization
InChI
- No support for polymers, organometallics
- Many option settings can lead to variability and
  make integration across databases difficult;
  the FixedH option is especially problematic
- “Slight” chance of collisions of InChIKeys
- VERY USEFUL FOR INTEGRATING THE WEB
Vancomycin
Search Molecular Skeleton
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
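One way to approximate the two search modes above is to index records by the full InChIKey and by its first block. This is a sketch under stated assumptions, not a description of how ChemSpider implements its searches. The record IDs are hypothetical, and the keys are the vitamin K1 keys from the earlier slide rather than vancomycin, so the hit counts differ from those on the slide.

```python
from collections import defaultdict

# (record id, InChIKey) pairs; the ids are hypothetical.
RECORDS = [
    ("rec-1", "MBWXNTAXLNYFJB-NKFFZRIASA-N"),
    ("rec-2", "MBWXNTAXLNYFJB-LKUDQCMESA-N"),
    ("rec-3", "MBWXNTAXLNYFJB-UHFFFAOYSA-N"),
    ("rec-4", "MBWXNTAXLNYFJB-NKFFZRIASA-N"),  # hypothetical duplicate deposition
]


def build_indexes(records):
    """Index record ids by full key and by the 14-letter skeleton block."""
    by_full, by_skeleton = defaultdict(list), defaultdict(list)
    for record_id, key in records:
        by_full[key].append(record_id)
        by_skeleton[key.split("-")[0]].append(record_id)
    return by_full, by_skeleton


def search(key, by_full, by_skeleton):
    """Return hits for both search modes for one query key."""
    return {
        "full_molecule": by_full.get(key, []),
        "skeleton": by_skeleton.get(key.split("-")[0], []),
    }


by_full, by_skeleton = build_indexes(RECORDS)
hits = search("MBWXNTAXLNYFJB-NKFFZRIASA-N", by_full, by_skeleton)
print(hits)
```

The skeleton search returns every record with the same connectivity regardless of stereochemistry or labelling, which is why it gives many more hits than the full-molecule search.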
Where is chemistry online?
- Encyclopedic articles (Wikipedia)
- Chemical vendor databases
- Metabolic pathway databases
- Property databases
- Patents with chemical structures
- Drug Discovery data
- Scientific publications
- Compound aggregators
- Blogs/Wikis and Open Notebook Science
Linked Data on the Web
Taken from: Rafael Sidis’ Blog
www.chemspider.com
Search for a Chemical…by name
Available Information…
- Linked to vendors, safety data, toxicity, metabolism
How do we build it?
- 25 million chemicals from 400 data sources
- We deal in Molfiles or SDF files, including
  coordinates
- We do rudimentary filtering (valence checking,
  charge imbalance) prior to deposition
- We have our own “business logic” to standardize
- We use InChI to “aggregate tautomers” to one
  record
- We link out to external sites where possible using
  their IDs
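The rudimentary filtering mentioned above (valence checking and charge imbalance) can be sketched as follows. The valence table, the charge adjustment for N and O, and the function name are assumptions made for illustration; they are not ChemSpider's actual business logic. Hydrogens are treated as implicit, so only over-valence is detected.

```python
# Maximum number of bonds for neutral atoms (assumed, deliberately small table).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def check_record(atoms, bonds):
    """Return a list of problems for one structure.

    atoms: list of (element symbol, formal charge)
    bonds: list of (first atom, second atom, bond order), atoms numbered from 1
    """
    used = [0] * len(atoms)
    for first, second, order in bonds:
        used[first - 1] += order
        used[second - 1] += order

    problems = []
    for index, ((symbol, charge), total) in enumerate(zip(atoms, used), start=1):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            continue  # element not covered by this sketch's table
        if symbol in ("N", "O"):
            limit += charge  # e.g. N+ may carry 4 bonds, O- only 1
        if total > limit:
            problems.append(
                f"atom {index} ({symbol}): {total} bonds, at most {limit} expected")

    net = sum(charge for _, charge in atoms)
    if net != 0:
        problems.append(f"net charge {net:+d}")
    return problems


NITRO_BONDS_SEPARATED = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]
NITRO_BONDS_PENTAVALENT = [(1, 2, 1), (2, 3, 2), (2, 4, 2)]

# Nitro group drawn charge-separated: passes both checks.
ok = check_record([("C", 0), ("N", 1), ("O", 0), ("O", -1)], NITRO_BONDS_SEPARATED)
# Nitro group drawn with a five-bonded neutral nitrogen: valence problem.
bad_valence = check_record([("C", 0), ("N", 0), ("O", 0), ("O", 0)], NITRO_BONDS_PENTAVALENT)
# Lone cation with no counter-ion: charge imbalance.
bad_charge = check_record([("Na", 1)], [])
print(ok, bad_valence, bad_charge)
```

Checks this simple catch gross drawing errors only; they say nothing about whether the structure is the right one for its name, which is the harder problem raised in the following slides.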
Inherited Errors
- We have inherited errors from every database…
  all public compound databases, including ours,
  have errors
- “Incorrect” structures (assertions, timelines, etc.)
- “Incorrect” names associated with structures
- Properties
- Links
- Publications
- ENORMOUS CHALLENGE
Compounds and Identifiers
Be careful searching by Name!
- Determining the correct structure by name
  searching is difficult online!
- Good, but not perfect:
  - Wikipedia
  - ChEBI/ChEMBL
  - ChemIDPlus
  - ChemSpider
- Be VERY careful with MOST databases
Validating structures
- Check for “full stereo”, and use stereo descriptors
  in particular when checking!
- Check the quality of the associated data sources
- Check against reference literature when available,
  but remember that it can be wrong too
- Question EVERYTHING!
Online Curation
- Online databases generally do NOT allow
  curation or annotation
- If you find errors they stay there!
- ChemSpider is unique…immediate curation
- ChemSpider live demo following this lecture
  - Searching
  - Deposition and Curation
  - ChemSpider SyntheticPages
Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
