Standardization and Generation of Parents for the Open PHACTS Chemical Registry System
Karen Karapetyan, Valery Tkachenko
Colin Batchelor, Antony Williams
Validation checks
• Correct file format (SDF, MOL, CDX, etc.)
• “Valid” chemical structure
• Valid atoms (not query atoms)
• Valid bonds
• Valid valences
• Valid charges
• SP3 stereo
• Synonyms
• Names (name to structure)
• SMILES, InChIs (SMILES/InChI to structure)
• XRefs
Severity assigned to every validation issue
Filtering by severity and by issues
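The severity and filtering idea can be sketched as follows. This is a minimal illustration, not the registry's implementation: the severity levels, the `Issue` fields, and the issue codes are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the slides do not name the actual severities.
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    record_id: str
    code: str  # illustrative issue code, e.g. "query_atom"
    severity: Severity


def filter_issues(issues, min_severity=Severity.INFO, codes=None):
    """Keep issues at or above min_severity, optionally restricted to given codes."""
    return [
        issue
        for issue in issues
        if issue.severity >= min_severity and (codes is None or issue.code in codes)
    ]


issues = [
    Issue("SID 1", "query_atom", Severity.ERROR),
    Issue("SID 1", "undefined_sp3_stereo", Severity.WARNING),
    Issue("SID 2", "synonym_structure_mismatch", Severity.INFO),
]

errors_only = filter_issues(issues, min_severity=Severity.ERROR)
stereo_only = filter_issues(issues, codes={"undefined_sp3_stereo"})
```

Using an ordered enum keeps "filter by severity" a single comparison, while the optional `codes` set covers "filter by issue".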
Standardization – Organometallics/Salts
• Always disconnect N, O, and F from metals
• Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
• Ionize free metals with carboxylic acids (metals of Groups I and II)
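The first two disconnection rules can be read as a decision on a single metal-neighbor bond. The sketch below is an assumption-laden illustration: the element sets are abbreviated and chosen for the example (the slides do not list which elements count as metals or transition metals), any non-metal neighbor is treated as a nonmetal, and the third rule (ionizing free metals with carboxylic acids) is not covered.

```python
# Assumed element sets, abbreviated for illustration (d-block, periods 4 to 6).
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}
OTHER_METALS = {
    "Li", "Na", "K", "Rb", "Cs", "Be", "Mg", "Ca", "Sr", "Ba",
    "Al", "Ga", "In", "Sn", "Tl", "Pb", "Bi",
}
METALS = TRANSITION_METALS | OTHER_METALS


def should_disconnect(metal, neighbor):
    """Decide whether a metal-neighbor bond is broken under the two rules."""
    if metal not in METALS:
        return False  # not a metal bond; the rules do not apply
    if neighbor in {"N", "O", "F"}:
        return True  # rule 1: always disconnect N, O, F from metals
    if neighbor in METALS:
        return False  # metal-metal bonds are not addressed by the slides
    # rule 2: other nonmetals are disconnected from transition metals, except Hg
    return metal in TRANSITION_METALS and metal != "Hg"


print(should_disconnect("Fe", "Cl"))  # transition metal, nonmetal neighbor
print(should_disconnect("Hg", "C"))   # Hg is the stated exception
```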
Standardization SMIRKS
(based on InChI normalization and on FDA SRS)
Examples of InChI normalization
• [*;H+:1]>>[*;H:1]
• [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
• [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Examples of FDA SRS rules
• [n:1]=[O:2]>>[n+:1][O-:2]
• [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
• [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
• Thiopurine: [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
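Each SMIRKS rule pairs atoms on the reactant side with atoms on the product side through their map numbers. A quick way to read a rule is to compare the two sets of map numbers. The sketch below uses only a regular expression, not a cheminformatics toolkit, so it checks the bookkeeping rather than the chemistry; the helper names are made up for this example.

```python
import re

# Matches the atom-map number that closes a bracket atom, e.g. ":2]" in "[N+:2]".
MAP_RE = re.compile(r":(\d+)\]")


def _maps(side):
    """Collect the atom-map numbers used on one side of a SMIRKS string."""
    return {int(number) for number in MAP_RE.findall(side)}


def atom_maps(smirks):
    """Return (reactant_maps, product_maps) for a SMIRKS of the form A>>B."""
    reactants, products = smirks.split(">>")
    return _maps(reactants), _maps(products)


diazo = "[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]"
salt = "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]"

diazo_reactants, diazo_products = atom_maps(diazo)
salt_reactants, salt_products = atom_maps(salt)

# Mapped atoms that appear only on the reactant side are removed by the rule.
removed_by_salt_rule = salt_reactants - salt_products
```

In the diazo rule both sides carry the same maps, so only charges and bond orders change. In the ammonium carboxylate rule, map 6 (the explicit acid hydrogen) is missing from the product side, which is how the rule expresses the proton leaving the acid while the nitrogen goes from H3 to H4.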
Standardization
• Dearomatize
• Double bond with adjacent wiggly single bond
• Fold hydrogen atoms with no up or down bonds
Standardization
• Remove symmetric stereocenters
• Turn off chiral flag if no up or down bonds
• Do layout
Chiral flag is set
Standardization – partially ionized acids
(move the proton from a strong acid to a weaker one)
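One way to picture the proton move: keep the number of ionized acid sites fixed, but place the negative charges on the strongest acids. The sketch below assumes each acid site comes with a pKa estimate; the function name, the site labels, and the pKa values are hypothetical and only illustrate the ordering logic, not how the registry actually ranks acids.

```python
def redistribute_protons(sites):
    """Reassign ionization so the strongest acids (lowest pKa) are the ionized ones.

    sites: list of (label, pka, is_ionized) tuples with unique labels.
    Returns a list in the same order with the same number of ionized sites.
    """
    n_ionized = sum(1 for _, _, ionized in sites if ionized)
    strongest_first = sorted(sites, key=lambda site: site[1])
    ionized_labels = {label for label, _, _ in strongest_first[:n_ionized]}
    return [(label, pka, label in ionized_labels) for label, pka, _ in sites]


# Illustrative input: the weak acid is ionized while the strong acid is not.
before = [
    ("sulfonic acid", -2.0, False),   # pKa values are rough placeholders
    ("carboxylic acid", 4.5, True),
]
after = redistribute_protons(before)
```

Here the proton leaves the sulfonic acid and lands on the carboxylate, so the total charge is unchanged and only its position moves.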
For each compound, parent generation is attempted
“Tautomerism in large databases”, Sitzmann et al.,
J. Comput. Aided Mol. Des. (2010)
Parent | Description | RDF
Charge-Insensitive | An attempt is made to neutralize ionized acids and bases. Envisioned as an ongoing improvement as new cases appear. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000460
Isotope-Insensitive | Isotopes are replaced by the common atomic weight. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459
Stereo-Insensitive | SP3 and double-bond stereo are removed. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000456
Tautomer-Insensitive | Tautomer canonicalization attempts to generate a canonical tautomer. | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486
Super Parent | Generated by applying all of the above modifications. | void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458
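The RDF column can be kept as a small lookup keyed by parent type. The sketch below copies the predicates and CHEMINF terms from the table and prints them as Turtle-style property lines. The function name and dictionary keys are illustrative, no subject URI is given because the slides do not show one, and the `dul:expresses` predicate on the stereo entry is assumed by analogy with the other rows.

```python
# (link predicate, expressed CHEMINF term) per parent type, taken from the table.
PARENT_LINKS = {
    "charge": ("skos:closeMatch", "cheminf:CHEMINF_000460"),
    "isotope": ("skos:closeMatch", "cheminf:CHEMINF_000459"),
    "stereo": ("skos:closeMatch", "cheminf:CHEMINF_000456"),  # dul:expresses assumed
    "tautomer": ("skos:closeMatch", "cheminf:CHEMINF_000486"),
    "super": ("skos:broadMatch", "cheminf:CHEMINF_000458"),
}


def linkset_properties(parent_type):
    """Return the two Turtle-style property lines for one parent type."""
    predicate, term = PARENT_LINKS[parent_type]
    return [
        f"void:linkPredicate {predicate} ;",
        f"dul:expresses {term} .",
    ]


for line in linkset_properties("super"):
    print(line)
```

Only the super parent uses skos:broadMatch; the four single-aspect parents are linked with skos:closeMatch.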
Fragment
[Diagram: Deposited Substances, Compounds, Parents]
Deposited Substances
• SID 1: SDF1; DataSource1; Synonym1, Synonym2; XRef1
• SID 2: SDF2; DataSource2; Synonym1, Synonym3; XRef2
Compounds
• OPS_ID 1: standardized molecule; DataSource1, DataSource2; Synonym1, Synonym2, Synonym3; XRef1, XRef2
• OPS_ID 2: standardized MOL; DataSource3, DataSource4; Synonym4, Synonym5, Synonym6; XRef3, XRef4
Parents
• Charge Parent (OPS_ID 6)
• Isotope Parent (OPS_ID 4)
• Stereo Parent (OPS_ID 3)
• Tautomer Parent (OPS_ID 5)
• Super Parent (OPS_ID 7)
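The diagram suggests that deposited substances sharing the same standardized structure collapse into one compound that pools their data sources, synonyms, and cross-references. A minimal sketch of that aggregation follows; the class and field names, and the `structure_key` value, are hypothetical and stand in for whatever identity key the registry uses.

```python
from dataclasses import dataclass, field


@dataclass
class Substance:
    sid: str
    data_source: str
    structure_key: str  # identity key of the standardized structure (assumed field)
    synonyms: set = field(default_factory=set)
    xrefs: set = field(default_factory=set)


@dataclass
class Compound:
    structure_key: str
    data_sources: set = field(default_factory=set)
    synonyms: set = field(default_factory=set)
    xrefs: set = field(default_factory=set)


def register(substances):
    """Group substances by structure key and pool their annotations."""
    compounds = {}
    for substance in substances:
        key = substance.structure_key
        if key not in compounds:
            compounds[key] = Compound(key)
        compound = compounds[key]
        compound.data_sources.add(substance.data_source)
        compound.synonyms |= substance.synonyms
        compound.xrefs |= substance.xrefs
    return compounds


deposited = [
    Substance("SID 1", "DataSource1", "KEY-A", {"Synonym1", "Synonym2"}, {"XRef1"}),
    Substance("SID 2", "DataSource2", "KEY-A", {"Synonym1", "Synonym3"}, {"XRef2"}),
]
compounds = register(deposited)
merged = compounds["KEY-A"]
```

With the two substances from the diagram, the single resulting compound carries both data sources, three distinct synonyms, and both cross-references.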
What do we use as the chemical identity of the standardized records
(primary compound key)?
• Standard InChI/InChIKey (currently used by ChemSpider)
• Absolute SMILES (isomeric, canonical)
Drawbacks
• SMILES: can be too long; no accepted standard; needs to be hashed
• Standard InChI
  • does not distinguish between undefined and unknown stereo
  • by default does some basic tautomer canonicalization (not needed in the new model)
  • by default assumes absolute stereo
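The "needs to be hashed" point for SMILES can be illustrated with a fixed-length digest. This is only a sketch: SHA-256 is an arbitrary choice made here, and the key is meaningful only if every SMILES string comes from the same canonicalization procedure, which is exactly the "no accepted standard" drawback listed above.

```python
import hashlib


def smiles_key(canonical_smiles):
    """Return a fixed-length hex key for an already-canonicalized SMILES string."""
    return hashlib.sha256(canonical_smiles.encode("utf-8")).hexdigest()


# Example inputs; canonical form is assumed, not computed here.
ethanol_key = smiles_key("CCO")
acetic_acid_key = smiles_key("CC(=O)O")
```

However long the SMILES, the key is always 64 hexadecimal characters, which makes it usable as a database column of fixed width.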
Proposed Solution
Non-standard InChI with options: SUU, SLUUD, FixedH, SUCF
• Much more sensitive to stereo description
• Fixes mobile hydrogens (so tautomers can be distinguished)
• Handles “AND-ed” relative stereo
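For reference, the four options can be kept next to a short note on what each one does and joined into an option string for whichever InChI tool is in use. The descriptions paraphrase the InChI documentation. The helper name is made up, and the option prefix is left as a parameter because it depends on the tool and platform (the reference executable typically takes "-" on Linux and "/" on Windows).

```python
# Non-standard InChI options named on the slide, with brief paraphrased meanings.
OPS_INCHI_OPTIONS = {
    "SUU": "always include omitted unknown/undefined stereo",
    "SLUUD": "give unknown and undefined stereo different labels",
    "FixedH": "include the fixed-hydrogen layer",
    "SUCF": "use the chiral flag: on means absolute stereo, off means relative",
}


def inchi_option_string(options, prefix="-"):
    """Join option names into a single string, each with the given prefix."""
    return " ".join(f"{prefix}{name}" for name in options)


unix_style = inchi_option_string(OPS_INCHI_OPTIONS)
windows_style = inchi_option_string(OPS_INCHI_OPTIONS, prefix="/")
```

Any of these options makes the result a non-standard InChI, so the identifiers are not interchangeable with Standard InChI keys held elsewhere.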
Thanks
We would appreciate any comments.
For comments or questions email
karapetyank@rsc.org

AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 

Standardization and Generation of Parents for Open PHACTS Chemical Registry System

  • 1. Standardization and Generation of Parents for Open PHACTS Chemical Registry System Karen Karapetyan, Valery Tkachenko Colin Batchelor, Antony Williams
  • 2. Validation checks
      • Correct file format (SDF, MOL, CDX, etc.)
      • “Valid” chemical structure
      • Valid atoms (not query atoms)
      • Valid bonds
      • Valid valences
      • Valid charges
      • SP3 stereo
      • Synonyms
      • Names (name to structure)
      • SMILES, InChIs (SMILES/InChI to structure)
      • XRefs
  • 3. Severity assigned to every validation issue
  • 4. Filtering by severity and by issues
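Slides 3 and 4 describe attaching a severity to every validation issue and then filtering by severity or by issue type. A minimal sketch of that idea follows; the severity levels, the issue codes, and the names `Severity`, `Issue`, and `filter_issues` are all hypothetical, since the slides do not show the actual scale or data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the slides do not list the real scale.
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass(frozen=True)
class Issue:
    record_id: str
    code: str          # hypothetical issue code, e.g. "query_atom"
    severity: Severity


def filter_issues(issues, min_severity=Severity.INFO, codes=None):
    """Keep issues at or above min_severity, optionally limited to given codes."""
    return [
        issue for issue in issues
        if issue.severity >= min_severity
        and (codes is None or issue.code in codes)
    ]


issues = [
    Issue("rec-1", "query_atom", Severity.ERROR),
    Issue("rec-2", "undefined_stereo", Severity.WARNING),
    Issue("rec-3", "synonym_mismatch", Severity.INFO),
]

# Filter by severity, then by issue type.
errors_and_warnings = filter_issues(issues, min_severity=Severity.WARNING)
stereo_only = filter_issues(issues, codes={"undefined_stereo"})
```

Using an ordered enum keeps "this severity or worse" a plain comparison, which is one simple way to support the dropdown-style filtering the slides mention.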
  • 5. Standardization – Organometallics/Salts
      • Always disconnect N, O, and F from metals
      • Disconnect nonmetals (except N, O, F) from transition metals (except Hg)
      • Ionize free metal with carboxylic acid (metals of Groups I and II)
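The first two disconnection rules on slide 5 can be read as a decision on a single metal–nonmetal bond. The sketch below encodes that reading using element symbols only. The element sets are illustrative and incomplete, the classification of metals is an assumption, and `should_disconnect` is a hypothetical name rather than anything from the authors' software.

```python
# Illustrative, incomplete element sets (assumptions, not from the slides).
ALWAYS_DISCONNECT = {"N", "O", "F"}
NONMETALS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Se", "Br", "I"}
TRANSITION_METALS = {
    "Sc", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Y", "Zr", "Nb", "Mo", "Tc", "Ru", "Rh", "Pd", "Ag", "Cd",
    "Hf", "Ta", "W", "Re", "Os", "Ir", "Pt", "Au", "Hg",
}
OTHER_METALS = {"Li", "Na", "K", "Rb", "Cs", "Be", "Mg", "Ca", "Sr", "Ba",
                "Al", "Ga", "In", "Sn", "Tl", "Pb", "Bi"}
METALS = TRANSITION_METALS | OTHER_METALS


def should_disconnect(a: str, b: str) -> bool:
    """Decide whether a bond between elements a and b would be disconnected."""
    metal, other = (a, b) if a in METALS else (b, a)
    if metal not in METALS or other not in NONMETALS:
        return False  # not a metal-nonmetal bond
    if other in ALWAYS_DISCONNECT:
        return True   # rule 1: N, O, F are always disconnected from metals
    # rule 2: other nonmetals are disconnected from transition metals except Hg
    return metal in TRANSITION_METALS and metal != "Hg"
```

Under this reading a Hg–C bond is kept while a Fe–C bond is disconnected. The third rule (ionizing free Group I and II metals with carboxylic acids) acts on whole structures and is not covered here.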
  • 6. Standardization SMIRKS (based on InChI normalization and on FDA SRS)
      Examples of InChI normalization
      • [*;H+:1]>>[*;H:1]
      • [O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
      • [N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
      Examples of FDA SRS rules
      • [n:1]=[O:2]>>[n+:1][O-:2]
      • [*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
      • [N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
      • Thiopurine: [H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
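Each SMIRKS rule on slide 6 has a reactant pattern and a product pattern separated by `>>`, with atom-map numbers (`:1`, `:2`, ...) tying atoms on one side to the other. As a small illustration, the sketch below extracts the map numbers on each side so you can see which mapped atoms a rule drops. It is a plain string check, not a SMIRKS engine; `atom_maps` is a hypothetical helper, and it assumes the rule has no agents section between the two `>` characters.

```python
import re

# An atom-map number is the ":<digits>" immediately before a closing bracket.
MAP_RE = re.compile(r":(\d+)\]")


def atom_maps(smirks: str):
    """Return (reactant_maps, product_maps) as sets of integers."""
    reactants, products = smirks.split(">>")
    left = {int(m) for m in MAP_RE.findall(reactants)}
    right = {int(m) for m in MAP_RE.findall(products)}
    return left, right


# The ammonium carboxylate rule from the slide.
rule = "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]"
left, right = atom_maps(rule)
dropped = left - right   # mapped atoms with no counterpart in the product
```

For this rule the only dropped map is the acidic hydrogen (map 6), which matches the intent of moving the proton onto the nitrogen. Actually applying the rules requires a cheminformatics toolkit with SMIRKS support.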
  • 7. Standardization
      • Dearomatize
      • Double bond with adjacent wiggly single bond
      • Fold hydrogen atoms with no up or down bonds
  • 8. Standardization
      • Remove symmetric stereocenters
      • Turn off chiral flag if no up or down bonds
      • Do Layout
      Chiral flag is set
  • 9. Standardization – partially ionized acids (move the proton from a strong acid to a weaker one)
  • 10. For each Compound parent generation is attempted
      “Tautomerism in large databases”, Sitzmann and others, J.Comput Aided Mol Des (2010)
      Parent | Description | RDF
      Charge-Unsensitive | An attempt is made to neutralize ionized acids and bases. Envisioned to be an ongoing improvement while new cases appear. | void:linkPredicate skos:closeMatch dul:expresses cheminf:CHEMINF_000460;
      Isotope-Unsensitive | Isotopes replaced by common weight | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000459
      Stereo-Unsensitive | SP3 and double bond stereo removed | void:linkPredicate skos:closeMatch cheminf:CHEMINF_000456
      Tautomer-Unsensitive | Tautomer canonicalization is attempting to generate a canonical tautomer | void:linkPredicate skos:closeMatch; dul:expresses cheminf:CHEMINF_000486;
      Super Parent | Super parent is generated by applying modifications of all of the above | void:linkPredicate skos:broadMatch; dul:expresses cheminf:CHEMINF_000458;
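To make one of the parent types on slide 10 concrete, here is a deliberately simplified sketch of the isotope step ("isotopes replaced by common weight") operating on a SMILES string. In SMILES an isotope is written as a leading integer inside a bracket atom, so dropping that integer leaves the element at its default weight. This is only a string-level illustration under that assumption; it is not the authors' implementation, and `strip_isotopes` is a hypothetical name.

```python
import re

# A leading integer right after "[" and before an element symbol is the isotope.
ISOTOPE_RE = re.compile(r"\[\d+(?=[A-Za-z])")


def strip_isotopes(smiles: str) -> str:
    """Remove isotope labels from bracket atoms in a SMILES string."""
    return ISOTOPE_RE.sub("[", smiles)


labelled = "[13CH4]"
unlabelled = strip_isotopes(labelled)
heavy_water = strip_isotopes("[2H]O[2H]")
```

A real pipeline would do this on the parsed molecule and then re-canonicalize, since the stripped string is not guaranteed to be in canonical form.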
  • 11. [Diagram: Deposited Substances, Compounds, and Parents. Deposited substances SID 1 (SDF1, DataSource1, Synonym1, Synonym2, XRef1) and SID 2 (SDF2, DataSource2, Synonym1, Synonym3, XRef2) point to compound OPS_ID 1 (standardized molecule; DataSource1, DataSource2; Synonym1, Synonym2, Synonym3; XRef1, XRef2). Parents shown: Fragment, Charge Parent (OPS_ID 6), Isotope Parent (OPS_ID 4), Stereo Parent (OPS_ID 3), Tautomer Parent (OPS_ID 5), Super Parent (OPS_ID 7). A second compound, OPS_ID 2 (standardized MOL; DataSource3, DataSource4; Synonym4, Synonym5, Synonym6; XRef3, XRef4), is also shown.]
  • 12.
  • 13. What do we use as chemical identity of the standardized records (primary compound key)?
      • Standard InChI/InChIKey (currently used in ChemSpider)
      • Absolute SMILES (isomeric canonical)
      Drawbacks
      • SMILES: can be too long; no accepted standard; needs to be hashed
      • Standard InChI
          • does not distinguish between undefined and unknown stereo
          • by default does some basic tautomer canonicalization (not needed in the new model)
          • by default assumes absolute stereo
      Proposed solution: non-standard InChI with options SUU SLUUD FixedH SUCF
      • much more sensitive to stereo description
      • fixes mobile hydrogens (so tautomers can be distinguished)
      • handles “AND-ed” relative stereo
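Slide 13 names two practical details: the InChI option set for the proposed non-standard InChI, and the need to hash long identifiers into a fixed-length key. The sketch below only assembles the option string and hashes an identifier string; it does not generate an InChI. The option names come from the slide, while the `-` prefix (with `/` as an alternative), the choice of SHA-256, and the function names are assumptions made for illustration.

```python
import hashlib

# Option names as listed on the slide.
INCHI_OPTIONS = ("SUU", "SLUUD", "FixedH", "SUCF")


def inchi_option_string(prefix: str = "-") -> str:
    """Join the options into one string; the prefix convention is an assumption."""
    return " ".join(prefix + opt for opt in INCHI_OPTIONS)


def hashed_key(identifier: str) -> str:
    """Fixed-length key for an arbitrarily long identifier (SHA-256, my choice)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


options = inchi_option_string()
key = hashed_key("C[C@H](N)C(=O)O")
```

Note that hashing is purely textual: two different SMILES strings for the same molecule give different keys, so the identifier has to be canonical before it is hashed.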
  • 14. Thanks. We would appreciate any comments. For comments or questions, email karapetyank@rsc.org

Editor's Notes

  1. I would like to start by defining what a “quality” record means, because that is what the validation part of the CVSP is about. A chemical record has several aspects to its quality. The one that is easiest to check is file format correctness. Each file format has its own formatting rules that a record in that format needs to follow. This type of file validation is done by all the database maintainers that have deposition systems. Another, more relevant, type of validation is chemical validation. A record can be perfectly formatted from a file format point of view but make no sense chemically. Structure validation is something that is usually overlooked or not prioritized highly. Some of the chemical validations are atom validation (checking that each atom is a legal chemical atom, along with its charges and valences) and checking that stereo is defined. Synonym validation is very useful for spotting records that are inconsistent and worth pointing out to the depositor. Often during data export/import, synonyms and/or structures are manipulated and the relationships between them can become faulty, so attempting to verify that synonym and structure actually match is worth doing. SMILES/InChIs: again, the relationship between the chemical record and the depositor’s provided InChI or SMILES can be faulty. As I’ll show later, this inconsistency could reveal a systematic issue with a data set, as sometimes the InChI or SMILES does not match the structure.
  2. The result of processing is a list of records with validation messages in the middle. If a record was standardized, a “Standardized” column is present with the structure.
  3. Here is the bigger DrugBank dataset we have processed. Some warnings are shown in the dropdown list: warnings about metals, stereo, enol presence, etc.