On the Accuracy of Chemical Structures Found on the Internet
Andrew D. Fant¹, Eugene Muratov¹, Denis Fourches¹, David Sharpe², Antony J. Williams², and Alexander Tropsha¹
¹ Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill
² Royal Society of Chemistry

Figure 1: Which structure of the top-selling anti-glaucoma drug dorzolamide is correct? (ChemSpider ID 4447604 vs. ChemSpider ID 23499154)

Motivation
• It is axiomatic that data stored in chemical databases must be accurate, yet it has been reported that the error rate in freely accessible public databases may exceed 8%.¹ A recent example comes from the NCGC (National Chemical Genomics Center) pharmaceutical collection (Figure 2).²
• When building computational models of chemical properties, one wrong structure in twenty is enough to reduce the reliability and prediction performance of the model.³
• Chemical data curation is labor-intensive and perhaps unexciting, but it is critical; it should be recognized and supported as an inseparable component of cheminformatics research.

Figure 2: “Neomycin” – First six structures retrieved from the NCGC browser

Study Design
• Select and curate a list of the 200 top-selling drugs (as of 2006, from Wikipedia).
• Distribute the list to four independent groups of cheminformaticians and ask each group to generate the structures of the drugs using their preferred methods.
      • Royal Society of Chemistry (RSC)
              • Manual Web Search
      • University of North Carolina (UNC)
              • Manual Web Search
      • AstraZeneca (AZ)
              • Automated Search of Pre-curated Internal Source⁴
      • Institut de Recerca Hospital del Mar/Chemotargets S.L. (IMIM/CT)
              • Automated Internet Search
• Compare the results and discuss any discrepancies until agreement on the correct structure is reached.
• Once a master list is established, compare those structures to individual public chemical structure sources.

Methods
• An initial master set of 151 (out of 200) names was generated by one author.
• Each team was required to return the following information about each compound: systematic name, MOL-formatted record, and JPEG/PNG/GIF image.
• The following search workflows were employed:
      • The UNC workflow (Figure 3) was based entirely on open Internet data repositories and included some manual reentry of structures from PDF sources.
      • The RSC workflow (Figure 4) was more iterative in the early stages, and included redistribution-restricted sources in some cases.
      • The other two workflows used (AZ and IMIM/CT) were more highly automated and are not described further in the current work.
• InChI keys were calculated from the returned molecular structures and compared. Discrepancies in structures were discussed by participants, and a consensus was reached on which structure for each compound was supported by the available evidence, leading to the final master list.

Figure 3: UNC workflow – Name to structure resolution

Figure 4: RSC workflow – Name to structure resolution

Results
• Out of 151 total compounds, all 4 groups reported a structure identical to that on the initial master list for 113 compounds (74.8%).
• No compound was incorrectly reported by all 4 groups; no group achieved 100% accuracy (Figure 5).
• Differing results between the curated and unsupervised structure determination methods are highly significant by Fisher’s Exact Test.
• Structures from the consensus master list were compared (as InChI keys) to hits on name searches against several well-known chemical structure repositories. The number of correct structures and total number of hits are summarized in Figure 7.
• Incorrect structures in ChemSpider were corrected when found, but not counted as correct for this analysis.

Figure 5: Relative accuracy of groups against final master list

Figure 6: Examples of problematic structures and sources of disagreement
• Tautomeric Forms: Vardenafil
• Pro-drug Forms: Olmesartan
• Chiral Sulphoxides: Esomeprazole (✔ vs. ✗; panel labeled RSC, AZ & IMIM/CT)
• Wrong Chirality: Pravastatin (✔ vs. ✗; panel labeled UNC)
• Just Plain Wrong: Paclitaxel (✔ vs. ✗; panel labeled UNC & IMIM/CT)

Figure 7: Accuracy of results from public structure repositories

Conclusions
• Identifying correct chemical structures from compound names utilizing publicly available resources on the Internet is possible, but not trivial.
• Success requires careful comparison of multiple resources. No single source is correct in all cases.
• Automated Internet queries are still significantly less accurate than manually guided searches.
• InChI strings and keys are an improvement in chemical data handling, but the current standard keys are not perfect for large-scale comparisons.
• We believe that the adoption of the MIABE (Minimum Information About Bioactive Entity) standard⁵ as part of the peer-reviewed literature publication process could improve the quality of public structural information by eliminating the manual re-entry of structures from the primary literature that is currently required in most cases.
• It is not enough for a database to return the correct structure from a name query. It should also minimize (or, better, eliminate) the incorrect and/or auxiliary answers that are returned along with the correct one.

References
1    Young, D.; Martin, T.; Venkatapathy, R.; Harten, P. QSAR Comb Sci 2008, 27, 1337–1345.
2    Williams, A. J.; Ekins, S.; Tkachenko, V. Drug Discov Today 2012, 1–17.
3    Fourches, D.; Muratov, E.; Tropsha, A. J Chem Inf Model 2010, 50, 1189–1204.
4    Muresan, S. et al. Drug Discov Today 2011, 16, 1019–1030.
5    Orchard, S. et al. Nat Rev Drug Discov 2011, 10, 661–669.

Acknowledgements
The authors would like to thank Ricard Garcia (Chemotargets S.L.), Jordi Mestres (Barcelona IMIM), Sorel Muresan and Christopher Southan (AstraZeneca), and Andrey Erin (ACD/Labs) for their participation in the search for Internet drug structures. Phyllis Pugh provided workflow graphics and statistical consulting. We acknowledge software licenses donated by OpenEye Scientific Software, ChemAxon, and ACD/Labs that were used for portions of the data collection and analysis. AT acknowledges financial support from NIH (grant GM66940) and EPA (grant RD 83499901).
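
Illustrative sketch (InChI key comparison). The Methods describe comparing the InChI keys of the structures returned by each group and discussing any discrepancies. The snippet below is a minimal sketch of how such a comparison might be automated, not the workflow the authors used. It relies only on the documented layout of a standard InChIKey (a 14-character connectivity block, a 10-character block covering the remaining layers plus flag and version characters, and a 1-character protonation indicator). The function names, the compound names, and the placeholder keys are all hypothetical. The code only flags compounds for review; it does not decide which structure is correct, since the poster resolves that by discussion.

```python
from collections import Counter


def split_inchikey(key):
    """Split an InChIKey into its three hyphen-separated blocks."""
    parts = key.strip().upper().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return parts


def classify(key_a, key_b):
    """Describe how two InChIKeys differ, block by block."""
    a, b = split_inchikey(key_a), split_inchikey(key_b)
    if a == b:
        return "identical"
    if a[0] != b[0]:
        return "different connectivity"
    if a[1] != b[1]:
        # Second block hashes the remaining layers (e.g. stereochemistry).
        return "same connectivity, different stereo/other layers"
    return "same structure, different protonation"


def flag_discrepancies(reports):
    """reports maps group -> {compound: InChIKey}.

    Returns {compound: Counter of reported keys} for every compound
    on which the groups did not all return the same key.
    """
    compounds = set()
    for per_group in reports.values():
        compounds.update(per_group)
    flagged = {}
    for name in sorted(compounds):
        keys = Counter(r[name] for r in reports.values() if name in r)
        if len(keys) > 1 or sum(keys.values()) < len(reports):
            flagged[name] = keys
    return flagged


# Placeholder keys (NOT real compounds), chosen only to show the layout.
KEY_A = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
KEY_A_STEREO = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"  # same skeleton, other stereo
KEY_B = "DDDDDDDDDDDDDD-EEEEEEEESA-N"

reports = {
    "RSC": {"drug_1": KEY_A, "drug_2": KEY_B},
    "UNC": {"drug_1": KEY_A, "drug_2": KEY_B},
    "AZ": {"drug_1": KEY_A_STEREO, "drug_2": KEY_B},
    "IMIM/CT": {"drug_1": KEY_A, "drug_2": KEY_B},
}

for compound, keys in flag_discrepancies(reports).items():
    print(compound, dict(keys))
print(classify(KEY_A, KEY_A_STEREO))
```

A mismatch confined to the second block is the pattern one would expect for the chirality problems listed in Figure 6, whereas a mismatch in the first block points to a different molecular skeleton altogether.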
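
Illustrative sketch (Fisher's exact test). The Results state that the difference between curated and unsupervised methods is highly significant by Fisher's Exact Test, but the poster text does not give the underlying 2x2 counts. The snippet below is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 table of correct versus incorrect structures; the function name and the example counts are hypothetical and are not taken from the study. The only figures reused from the poster are 113 and 151, to reproduce the reported 74.8% agreement.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    tol = 1e-9  # guards against floating-point ties
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + tol))


# Agreement figure reported in the Results: 113 of 151 compounds.
print(f"{113 / 151:.1%}")  # 74.8%

# HYPOTHETICAL counts (correct, incorrect) for a manually curated and an
# automated workflow over 151 compounds; not data from the poster.
curated = (148, 3)
automated = (130, 21)
p_value = fisher_exact_two_sided(*curated, *automated)
print(f"two-sided p = {p_value:.3g}")
```

For real analyses, an established statistics library would normally be preferred over a hand-rolled implementation; the version above is only meant to make the calculation explicit.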

 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 

On the Accuracy of Chemical Structures Found on the Internet

Andrew D. Fant¹, Eugene Muratov¹, Denis Fourches¹, David Sharpe², Antony J. Williams², and Alexander Tropsha¹
¹ Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill
² Royal Society of Chemistry

Figure 1: Which structure of the top-selling anti-glaucoma drug dorzolamide is correct? Panels: ChemSpider ID 4447604; ChemSpider ID 23499154.

Motivation
  • It is axiomatic that data stored in chemical databases must be accurate; yet it has been reported the error rate in freely-accessible public databases may exceed 8% [1]. A recent example comes from the NCGC (National Chemical Genomics Center) pharmaceutical collection (Figure 2) [2].
  • When building computational models of chemical properties, one wrong structure in twenty is enough to reduce the reliability and prediction performance of the model [3].
  • Chemical data curation is labor-intensive, perhaps unexciting but critical; but it should be recognized and supported as an inseparable component of cheminformatics research.

Figure 2: “Neomycin” – First six structures retrieved from the NCGC browser

Study Design
  • Select and curate a list of the top-200 selling drugs (as of 2006 from Wikipedia).
  • Distribute the list to four independent groups of cheminformaticians and ask each group to generate the structures of the drugs using their preferred methods.
      • Royal Society of Chemistry (RSC): Manual Web Search
      • University of North Carolina (UNC): Manual Web Search
      • AstraZeneca (AZ): Automated Search of Pre-curated Internal Source [4]
      • Institut de Recera Hospital del Mar/Chemotargets S.L. (IMIM/CT): Automated Internet Search
  • Compare the results and discuss any discrepancies until agreement on the correct structure is reached.
  • Once a master list is established, compare those structures to individual public chemical structure sources.

Methods
  • An initial master set of 151 (out of 200) names was generated by one author.
  • Each team was required to return the following information about each compound: systematic name, MOL-formatted record, and JPEG/PNG/GIF image.
  • The following search workflows were employed:
      • The UNC workflow (Figure 3) was based entirely on open Internet data repositories and included some manual reentry of structures from PDF sources.
      • The RSC workflow (Figure 4) was more iterative in the early stages, and included redistribution-restricted sources in some cases.
      • The other two workflows (AZ and IMIM/CT) utilized were more highly automated and are not described further in the current work.
  • InChI keys were calculated from the returned molecular structures and compared. Discrepancies in structures were discussed by participants and a consensus was reached on which structure for the compound was supported by the evidence available, leading to the final master list.

Figure 3: UNC workflow – Name to structure resolution
Figure 4: RSC workflow – Name to structure resolution

Results
  • Out of 151 total compounds, all 4 groups reported a structure identical to that on the initial master list for 113 compounds (74.8%).
  • No compound was incorrectly reported by all 4 groups; no group achieved 100% accuracy (Figure 5).
  • Differing results between the curated and unsupervised structure determination methods are highly significant by Fisher’s Exact Test.
  • Structures from the consensus master list were compared (as InChI keys) to hits on name searches against several well-known chemical structure repositories. The number of correct structures and total number of hits are summarized in Figure 7.
  • Incorrect structures in ChemSpider were corrected when found, but not counted as correct for this analysis.

Figure 5: Relative accuracy of groups against final master list
Figure 6: Examples of problematic structures and sources of disagreement
  • Tautomeric Forms: Vardenafil
  • Pro-drug Forms: Olmesartan
  • Chiral Sulphoxides: Esomeprazole
  • Wrong Chirality: Pravastatin
  • Just Plain Wrong: Paclitaxel
  • Group labels shown on the panels: ✔ RSC, AZ & IMIM/CT ✗; ✔ UNC ✗; ✔ UNC & IMIM/CT ✗
Figure 7: Accuracy of results from public structure repositories

Conclusions
  • Identifying correct chemical structures from compound names utilizing publicly available resources on the Internet is possible, but not trivial.
  • Success requires careful comparison of multiple resources. No single source is correct in all cases.
  • Automated Internet queries are still significantly less accurate than manually guided searches.
  • InChI strings and keys are an improvement in chemical data handling, but the current standard keys are not perfect for large-scale comparisons.
  • We believe that the adoption of the MIABE (Minimum Information About Bioactive Entity) standard [5] as part of the peer-reviewed literature publication process could improve the quality of public structural information by eliminating manual re-entry of structures from the primary literature as is currently required in most cases.
  • It is insufficient for a database to return the correct structure from a name query. It also should minimize (better, eliminate) the number of incorrect and/or auxiliary answers that are returned along with the correct one.

References
  1. Young, D.; Martin, T.; Venkatapathy, R.; Harten, P. QSAR Comb Sci 2008, 27, 1337–1345.
  2. Williams, A. J.; Ekins, S.; Tkachenko, V. Drug Discov Today 2012, 1–17.
  3. Fourches, D.; Muratov, E.; Tropsha, A. J Chem Inf Model 2010, 50, 1189–1204.
  4. Muresan, S. et al. Drug Discov Today 2011, 16, 1019–1030.
  5. Orchard, S. et al. Nat Rev Drug Discov 2011, 10, 661–669.

Acknowledgements
The authors would like to thank Ricard Garcia (Chemotargets S.L), Jordi Mestres (Barcelona IMIM), Sorel Muresan and Christopher Southan (AstraZeneca), and Andrey Erin (ACD/Labs) for their participation in the search for Internet drug structures. Phyllis Pugh provided workflow graphics and statistical consulting. We acknowledge software licenses donated by OpenEye Scientific Software, ChemAxon, and ACD/Labs that were used for portions of the data collection and analysis. AT acknowledges financial support from NIH (grant GM66940) and EPA (grant RD 83499901).
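
Illustrative sketch (not part of the poster): the key-based comparison described in the Methods and Results can be approximated in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' actual pipeline. The compound names, the InChIKey-like strings, the group names, and the 2x2 counts passed to the Fisher test are all made-up placeholders; the helper names (`group_accuracy`, `unanimous_matches`, `disagreements`, `fisher_exact_two_sided`) are hypothetical. Only the 113-of-151 figure comes from the poster. Keys are assumed to have been computed beforehand with a cheminformatics toolkit.

```python
from collections import Counter
from math import comb

# Placeholder InChIKey-like strings: NOT real compounds and NOT study data.
A = "AAAAAAAAAAAAAA-AAAAAAAAAA-N"
B = "BBBBBBBBBBBBBB-BBBBBBBBBB-N"
C = "CCCCCCCCCCCCCC-CCCCCCCCCC-N"
B_ALT = "BBBBBBBBBBBBBB-ZZZZZZZZZZ-N"  # e.g. same skeleton, different stereo layer
C_ALT = "XXXXXXXXXXXXXX-XXXXXXXXXX-N"  # e.g. a different structure altogether

MASTER = {"drug_a": A, "drug_b": B, "drug_c": C}

# Hypothetical submissions from four groups (name -> key).
SUBMISSIONS = {
    "group_1": {"drug_a": A, "drug_b": B, "drug_c": C},
    "group_2": {"drug_a": A, "drug_b": B_ALT, "drug_c": C},
    "group_3": {"drug_a": A, "drug_b": B, "drug_c": C},
    "group_4": {"drug_a": A, "drug_b": B, "drug_c": C_ALT},
}


def group_accuracy(master, submissions):
    """Fraction of master-list compounds each group matched exactly by key."""
    return {
        group: sum(keys.get(name) == key for name, key in master.items()) / len(master)
        for group, keys in submissions.items()
    }


def unanimous_matches(master, submissions):
    """Names for which every group returned the master-list key."""
    return [
        name for name, key in master.items()
        if all(keys.get(name) == key for keys in submissions.values())
    ]


def disagreements(submissions):
    """Names with more than one distinct key: candidates for manual discussion."""
    names = set().union(*(keys.keys() for keys in submissions.values()))
    flagged = {}
    for name in sorted(names):
        counts = Counter(keys[name] for keys in submissions.values() if name in keys)
        if len(counts) > 1:
            flagged[name] = counts
    return flagged


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    print(group_accuracy(MASTER, SUBMISSIONS))
    print(unanimous_matches(MASTER, SUBMISSIONS))
    print(sorted(disagreements(SUBMISSIONS)))
    # The one figure taken from the poster: 113 of 151 compounds.
    print(f"{113 / 151:.1%}")
    # Made-up counts (correct, incorrect) for two kinds of workflow.
    print(fisher_exact_two_sided(3, 1, 1, 3))
```

Exact string equality on full keys treats tautomers, pro-drug forms, and stereoisomers as mismatches, which is the kind of disagreement Figure 6 illustrates; a looser comparison (for example, on the first key block only) would be a separate design choice with its own trade-offs.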