Improving Online Chemistry – One Structure at a Time Antony Williams AstraZeneca, February 10th, 2012
We Have… Too Much Data!!!
It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
Pharma Information… and the web… Literature Patents News Pipeline SAR CSRs Safety In vivo Etc
The World of Online Chemistry Property databases Compound aggregators Screening assay results Scientific publications  Encyclopedic articles (Wikipedia) Metabolic pathway databases ADME/Tox data – eTOX for example Blogs/Wikis and Open Notebook Science Contributing Open Source code to projects
PubChem
ChEMBL
Collaborative Knowledge Management
e-Science and Primary Data How much data generated in a lab, that  COULD  go public, is lost forever? Public Domain reference databases of value? Syntheses Properties Spectra CIFs Images Much of chemistry is chemical structure-based – where and how could we host these data?
RSC’s ChemSpider
We Want to Answer Questions Questions a chemist might ask… What is the melting point of n-heptanol?  What is the chemical structure of Xanax? Chemically, what is phenolphthalein? What are the stereocenters of cholesterol? Where can I find publications about xylene? What are the different trade names for Ketoconazole? What is the NMR spectrum of Aspirin? What are the safety handling issues for Thymol Blue?
Available Information… Linked to vendors, safety data, toxicity, metabolism
Available Information…
Crowdsourced “Annotations” Users can add  Descriptions/Syntheses/Commentaries Links to PubMed articles Links to articles via DOIs  Add spectral data Add Crystallographic Information Files Add photos Add MP3 files Add Videos
 
Spectra
Data on the Web
 
Chemistry Data online is messy We have inherited errors All public compound databases, including ours, have errors “Incorrect” structures – assertions, timelines etc “Incorrect” names associated with structures Properties Links Publications ENORMOUS CHALLENGE
What could create change? Harvard Business Review (2010) “One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.” Consider drug structures ONLY…
The Structure of Vitamin K?
MeSH A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K1 (phytomenadione) derived from plants, VITAMIN K2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K3 (menadione). Vitamin K3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
The Structure of Vitamin K1?
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
 
 
ChEBI – Manual Curation
 
 
 
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione” Variants of systematic names on PubChem 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl
What’s Methane?
What  ELSE  is Methane???
 
EPA’s DailyMed
With Great Fanfare…
NPC Browser  http://tripod.nih.gov/npc/
 
Openness and Quality Issues Williams and Ekins, DDT, 16: 747-750 (2011) Science Translational Medicine 2011
Content is King and Quality Costs Curated Chemistry “content” is expensive to create Patent searching Structures and properties Drug databases Literature databases Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information 104 years of content >50 million substances Proprietary platform
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
Feedback from Steve Ritter “Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.” “It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
Public Domain Databases Our  databases are a mess… Non-curated databases are proliferating errors We source and deposit data between databases Original sources of errors hard to determine Curation is time-consuming and challenging
Stop Whining – Fix it
Crowdsourced Curation Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Search “Vitamin H”
“Curate” Identifiers
Standards: Structure Standardization
What needs to happen? Standards Standardization of structures  ChEBI/PubChem sharing  InChI adoption
The InChI Identifier
Multiple Layers
InChI Strings Hash to InChIKeys
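
To make the "multiple layers" and "hash to InChIKey" slides concrete, here is a minimal Python sketch that splits a standard InChI string into its layers and an InChIKey into its blocks. It uses aspirin as the example. The function names (`split_inchi`, `split_inchikey`) are illustrative, not part of any InChI toolkit, and the sketch does not compute the hash itself, which needs the InChI software.

```python
# Aspirin's standard InChI and InChIKey, used here as a worked example.
ASPIRIN_INCHI = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

# Single-letter prefixes that introduce the layers after the formula.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi):
    """Return a list of (layer name, value) pairs in the order they appear."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = [("version", parts[0])]
    for index, part in enumerate(parts[1:]):
        if not part:
            continue
        # The formula, when present, comes first and does not start
        # with one of the lowercase layer prefixes.
        if index == 0 and part[0] not in LAYER_NAMES:
            layers.append(("formula", part))
        else:
            name = LAYER_NAMES.get(part[0], "other:" + part[0])
            layers.append((name, part[1:]))
    return layers


def split_inchikey(key):
    """Split a 27-character InChIKey into its documented blocks."""
    first, second, protonation = key.split("-")
    if len(first) != 14 or len(second) != 10 or len(protonation) != 1:
        raise ValueError("unexpected InChIKey shape")
    return {
        "skeleton": first,             # hash of the core constitution
        "stereo_isotope": second[:8],  # hash of the remaining layers
        "standard": second[8] == "S",  # "S" standard, "N" non-standard
        "version": second[9],          # "A" for InChI version 1
        "protonation": protonation,    # "N" means neutral
    }


if __name__ == "__main__":
    for name, value in split_inchi(ASPIRIN_INCHI):
        print(name, "=", value)
    print(split_inchikey(ASPIRIN_KEY))
```

The first block of the key is the part that stays the same across stereoisomers and isotopologues, which is what makes a "skeleton" search possible.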
Vancomycin –  Search the Internet
Vancomycin Search Molecular SKELETON Search Full Molecule
Full  Skeleton  Search: 104 Hits
Full  Molecule  Search: 4 Hits
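
The gap between 104 skeleton hits and 4 full-molecule hits can be illustrated with a small Python sketch, assuming both searches are run on InChIKeys: a full-molecule search matches the whole key, while a skeleton search matches only the first 14-character block. The record list and function names are hypothetical, and the alanine keys are quoted for illustration only and should be verified before reuse.

```python
# Hypothetical record set: (label, InChIKey).
RECORDS = [
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHZZZYSA-N"),
    ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("alanine (no stereo)", "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"),
    ("aspirin", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
]


def full_molecule_search(records, key):
    """Match records whose complete InChIKey equals the query key."""
    return [label for label, k in records if k == key]


def skeleton_search(records, key):
    """Match records that share the first (connectivity) block of the key."""
    skeleton = key.split("-")[0]
    return [label for label, k in records if k.split("-")[0] == skeleton]


if __name__ == "__main__":
    query = "QNAYBMKLOCPYGJ-REOHZZZYSA-N"
    print("full molecule:", full_molecule_search(RECORDS, query))
    print("skeleton:", skeleton_search(RECORDS, query))
```

The skeleton search is deliberately forgiving: it also returns records where the stereochemistry was drawn differently or left out, which is often exactly where the errors are.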
Crowdsourcing Works >130 people have deposited data and participated in data curation Different level curators check each other More curators and depositors are encouraged!
What needs to happen? Standards Standardization of structures  ChEBI/PubChem sharing  InChI adoption Collaboration Stop reinventing the wheel Share data, share efforts and speed the process
Antony Williams vs Identifiers Passport ID Dad, Tony, others SSN Green Card License 5 email addresses ChemSpiderman (blog, Twitter account, Facebook, Friendfeed) OpenID…
Aspirin names and synonyms Text searches depend on correct association 335  suggested identifiers for Aspirin just on PubChem! Disambiguation  dictionaries are necessary, not just for authors!
 
The Final Search Strategy
All Those Names, One Structure
Curated Dictionaries Matter
Success Depends on Dictionaries
Validated Name-Structure Dictionaries Chemical name dictionaries are used for: Text-mining (publications, patents) Used to index PubMed and link to Google Patents Linking to other databases – think Biology! When structures are not available drug names link Searching the web Names link to structures link to InChIs
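
A name-structure dictionary used for text-mining can be sketched in a few lines of Python. This is a toy example under stated assumptions: the three-entry dictionary is hand-made for illustration, and `normalize` and `find_structures` are hypothetical helper names, not ChemSpider functions. A real dictionary would hold validated synonyms for many compounds.

```python
import re

# Hypothetical, hand-made dictionary: lowercased name -> InChIKey (aspirin).
NAME_TO_KEY = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "acetylsalicylic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "2-acetoxybenzoic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
}


def normalize(text):
    """Lowercase and collapse runs of whitespace so lookups are consistent."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_structures(text, dictionary=NAME_TO_KEY):
    """Return {InChIKey: sorted list of dictionary names found in text}."""
    lowered = normalize(text)
    hits = {}
    for name in dictionary:
        # Require a boundary on both sides so a name does not match
        # inside a longer word or a longer hyphenated chemical name.
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if re.search(pattern, lowered):
            hits.setdefault(dictionary[name], []).append(name)
    return {key: sorted(names) for key, names in hits.items()}


if __name__ == "__main__":
    sample = "Patients received Aspirin; acetylsalicylic  acid was also studied."
    print(find_structures(sample))
```

Every link this produces is only as good as the dictionary entry behind it, which is the point of the slide: a wrong name-structure pair silently mislinks every article and patent that mentions the name.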
I want to know about “Vincristine” If all algorithms work then everything on the page is correct by default except the name-structure relationship!
Vincristine: Identifiers and Properties
Vincristine: Patents Linked by  Name
Vincristine: Articles Linked by  Name
ChemSpider Everywhere: What do computers want? Web services
Mass Spec Analysis
ChemSpider Interface
 
Tinuvin 328
Position sorted by references
Position 1 only
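
The mass spec slides suggest a lookup in which candidates matching an observed mass are sorted by their number of references, so the best-documented compound tends to land at position 1. A minimal Python sketch of that ranking follows. The `Candidate` class, the 5 ppm default tolerance, and the candidate values are all placeholders invented for illustration, not real compound data or the ChemSpider implementation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    monoisotopic_mass: float
    reference_count: int


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def rank_candidates(candidates, observed_mass, tolerance_ppm=5.0):
    """Keep candidates within the mass tolerance, most-referenced first."""
    matches = [
        c for c in candidates
        if abs(ppm_error(observed_mass, c.monoisotopic_mass)) <= tolerance_ppm
    ]
    return sorted(matches, key=lambda c: c.reference_count, reverse=True)


# Placeholder candidates; masses and counts are made up for the example.
CANDIDATES = [
    Candidate("compound A", 300.0000, 12),
    Candidate("compound B", 300.0010, 250),
    Candidate("compound C", 300.0200, 900),
]

if __name__ == "__main__":
    for position, c in enumerate(rank_candidates(CANDIDATES, 300.0005), start=1):
        print(position, c.name, c.reference_count)
```

Compound C has the most references but falls outside the mass window, so it never enters the ranking; the reference count only orders candidates that already fit the measurement.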
Web Services
Web Services Open Up Collaboration Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup Many academic sites integrating directly – metabonomics, name lookup, semantic markup Mobile app integration Commercial structure drawing packages
ChemSpider Everywhere: Embed
ChemSpider Everywhere: Spectral Game
ChemSpider Everywhere: Crowdsourced Curation of Spectra
ChemSpider Everywhere: ChemMobi
Structure Database Lookup
ChemSpider Resources for Chemistry
It is so difficult to navigate… What’s the structure? Are they in our file? What’s similar? What’s the target? Pharmacology data? Known Pathways? Working On Now? Connections to disease? Expressed in right cell type? Competitors? IP?
Open PHACTS Project Develop a set of robust standards… Implement the standards in a semantic integration hub Deliver services to support drug discovery programs in pharma and public domain 22 partners, 8 pharmaceutical companies, 3 biotechs 36-month project Guiding principle is open access, open usage, open source: key to standards adoption
 
The Future Commercial Software Pre-competitive Data Open Science Open Data Publishers Educators Open Databases Chemical Vendors Small organic molecules Undefined materials Organometallics Nanomaterials Polymers Minerals Particle bound Links to Biologicals Internet Data
The Future of Chemistry on the Web? Public compound databases federate & build a linked environment of validated data! Data validation needs are not ignored Publishers layer on information to make publications discoverable Public-Private databases can be linked Open Data proliferate The “Semantic Web” in action
Acknowledgments  The ChemSpider team Our data providers, depositors, collaborators and curators Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
Thank you Email: williamsa@rsc.org Twitter: ChemConnector Blog: www.chemspider.com/blog Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
