Chemical Information in the Big Data
Era: Data Quality, Data Integration
and Building a Profile for Yourself as
an Online Scientist
Antony Williams
ORCID ID: 0000-0002-2668-4821
My background…
• From 1985-present day
• PhD’ed in the UK
• Canadian Government lab as postdoc
• Academia as NMR Facility Manager
• Fortune 500 Company as Technology Leader
• Start-up – product manager and CSO
• Consultant – chemistry informatics industry
• Entrepreneur – Created “ChemSpider”
• Publisher - Royal Society of Chemistry
• EPA-NCCT as cheminformatics expert
Of interest to faculty?
CASE Systems – Natural Products
[Figure: 13C NMR shift assignments (14.40 to 153.00 ppm, many marked "fb") on a candidate structure, followed by three ranked candidate structures with their 13C deviation scores]
Candidate 1: dA(13C) 1.719, dN(13C) 2.016, dI(13C) 2.313, max_dA(13C) 8.580
Candidate 2: dA(13C) 3.534, dN(13C) 4.812, dI(13C) 3.684, max_dA(13C) 13.280
Candidate 3: dA(13C) 4.010, dN(13C) 4.662, dI(13C) 3.610, max_dA(13C) 12.230
Maybe you know this???
Computational Analysis at NCCT
Public Access and Systems
My Hopes for Today
• Encourage you in the “era of participation”
• Provide an overview of tools available
• Share some stories, statistics and strategies
• Encourage you to “share for the sake of science”
OUTCOMES
• You will claim an ORCID
• You will take responsibility for your online profile
• You will invest >1 hour per week
I would tell a chemistry joke…
But all of the good ones…
An ambitious idea….
• Let’s map together all online chemistry data
and build systems to integrate it
• Heck, let’s integrate chemistry and biology
data and add in disease data too if we can
• Let’s extract property data and model it and
see if we can extract new relationships –
quantitative and qualitative
• Let’s make it all available on the web…for
free
What about this….
• We’re going to map the world
• We’re going to take photos of as many
places as we can and link them together
• We’ll let people annotate and curate the map
• Then let’s make it available free on the web
• We’ll make it available for decision making
• Put it on Mobile Devices, give it away…
Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
• ~35 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our journals
and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
ChemSpider
ChemSpider
Experimental/Predicted Properties
Literature references
Patent references
RSC Books
Google Books
Organic Chemistry is hard…
…it has alkynes of trouble
Flavors of Chemistry
Molfiles
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
Molfiles
• Molfiles are the primary exchange format
between structure drawing packages
• Can be different between different drawing
packages
• Most commonly carry X,Y coordinates for
layout
• Can support polymers, organometallics, etc.
• Can carry 3D coordinates
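The molfile shown on the previous slide can be read with a few lines of code. The following is a minimal sketch, not a full CTfile reader: it assumes the whitespace-separated form printed on the slide (real V2000 files use fixed-width columns and three header lines), and the function name `parse_v2000` is made up for this illustration.

```python
from collections import Counter

# Counts line, atom block and bond block exactly as printed on the slide.
MOLFILE_BODY = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_v2000(text):
    """Return (atoms, bonds) from a whitespace-separated V2000 body.

    Assumption: fields are separated by whitespace, as on the slide.
    Real files are fixed-width, so large counts can run together.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order, stereo = (int(v) for v in ln.split()[:4])
        bonds.append((first, second, order, stereo))
    return atoms, bonds


atoms, bonds = parse_v2000(MOLFILE_BODY)
print(Counter(symbol for symbol, *_ in atoms))
print([b for b in bonds if b[3] != 0])  # bonds that carry a stereo flag
```

Only the heavy atoms are listed in this file, and the one bond with a non-zero fourth field is where the stereo information lives, which is part of why the same molecule can look different coming out of different drawing packages.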
Stereo
Tautomeric forms
Chemists are good…
The InChI Identifier
InChI
• SINGLE code base managed by IUPAC –
integrated into drawing packages. No
variability between implementations, unlike SMILES
• InChI Strings can be reversed to structures –
same problem as with SMILES – no layout
• Adopted by the community (databases,
blogs, Wikipedia) – good for searching the
internet
Multiple Layers
Tautomers
Stereo
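The layered structure of an InChI string can be sketched in code. This is an illustration only: the helper name `split_inchi` is hypothetical, and the example string is quoted from memory as the standard InChI of L-alanine, so check it against a database before relying on it. Layers are separated by "/" and, apart from the version and the formula, each starts with a one-letter prefix (c for connectivity, h for hydrogens, t/m/s for stereo).

```python
def split_inchi(inchi):
    """Split an InChI string into (layer_name, value) pairs.

    A list of pairs is returned rather than a dict because some
    non-standard InChIs repeat a prefix in later layers.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *rest = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    # The formula layer is the only one without a lowercase prefix letter.
    if rest and rest[0] and not rest[0][0].islower():
        layers.append(("formula", rest.pop(0)))
    for part in rest:
        if part:
            layers.append((part[0], part[1:]))
    return layers


# Assumed example: L-alanine (illustrative, quoted from memory).
EXAMPLE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

for name, value in split_inchi(EXAMPLE):
    print(f"{name:8s} {value}")
```

Dropping the t, m and s layers from such a string leaves the stereo-free description, which is the idea behind searching with or without stereo.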
InChIStrings Hash to InChIKeys
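An InChIKey is a fixed-length, 27-character hash of the InChI string, which makes it practical for web searching. The sketch below only checks the format and splits the key into its blocks; it does not compute the hash. As described in the InChI technical documentation, the first 14 letters are derived from the connectivity, the next 8 from the remaining layers, followed by a standard/non-standard flag, a version letter and a protonation letter. The function name and the two example keys (quoted from memory for L-alanine and for alanine without stereo) are assumptions for illustration.

```python
import re

# 14 letters, hyphen, 8 letters + flag + version, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Validate the InChIKey format and return its blocks as a dict."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, other, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,          # hash of the connectivity
        "other_layers": other,         # hash of stereo, isotopes, etc.
        "standard": flag == "S",       # S = standard InChI, N = non-standard
        "version": version,
        "protonation": protonation,
    }


# Assumed example keys (illustrative, quoted from memory).
L_ALANINE = "QNAYBMKLOCPYGJ-REOHHBHSSA-N"
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

a = split_inchikey(L_ALANINE)
b = split_inchikey(ALANINE_NO_STEREO)
print(a["skeleton"] == b["skeleton"])          # same connectivity
print(a["other_layers"] == b["other_layers"])  # different stereo
```

Matching on the whole key corresponds to an exact search, while matching on the first block alone is one way to run a connectivity-only ("skeleton") search.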
Structure search the web
Exact Search
Skeleton Search
Data Quality/Standardization
• MANY structures meant to be something online
are MISREPRESENTED.
• Commonly you will have better success finding
information by name searches than structure –
with many caveats of course…
• Validating chemical structure representations is
laborious work – and it’s shocking to review
data…
Data Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
Data quality is a known issue
Data quality is a known issue
Patent data in public databases
Patent data in public databases
You just can’t trust atoms!
Depiction vs Accurate
Representation
Depiction vs Accurate
Representation
What is the Structure of Vitamin K1?
Data Quality Issues and $$$$
Many Names, One Structure
But big and often noisy
Text Mining on IUPAC Names
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5
ml ) and benzene ( 50 ml ) were charged into a glass reaction
vessel equipped with a mechanical stirrer , thermometer and
reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were
stripped from the reaction mixture under reduced pressure to
yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-
trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
Name to Structure Conversion
Name to Structure Conversion
What could we get?
PhysChem first: Melting Points
• Melting/sublimation/decomposition points
extracted for 287,635 distinct compounds from
1976-2014 USPTO patent applications/grants
• Sanity checks used to flag dubious values, e.g.
"130-4°C" probably means 130-134°C
• Non-melting outcomes recorded e.g. mp 147-
150°C. (subl.)
• What models could be built?
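The kind of sanity check described above can be sketched as follows. This is not the pipeline used for the patent extraction; it is a small illustration that assumes an abbreviated upper bound such as "130-4" stands for 130-134, records sublimation or decomposition when the text says so, and ignores negative temperatures. The function name and flag names are made up.

```python
import re

# A number, an optional "-number" upper bound, then a degrees-Celsius sign.
MP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:[-–]\s*(\d+(?:\.\d+)?))?\s*°\s*C")


def parse_melting_point(text):
    """Return low/high bounds, flags and outcome, or None if nothing matches."""
    match = MP_RE.search(text)
    if match is None:
        return None
    low_s, high_s = match.groups()
    low = float(low_s)
    high = low if high_s is None else float(high_s)
    flags = []

    # Assumption: "130-4" abbreviates "130-134" (the trailing digits are replaced).
    if (high_s is not None and high < low and low_s.isdigit()
            and high_s.isdigit() and len(high_s) < len(low_s)):
        high = float(low_s[:-len(high_s)] + high_s)
        flags.append("expanded")
    if high < low:
        flags.append("dubious")

    lowered = text.lower()
    if "subl" in lowered:
        outcome = "sublimation"
    elif "dec" in lowered:
        outcome = "decomposition"
    else:
        outcome = "melting"
    return {"low": low, "high": high, "flags": flags, "outcome": outcome}


for sample in ["mp 130-4°C", "mp 147-150°C. (subl.)", "mp 250-210°C"]:
    print(sample, "->", parse_melting_point(sample))
```

Values flagged "expanded" or "dubious" are good candidates for manual review before they go anywhere near a training set.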
Modeling “BIG data”
• Melting point models developed with ca. 300k compounds
• Required 34 GB memory and about 400 MB disk space (zipped)
• Matrix with 2×10^11 entries (300k molecules x 700k descriptors)
• >12k core-hours (>600 CPU-days) for parameter optimization
• Parallelized on > 600 cores with up to 24 cores per task
• Consensus model as average of individual models
• Accuracy of consensus model is ~33.6 °C for drug-like region
compounds
• Models publicly available at http://ochem.eu
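Two of the points above are easy to check in code: the size of the descriptor matrix, and what "consensus as average of individual models" means. The sketch below uses made-up predictions for three hypothetical models and two compounds; it is not the OCHEM implementation, and RMSE is used only as one possible accuracy measure, since the slide does not say which metric the ~33.6 °C refers to.

```python
from math import sqrt
from statistics import mean

# Matrix size quoted on the slide: 300k molecules x 700k descriptors.
entries = 300_000 * 700_000
print(f"{entries:.1e} matrix entries")


def consensus(predictions_per_model):
    """Average the predictions of several models, compound by compound."""
    return [mean(column) for column in zip(*predictions_per_model)]


def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed values."""
    return sqrt(mean((p - o) ** 2 for p, o in zip(predicted, observed)))


# Hypothetical melting point predictions (°C): one row per model.
MODEL_PREDICTIONS = [
    [120.0, 80.0],
    [130.0, 70.0],
    [125.0, 90.0],
]
OBSERVED = [125.0, 84.0]  # hypothetical experimental values

averaged = consensus(MODEL_PREDICTIONS)
print(averaged)
print(round(rmse(averaged, OBSERVED), 2))
```

Averaging tends to cancel the individual models' uncorrelated errors, which is the usual motivation for reporting a consensus rather than the single best model.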
A Recent Talk
http://www.slideshare.net/AntonyWilliams/
ESI – Text Spectra
ChemSpider ID 24528095 1H NMR
We want to find text spectra?
• We can find and index text spectra: 13C NMR
(CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,
benzylic methane), 30.77 (CH, benzylic
methane), 66.12 (CH2), 68.49 (CH2), 117.72,
118.19, 120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42, 123.60,
134.69, 139.23, 147.21, 147.61, 149.41,
152.62, 154.88 (ArC)
• What would be better are spectral figures – and
include assignments where possible!
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t,
1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz,
C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
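Indexing a text spectrum like the 13C example above comes down to pulling the nucleus, solvent, frequency and shift list out of the string. Here is a minimal sketch with regular expressions. It assumes the "nucleus NMR (solvent, frequency MHz): δ = ..." layout used in that example, keeps the text exactly as quoted, and does not try to handle 1H multiplets or coupling constants. The function name is hypothetical.

```python
import re

# The 13C text spectrum quoted on the slide, unchanged.
SPECTRUM = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

HEADER_RE = re.compile(r"(\d+[A-Z][a-z]?) NMR \((\w+), (\d+) MHz\)")


def parse_text_spectrum(text):
    """Extract nucleus, solvent, frequency (MHz) and the list of shifts."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("no NMR header found")
    nucleus, solvent, frequency = header.groups()
    # Shifts are the decimal numbers after "δ ="; atom counts such as CH3 have no decimal point.
    _, _, body = text.partition("δ =")
    shifts = [float(value) for value in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": nucleus,
        "solvent": solvent,
        "frequency_mhz": int(frequency),
        "shifts": shifts,
    }


spectrum = parse_text_spectrum(SPECTRUM)
print(spectrum["nucleus"], spectrum["solvent"], spectrum["frequency_mhz"])
print(len(spectrum["shifts"]), min(spectrum["shifts"]), max(spectrum["shifts"]))
```

A peak list recovered this way can be searched and compared, but it still carries far less information than the spectral figure or the raw data would.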
NMR Spectra
• 2,316,005 distinct spectra in 2001-2015 USPTO
Nucleus      Count
H        1,993,384
C          173,970
Unknown    107,439
F           22,158
P           16,333
B              980
Si             715
Pt             275
N              170
V              101
ESI Data also contains figures
“Where is the real data please?”
FIGURE
DATA
Data added to ChemSpider
Visibility Means Discoverability
• Q: Does a Social Profile as a scientist matter?
• You are visible, when you share your skills,
experience and research activities by:
• Establishing a public profile
• Getting on the record
• Collaborative Science
• Demonstrating a skill set
• Measured using “alternative metrics”
• Contributing to the public peer review process
• There are many ways to become “visible”
Scientists measured by Impact
How to Measure Impact
Your Research Outputs?
• Research datasets
• Scientific software
• Publications – peer-reviewed and many others
• Posters and presentations at conferences
• Electronic theses and dissertations
• Performances in film and audio
• Lectures, online classes and teaching activities
• What else???
• The possibilities to share are endless
Open Researcher & Contributor ID
Here’s why they are useful…
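One practical reason ORCID iDs are useful is that they are machine-checkable: the last character is a check digit computed from the first 15 digits with the ISO 7064 MOD 11-2 scheme, as documented by ORCID. A short sketch follows, using the iD from the title slide; the function names are made up for this example.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, the digits and the final check character of an ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


print(is_valid_orcid("0000-0002-2668-4821"))  # the iD on the title slide
print(is_valid_orcid("0000-0002-2668-4822"))  # last digit changed
```

A check like this catches most typing errors before an iD is stored against a paper, a dataset or a profile.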
Wonderful Profile…
CONTRIBUTE to the
community
• Share your expertise in the new world of open
• Share your Figures, share your data
• Contribute to Wikis – Wikipedia and others
• Participate in Open Notebook Science
• Build tools and platforms to support chemists
• Curate, use and comment on data
• Get engaged on blogs and discussions
Oxidation by Sodium Hydride?
The Blogosphere Analyzes…
The Blogosphere Analyzes…
The new world of micropublishing
ChemSpider SyntheticPages
Micropublishing with Peer Review
(a chemical synthesis blog?)
Multi-Step Synthesis
Interactive Data
You should be LinkedIn
• LinkedIn for “professionals”
• Expose work history, skills, your professional
interests, your memberships – your profile
WILL be watched!
• Who you are linked to says a lot about who
you are. Get Linked to people in your
domain.
• Professional relationships rather than just
friendships. Facebook-it for friends
LinkedIn
http://www.linkedin.com/in/AntonyWilliams
My Career Captured…
And “Endorsements”
Highlight “Projects”
Manage Articles Here Too.
…and presentations
My Google Scholar Profile
http://scholar.google.com/citations?user=O2L8nh4AAAAJ
“I don’t have any publications”
• This is YOUR choice! Conference Abstracts…
• You produce reports, presentations and
posters during your studies – share them!
Slideshare – Highly Accessed
Slideshare – EXPANDED Audience
Fast Network Communication
Slideshare – NOT Just Slides
ResearchGate
https://www.researchgate.net/profile/Antony_Williams
ResearchGate
ResearchGate
I have a set of statistics & profiles
• My Blog: www.chemconnector.com
• Twitter: http://twitter.com/ChemConnector
• ORCID: http://orcid.org/0000-0002-2668-4821
• Amazon Author Page: Follow Link to Author Page
• My Klout: http://www.klout.com/#/ChemConnector
• LinkedIn: http://www.linkedin.com/in/antonywilliams
• SlideShare: http://www.slideshare.net/AntonyWilliams
• Google Scholar Citations Profile: Antony Williams Citations
• Wikipedia: http://en.wikipedia.org/wiki/Antony_John_Williams
The Power of Social Media
I recommend…
• Register for an ORCID ID – then use it
• Develop your LinkedIn profile
• Publish to Slideshare
• Track Google Scholar Citations (for now)
• Choose: ResearchGate or Academia.edu
• Set up an About.ME page to link everything
• Participate in building your profile
Thank you
Email: tony27587@gmail.com
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

Data integration and building a profile for yourself as an online scientist

  • 1. Chemical Information in the Big Data Era: Data Quality, Data Integration and Building a Profile for Yourself as an Online Scientist Antony Williams ORCID ID:0000-0002-2668-4821
  • 2. My background… • From 1985-present day • PhD’ed in the UK • Canadian Government lab as postdoc • Academia as NMR Facility Manager • Fortune 500 Company as Technology Leader • Start-up – product manager and CSO • Consultant – chemistry informatics industry • Entrepreneur – Created “ChemSpider” • Publisher - Royal Society of Chemistry • EPA-NCCT as cheminformatics expert
  • 3. Of interest to faculty?
  • 4. CASE Systems – Natural Products [Figure: 13C chemical shift table (14.40 to 153.00 ppm) and three numbered candidate structures with their 13C deviations. Candidate 1: dA(13C) 1.719, dN(13C) 2.016, dI(13C) 2.313, max_dA(13C) 8.580. Candidate 2: dA(13C) 3.534, dN(13C) 4.812, dI(13C) 3.684, max_dA(13C) 13.280. Candidate 3: dA(13C) 4.010, dN(13C) 4.662, dI(13C) 3.610, max_dA(13C) 12.230.]
  • 5. Maybe you know this???
  • 8. My Hopes for Today • Encourage you in the “era of participation” • Provide an overview of tools available • Share some stories, statistics and strategies • Encourage you to “share for the sake of science” OUTCOMES • You will claim an ORCiD • You take responsibility for your online profile • You will invest >1 hour per week
  • 9. I would tell a chemistry joke… But all of the good ones…
  • 10. An ambitious idea…. • Let’s map together all online chemistry data and build systems to integrate it • Heck, let’s integrate chemistry and biology data and add in disease data too if we can • Let’s extract property data and model it and see if we can extract new relationships – quantitative and qualitative • Let’s make it all available on the web…for free
  • 12. What about this…. • We’re going to map the world • We’re going to take photos of as many places as we can and link them together • We’ll let people annotate and curate the map • Then let’s make it available free on the web • We’ll make it available for decision making • Put it on Mobile Devices, give it away…
  • 13. Where is chemistry online? • Encyclopedic articles (Wikipedia) • Chemical vendor databases • Metabolic pathway databases • Property databases • Patents with chemical structures • Drug Discovery data • Scientific publications • Compound aggregators • Blogs/Wikis and Open Notebook Science
  • 14. • ~35 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!!
  • 23. …it has alkynes of trouble
  • 25. Molfiles
      10 9 0 0 1 0 0 0 0 0 1 V2000
      31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
      26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
      31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
      30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
      25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
      28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
      27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
      26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
      32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
      30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
      3 1 2 0 0 0 0
      4 1 1 0 0 0 0
      9 1 1 0 0 0 0
      7 2 1 0 0 0 0
      5 2 2 0 0 0 0
      8 2 1 0 0 0 0
      6 4 1 0 0 0 0
      4 10 1 6 0 0 0
      7 6 1 0 0 0 0
      M END
  • 26. Molfiles • Molfiles are the primary exchange format between structure drawing packages • Can be different between different drawing packages • Most commonly carry X,Y coordinates for layout • Can support polymers, organometallics, etc. • Can carry 3D coordinates
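To make the contents of the slide 25 record concrete, here is a minimal sketch that reads its counts line, atom block and bond block. It makes some assumptions: the function name `parse_v2000_tokens` is hypothetical, the three molfile header lines are absent (as in the transcript), each atom row has 16 whitespace-separated fields and each bond row has 7. Real molfiles are fixed-width and are better read with a cheminformatics toolkit.

```python
from collections import Counter

# Record from slide 25, one molfile row per line (header lines are absent).
RECORD = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""

ATOM_FIELDS = 16  # x, y, z, symbol and 12 integer columns (assumed layout)
BOND_FIELDS = 7   # first atom, second atom, bond order and 4 further columns


def parse_v2000_tokens(text):
    """Return (atoms, bonds) from a whitespace-tokenised V2000 record."""
    tokens = text.split()
    n_atoms, n_bonds = int(tokens[0]), int(tokens[1])
    pos = tokens.index("V2000") + 1  # first token after the counts line

    atoms = []
    for _ in range(n_atoms):
        x, y, z, symbol = tokens[pos:pos + 4]
        atoms.append((symbol, float(x), float(y), float(z)))
        pos += ATOM_FIELDS

    bonds = []
    for _ in range(n_bonds):
        first, second, order = (int(t) for t in tokens[pos:pos + 3])
        bonds.append((first, second, order))
        pos += BOND_FIELDS
    return atoms, bonds


atoms, bonds = parse_v2000_tokens(RECORD)
composition = Counter(symbol for symbol, *_ in atoms)  # heavy atoms only
print(len(atoms), len(bonds), dict(composition))
```

The record stores only heavy atoms with X,Y layout coordinates (all Z values are zero), which matches the point above that molfiles usually carry 2D layout.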
  • 31. InChI • SINGLE code base managed by IUPAC – integrated into drawing packages. No variability between implementations, unlike SMILES • InChI Strings can be converted back to structures, but with the same problem as SMILES: no layout • Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet
  • 35. InChI Strings Hash to InChIKeys
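Because an InChIKey is a hash, it has a fixed shape: 27 characters made up of a 14-letter block, a hyphen, a 10-letter block, a hyphen and one final letter. The sketch below checks only that shape, which is handy when scanning web text for candidate keys. It does not compute a key from an InChI (that needs the IUPAC InChI software), and the name `looks_like_inchikey` is illustrative.

```python
import re

# Shape of an InChIKey: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_SHAPE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(candidate: str) -> bool:
    """True if the string has the InChIKey layout; says nothing about validity."""
    return INCHIKEY_SHAPE.fullmatch(candidate.strip()) is not None


print(looks_like_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))        # key-shaped string
print(looks_like_inchikey("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))  # an InChI string, not a key
```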
  • 39. Data Quality/Standardization • MANY structures meant to be something online are MISREPRESENTED. • Commonly you will have better success finding information by name searches than structure – with many caveats of course… • Validating chemical structure representations is laborious work – and it’s shocking to review data…
  • 40. Data Quality Issues • Williams and Ekins, DDT, 16: 747-750 (2011) • Science Translational Medicine 2011
  • 41. Data quality is a known issue
  • 42. Data quality is a known issue
  • 43. Patent data in public databases
  • 44. Patent data in public databases
  • 45. You just can’t trust atoms!
  • 48. What is the Structure of Vitamin K1?
  • 50. Many Names, One Structure
  • 51. But big and often noisy
  • 52. Text Mining on IUPAC Names: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser. The reaction mixture was heated at reflux with stirring, for a period of about one-half hour. After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  • 53. Text Mining on IUPAC Names
  • 54. Name to Structure Conversion
  • 55. Name to Structure Conversion
  • 57. PhysChem first: Melting Points • Melting/sublimation/decomposition points extracted for 287,635 distinct compounds from 1976-2014 USPTO patent applications/grants • Sanity checks used to flag dubious values, e.g. a value that was probably meant to be 130-4°C • Non-melting outcomes recorded, e.g. mp 147-150°C. (subl.) • What models could be built?
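The extraction and sanity check on slide 57 can be sketched roughly as follows. This is not the pipeline used for the patent data. The regular expression, the 500 °C plausibility ceiling, the handling of abbreviated ranges (reading "130-4" as 130 to 134) and the name `parse_melting_point` are all assumptions made for illustration, and negative temperatures are deliberately not handled.

```python
import re
from typing import Optional

# A value or range followed by "°C"; the lookbehind skips signed or partial numbers.
MP_PATTERN = re.compile(
    r"(?<![-−\d.])(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*°\s*C"
)
OUTCOME_PATTERN = re.compile(r"\((subl|dec)", re.IGNORECASE)
MAX_PLAUSIBLE_C = 500.0  # assumed ceiling for an organic melting point


def parse_melting_point(text: str) -> Optional[dict]:
    """Extract a melting range and flag values that look dubious."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    low_txt, high_txt = match.group("low"), match.group("high")
    if high_txt is None:
        high_txt = low_txt
    elif low_txt.isdigit() and high_txt.isdigit() and len(high_txt) < len(low_txt):
        # Abbreviated upper bound: keep the leading digits of the lower bound.
        high_txt = low_txt[: -len(high_txt)] + high_txt
    low, high = float(low_txt), float(high_txt)
    outcome = OUTCOME_PATTERN.search(text)
    return {
        "low_c": low,
        "high_c": high,
        "outcome": outcome.group(1).lower() if outcome else "melt",
        "dubious": high < low or high > MAX_PLAUSIBLE_C,
    }


print(parse_melting_point("mp 147- 150°C. (subl.)"))
print(parse_melting_point("mp 130-4°C"))
print(parse_melting_point("mp 1304°C"))  # hypothetical missing-hyphen case
```

A flagged value is kept rather than discarded, so a curator can decide whether it is, for example, a range that lost its hyphen.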
  • 58. Modeling “BIG data” • Melting point models developed with ca. 300k compounds • Required 34 GB memory and about 400 MB disk space (zipped) • Matrix with 2 × 10^11 entries (300k molecules x 700k descriptors) • >12k core-hours (>600 CPU-days) for parameter optimization • Parallelized on > 600 cores with up to 24 cores per one task • Consensus model as average of individual models • Accuracy of consensus model is ~33.6 °C for drug-like region compounds • Models publicly available at http://ochem.eu
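Two of the numbers on slide 58 are easy to check or illustrate. The sketch below computes the descriptor matrix size and shows a consensus prediction as the plain average of individual model predictions. The toy predictions are invented for illustration, and the use of RMSE is an assumption, since the slide does not say which error metric the ~33.6 °C figure refers to.

```python
from math import sqrt

# Matrix size quoted on the slide: 300k molecules x 700k descriptors.
n_entries = 300_000 * 700_000  # 2.1e11, i.e. roughly 2 x 10^11


def consensus(predictions):
    """Average the per-compound predictions of several models."""
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


def rmse(predicted, observed):
    """Root-mean-square error between two equally long lists."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))


# Toy melting points in °C for two compounds from two hypothetical models.
model_a = [100.0, 50.0]
model_b = [110.0, 70.0]
averaged = consensus([model_a, model_b])
print(n_entries, averaged, rmse(averaged, [100.0, 60.0]))
```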
  • 60. ESI – Text Spectra
  • 62. We want to find text spectra? • We can find and index text spectra: 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC) • What would be better are spectral figures – and include assignments where possible!
  • 63. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
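Text spectra like those on slides 62 and 63 are regular enough to index with simple pattern matching. The following is a minimal sketch under stated assumptions: shifts are decimal numbers after the δ sign, anything followed by "Hz" is a coupling constant, and both ends of a multiplet range are kept. The function name `parse_text_spectrum` is hypothetical and the example strings are shortened from the slides.

```python
import re

# Nucleus label such as "13C NMR" or "1H NMR".
HEADER = re.compile(r"(?P<mass>\d+)(?P<element>[A-Z][a-z]?)\s+NMR")
# Decimal numbers that are not coupling constants (not followed by "Hz").
SHIFT = re.compile(r"(?<![\d.])(\d+\.\d+)(?![\d.])(?!\s*Hz)")


def parse_text_spectrum(text: str):
    """Return (nucleus, shifts in ppm) from a text spectrum."""
    header = HEADER.search(text)
    nucleus = header.group("mass") + header.group("element") if header else None
    _, _, peaks = text.partition("δ")  # ignore solvent and frequency
    return nucleus, [float(s) for s in SHIFT.findall(peaks)]


carbon = "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH), 66.12 (CH2), 117.72, 154.88 (ArC)"
proton = "1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H), 4.24 (d, 1H, J = 4.8 Hz), 7.18–7.94 (m, 11H, ArH)"
print(parse_text_spectrum(carbon))
print(parse_text_spectrum(proton))
```

Assignments in parentheses are dropped here; keeping them would need a second pass that pairs each shift with the text that follows it.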
  • 64. NMR Spectra • 2,316,005 distinct spectra in 2001-2015 USPTO • Count by nucleus: H 1,993,384; C 173,970; Unknown 107,439; F 22,158; P 16,333; B 980; Si 715; Pt 275; N 170; V 101
  • 65. ESI Data also contains figures
  • 66. “Where is the real data please?” FIGURE DATA
  • 67. Data added to ChemSpider
  • 68. Visibility Means Discoverability • Q: Does a Social Profile as a scientist matter? • You are visible when you share your skills, experience and research activities by: • Establishing a public profile • Getting on the record • Collaborative Science • Demonstrating a skill set • Measured using “alternative metrics” • Contributing to the public peer review process • There are many ways to become “visible”
  • 70. How to Measure Impact
  • 71. Your Research Outputs? • Research datasets • Scientific software • Publications – peer-reviewed and many others • Posters and presentations at conferences • Electronic theses and dissertations • Performances in film and audio • Lectures, online classes and teaching activities • What else??? • The possibilities to share are endless
  • 72. Open Researcher & Contributor ID
  • 73. Here’s why they are useful…
  • 75. CONTRIBUTE to the community • Share your expertise in the new world of open • Share your Figures, share your data • Contribute to Wikis – Wikipedia and others • Participate in Open Notebook Science • Build tools and platforms to support chemists • Curate, use and comment on data • Get engaged on blogs and discussions
  • 79. The new world of micropublishing
  • 81. Micropublishing with Peer Review (a chemical synthesis blog?)
  • 84. You should be LinkedIn • LinkedIn for “professionals” • Expose work history, skills, your professional interests, your memberships – your profile WILL be watched! • Who you are linked to says a lot about who you are. Get Linked to people in your domain. • Professional relationships rather than just friendships. FaceBook-it for friends
  • 91. My Google Scholar Profile http://scholar.google.com/citations?user=O2L8nh4AAAAJ
  • 92. “I don’t have any publications” • This is YOUR choice! Conference Abstracts… • You produce reports, presentations and posters during your studies – share them!
  • 96. Slideshare – NOT Just Slides
  • 100. I have a set of statistics & profiles • My Blog: www.chemconnector.com • Twitter: http://twitter.com/ChemConnector • ORCID: http://orcid.org/0000-0002-2668-4821 • Amazon Author Page: Follow Link to Author Page • My Klout: http://www.klout.com/#/ChemConnector • LinkedIn: http://www.linkedin.com/in/antonywilliams • SlideShare: http://www.slideshare.net/AntonyWilliams • Google Scholar Citations Profile: Antony Williams Citations • Wikipedia : http://en.wikipedia.org/wiki/Antony_John_Williams
  • 101. The Power of Social Media
  • 102. I recommend… • Register for an ORCID ID – then use it • Develop your LinkedIn profile • Publish to Slideshare • Track Google Scholar Citations (for now) • Choose: ResearchGate or Academia.edu • Set up an About.ME page to link everything • Participate in building your profile
  • 103. Thank you Email: tony27587@gmail.com ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams

Editor's Notes

  1. Toxcast can help investigate particular endpoints for a chemical – an abundance of relevant data to model.
  2. US20140329929A1, The melting point and both NMR spectra are associated with the compound. Other physical quantities e.g. volumes, pressures etc. are also detected
  3. Mostly melting points (as opposed to sublimation/decomposition). Dubious values are usually mistakes in the original document, e.g. in this case probably a missing hyphen.
  4. Unknown spectra are almost always hydrogen. As carbon shifts are so different to hydrogen, a very crude check could partition the unknowns into proton and carbon NMR. Small numbers of other obscure spectra were also found (but also false positives due to really bizarre “OCR” errors of hydrogen or the like, i.e. 1 in a million errors :-p)
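The crude check suggested in note 4 could look something like the sketch below: proton shifts rarely exceed about 12 ppm, whereas most carbon spectra contain peaks well above that. The 20 ppm cut-off and the name `guess_nucleus` are assumptions for illustration, and a carbon spectrum containing only low-shift aliphatic peaks would be misclassified.

```python
def guess_nucleus(shifts, proton_max_ppm=20.0):
    """Guess 1H vs 13C from the largest chemical shift in a spectrum."""
    if not shifts:
        return "unknown"
    return "1H" if max(shifts) <= proton_max_ppm else "13C"


print(guess_nucleus([2.57, 4.24, 7.94]))      # shifts typical of a proton spectrum
print(guess_nucleus([14.12, 66.12, 154.88]))  # shifts typical of a carbon spectrum
```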