This document discusses building an online profile as a scientist in the era of big data and open science. It begins with an overview of the speaker's background working in academia, industry, and as an entrepreneur. The speaker then discusses various online tools and platforms that scientists can use to share their work and expertise, such as ORCID, LinkedIn, Google Scholar, SlideShare, and ResearchGate. He emphasizes the importance of making contributions openly available online in order to increase visibility and measure impact through alternative metrics. The speaker also provides examples of using these tools to showcase his own career and publications.
Data integration and building a profile for yourself as an online scientist
1. Chemical Information in the Big Data
Era: Data Quality, Data Integration
and Building a Profile for Yourself as
an Online Scientist
Antony Williams
ORCID ID:0000-0002-2668-4821
2. My background…
• From 1985-present day
• PhD’ed in the UK
• Canadian Government lab as postdoc
• Academia as NMR Facility Manager
• Fortune 500 Company as Technology Leader
• Start-up – product manager and CSO
• Consultant – chemistry informatics industry
• Entrepreneur – Created “ChemSpider”
• Publisher - Royal Society of Chemistry
• EPA-NCCT as cheminformatics expert
4. CASE Systems – Natural Products
[Figure: 13C NMR shift assignments (14.40 to 153.00 ppm) for a natural product, with three ranked candidate structures and their 13C shift deviations]
Candidate  dA(13C)  dN(13C)  dI(13C)  max_dA(13C)
1          1.719    2.016    2.313    8.580
2          3.534    4.812    3.684    13.280
3          4.010    4.662    3.610    12.230
8. My Hopes for Today
• Encourage you in the “era of participation”
• Provide an overview of tools available
• Share some stories, statistics and strategies
• Encourage you to “share for the sake of science”
OUTCOMES
• You will claim an ORCiD
• You will take responsibility for your online profile
• You will invest >1 hour per week
9. I would tell a chemistry joke…
But all of the good ones…
10. An ambitious idea….
• Let’s map together all online chemistry data
and build systems to integrate it
• Heck, let’s integrate chemistry and biology
data and add in disease data too if we can
• Let’s extract property data and model it and
see if we can extract new relationships –
quantitative and qualitative
• Let’s make it all available on the web…for
free
12. What about this….
• We’re going to map the world
• We’re going to take photos of as many
places as we can and link them together
• We’ll let people annotate and curate the map
• Then let’s make it available free on the web
• We’ll make it available for decision making
• Put it on Mobile Devices, give it away…
13. Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
14. • ~35 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our journals
and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
26. Molfiles
• Molfiles are the primary exchange format
between structure drawing packages
• Can be different between different drawing
packages
• Most commonly carry X,Y coordinates for
layout
• Can support polymers, organometallics, etc.
• Can carry 3D coordinates
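As a rough illustration of how a molfile carries coordinates, the sketch below reads the counts line and atom block of a V2000-style molfile by fixed-width slicing. The three-atom ethanol skeleton, the header text and the function name `parse_molfile_atoms` are assumptions made for this example; real files vary between drawing packages and are better read with a cheminformatics toolkit.

```python
# Hypothetical V2000-style molfile for the heavy atoms of ethanol (C-C-O).
EXAMPLE_MOLFILE = "\n".join([
    "ethanol",
    "  hypothetical-writer",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.2990    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.5981    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])


def parse_molfile_atoms(molfile_text):
    """Return (symbol, x, y, z) tuples from the atom block of a V2000 molfile."""
    lines = molfile_text.splitlines()
    counts_line = lines[3]             # three header lines come first
    n_atoms = int(counts_line[0:3])    # fixed-width field: number of atoms
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])          # X, Y, Z occupy 10 characters each
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()   # element symbol field
        atoms.append((symbol, x, y, z))
    return atoms


atoms = parse_molfile_atoms(EXAMPLE_MOLFILE)
```

Here every Z value is zero, which is the common 2D layout case; a 3D molfile would use the same columns with non-zero Z coordinates.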
31. InChI
• SINGLE code base managed by IUPAC –
integrated into drawing packages. No
variability as with SMILES
• InChI Strings can be reversed to structures –
same problem as with SMILES – no layout
• Adopted by the community (databases,
blogs, Wikipedia) – good for searching the
internet
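To make the layered form of an InChI string concrete, the sketch below splits one into its version, formula and prefixed layers. The helper name `split_inchi_layers` is hypothetical, and the code only handles the simple case where the formula is the second field and each layer prefix appears once; generating or reversing InChIs needs the IUPAC software or a toolkit that wraps it.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into version, formula and single-letter layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            # First character names the layer, e.g. 'c' connectivity, 'h' hydrogens.
            layers[part[0]] = part[1:]
    return layers


ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ethanol_layers = split_inchi_layers(ETHANOL_INCHI)
```

Note that none of the layers carries layout coordinates, which is why a structure regenerated from an InChI has to be redrawn.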
39. Data Quality/Standardization
• MANY structures meant to be something online
are MISREPRESENTED.
• Commonly you will have better success finding
information by name searches than structure –
with many caveats of course…
• Validating chemical structure representations is
laborious work – and it’s shocking to review
data…
52. Text Mining on IUPAC Names
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5
ml ) and benzene ( 50 ml ) were charged into a glass reaction
vessel equipped with a mechanical stirrer , thermometer and
reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were
stripped from the reaction mixture under reduced pressure to
yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-
trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
57. PhysChem first: Melting Points
• Melting/sublimation/decomposition points
extracted for 287,635 distinct compounds from
1976-2014 USPTO patent applications/grants
• Sanity checks used to flag dubious values, e.g. a value that
should probably read 130-4°C (missing hyphen in the source)
• Non-melting outcomes recorded e.g. mp 147-
150°C. (subl.)
• What models could be built?
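The extraction and sanity-check steps listed above could be sketched roughly as follows. This is not the pipeline used for the patent data: the regular expression, the 500 °C plausibility cut-off, the function name and the test value `1304` (a made-up example of a missing hyphen) are all assumptions for illustration.

```python
import re

# Matches e.g. "mp 147-150°C. (subl.)", "m.p. 130-4°C", "mp 98.5°C (dec.)".
MP_PATTERN = re.compile(
    r"\b(?:mp|m\.p\.)\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*°\s*C"
    r"(?:\.?\s*\((subl|dec)\.?\))?",
    re.IGNORECASE,
)

MAX_PLAUSIBLE_C = 500.0  # assumed cut-off for flagging dubious values


def extract_melting_points(text):
    """Return one record per melting point found, with a 'dubious' flag."""
    records = []
    for match in MP_PATTERN.finditer(text):
        low_s, high_s, outcome = match.group(1), match.group(2), match.group(3)
        low = float(low_s)
        high = low
        if high_s is not None:
            # Expand abbreviated ranges such as "130-4" to 130-134.
            if "." not in low_s and "." not in high_s and len(high_s) < len(low_s):
                high_s = low_s[:len(low_s) - len(high_s)] + high_s
            high = float(high_s)
        records.append({
            "low": low,
            "high": high,
            "outcome": (outcome or "melt").lower(),
            "dubious": high < low or high > MAX_PLAUSIBLE_C,
        })
    return records


sublimed = extract_melting_points("mp 147-150°C. (subl.)")
abbreviated = extract_melting_points("mp 130-4°C")
suspicious = extract_melting_points("mp 1304°C")
```

Flagging rather than discarding keeps the dubious values available for manual review against the original document.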
58. Modeling “BIG data”
• Melting point models developed with ca. 300k compounds
• Required 34 GB memory and about 400 MB disk space (zipped)
• Matrix with 2*10^11 entries (300k molecules x 700k descriptors)
• >12k core-hours (>600 CPU-days) for parameter optimization
• Parallelized on > 600 cores with up to 24 cores per one task
• Consensus model as average of individual models
• Accuracy of consensus model is ~33.6 °C for drug-like region
compounds
• Models publicly available at http://ochem.eu
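Two of the points above lend themselves to a quick check: the size of the descriptor matrix, and what "consensus model as average of individual models" means for a single compound. The sketch below uses the rounded figures from the slide; the function name and the example predictions are hypothetical, and the actual models at ochem.eu are not reproduced here.

```python
def consensus_prediction(predictions):
    """Average the melting points predicted by several individual models."""
    if not predictions:
        raise ValueError("need at least one model prediction")
    return sum(predictions) / len(predictions)


# Rounded figures from the slide.
n_molecules = 300_000
n_descriptors = 700_000
matrix_entries = n_molecules * n_descriptors  # 2.1e11, i.e. roughly 2*10^11

# Hypothetical predictions (in °C) from three individual models.
example_consensus = consensus_prediction([120.0, 130.0, 125.0])
```

At this scale the matrix can only be handled in sparse or chunked form, which is consistent with the memory and core-hour figures quoted above.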
62. We want to find text spectra?
• We can find and index text spectra: 13C NMR
(CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH,
benzylic methane), 30.77 (CH, benzylic
methane), 66.12 (CH2), 68.49 (CH2), 117.72,
118.19, 120.29, 122.67, 123.37, 125.69, 125.84,
129.03, 130.00, 130.53 (ArCH), 99.42, 123.60,
134.69, 139.23, 147.21, 147.61, 149.41,
152.62, 154.88 (ArC)
• What would be better are spectral figures – and
include assignments where possible!
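Indexing a text spectrum like the one above amounts to pulling out the nucleus, solvent, frequency and list of shifts. The sketch below shows one way this might be done with regular expressions; the patterns and the function name `parse_text_spectrum` are assumptions, the example string is a shortened version of the slide text, and real patent text is far messier than this.

```python
import re

# Header such as "13C NMR (CDCl3, 100 MHz)".
HEADER = re.compile(
    r"(\d+)([A-Z][a-z]?)\s+NMR\s*\(([^,]+),\s*(\d+(?:\.\d+)?)\s*MHz\)"
)


def parse_text_spectrum(text):
    """Return nucleus, solvent, frequency and shifts from a text spectrum."""
    header = HEADER.search(text)
    if header is None:
        return None
    # Only look for shifts after the delta sign, so the frequency is skipped.
    _, _, peak_text = text.partition("δ")
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", peak_text)]
    return {
        "nucleus": header.group(1) + header.group(2),
        "solvent": header.group(3).strip(),
        "frequency_mhz": float(header.group(4)),
        "shifts_ppm": shifts,
    }


EXAMPLE = ("13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH), "
           "66.12 (CH2), 117.72, 154.88 (ArC)")
spectrum = parse_text_spectrum(EXAMPLE)
```

The assignments in parentheses are dropped here; keeping them would need a pattern that pairs each shift with the bracket that follows it.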
64. NMR Spectra
• 2,316,005 distinct spectra in 2001-2015 USPTO
Nucleus   Count
H         1,993,384
C         173,970
Unknown   107,439
F         22,158
P         16,333
B         980
Si        715
Pt        275
N         170
V         101
68. Visibility Means Discoverability
• Q: Does a Social Profile as a scientist matter?
• You are visible, when you share your skills,
experience and research activities by:
• Establishing a public profile
• Getting on the record
• Collaborative Science
• Demonstrating a skill set
• Measured using “alternative metrics”
• Contributing to the public peer review process
• There are many ways to become “visible”
71. Your Research Outputs?
• Research datasets
• Scientific software
• Publications – peer-reviewed and many others
• Posters and presentations at conferences
• Electronic theses and dissertations
• Performances in film and audio
• Lectures, online classes and teaching activities
• What else???
• The possibilities to share are endless
75. CONTRIBUTE to the
community
• Share your expertise in the new world of open
• Share your Figures, share your data
• Contribute to Wikis – Wikipedia and others
• Participate in Open Notebook Science
• Build tools and platforms to support chemists
• Curate, use and comment on data
• Get engaged on blogs and discussions
84. You should be LinkedIn
• LinkedIn for “professionals”
• Expose work history, skills, your professional
interests, your memberships – your profile
WILL be watched!
• Who you are linked to says a lot about who
you are. Get Linked to people in your
domain.
• Professional relationships rather than just
friendships. FaceBook-it for friends
91. My Google Scholar Profile
http://scholar.google.com/citations?user=O2L8nh4AAAAJ
92. “I don’t have any publications”
• This is YOUR choice! Conference Abstracts..
• You produce reports, presentations and
posters during your studies – share them!
100. I have a set of statistics & profiles
• My Blog: www.chemconnector.com
• Twitter: http://twitter.com/ChemConnector
• ORCID: http://orcid.org/0000-0002-2668-4821
• Amazon Author Page: Follow Link to Author Page
• My Klout: http://www.klout.com/#/ChemConnector
• LinkedIn: http://www.linkedin.com/in/antonywilliams
• SlideShare: http://www.slideshare.net/AntonyWilliams
• Google Scholar Citations Profile: Antony Williams Citations
• Wikipedia : http://en.wikipedia.org/wiki/Antony_John_Williams
102. I recommend…
• Register for an ORCID ID – then use it
• Develop your LinkedIn profile
• Publish to Slideshare
• Track Google Scholar Citations (for now)
• Choose: ResearchGate or Academia.edu
• Set up an About.ME page to link everything
• Participate in building your profile
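When an ORCID iD is used in profiles and manuscripts, a mistyped digit can be caught because the last character is a check character (ISO 7064 MOD 11-2, as described in ORCID's own documentation). The sketch below is a minimal check under that assumption; the function names are hypothetical and it does not confirm that the iD is actually registered.

```python
def orcid_check_character(base_digits):
    """Compute the MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the length, digits and check character of an ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


speaker_id_ok = is_valid_orcid("0000-0002-2668-4821")   # iD from the title slide
mistyped_id_ok = is_valid_orcid("0000-0002-2668-4822")  # last digit changed
```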
ToxCast can help investigate particular endpoints for a chemical – an abundance of relevant data to model.
US20140329929A1: the melting point and both NMR spectra are associated with the compound. Other physical quantities, e.g. volumes and pressures, are also detected.
Mostly melting points (as opposed to sublimation/decomposition). Dubious values are usually mistakes in the original document, e.g. in this case probably a missing hyphen.
Unknown spectra are almost always hydrogen. As carbon shifts are so different from hydrogen shifts, a very crude check could partition the unknowns into proton and carbon NMR. Small numbers of other obscure spectra were also found (but also false positives due to really bizarre “OCR” errors of hydrogen or the like, i.e. 1-in-a-million errors :-p)
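The crude partition of unknown spectra into proton and carbon NMR mentioned in the note above might look something like this. The 20 ppm threshold and the function name are assumptions chosen for illustration, not the check actually applied to the patent data, and the rule would mislabel other nuclei or unusual spectra.

```python
PROTON_MAX_PPM = 20.0  # assumed upper bound for proton shifts


def guess_nucleus(shifts_ppm):
    """Guess 'H' or 'C' for an unlabelled spectrum from its largest shift."""
    if not shifts_ppm:
        return "Unknown"
    return "C" if max(shifts_ppm) > PROTON_MAX_PPM else "H"


carbon_like = guess_nucleus([14.12, 30.11, 154.88])
proton_like = guess_nucleus([1.25, 3.70, 7.26])
empty_case = guess_nucleus([])
```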