Data integration and building a profile for yourself as an online scientist

Many of us now invest significant time in sharing our activities and opinions with friends and family via social networking tools. Many platforms also exist for scientists to connect and share with their peers, yet most scientists do not use them, despite their promise and their potential influence on our future careers. We are already being indexed and exposed on the internet via our publications, presentations and data. We also have many more ways to contribute to science: to annotate and curate data and to “publish” in new ways, and many of these activities are part of a growing crowdsourcing network. This presentation provides an overview of the networking and collaborative sites available to scientists and of ways to expose your scientific activities online. Many of these activities can ultimately contribute to the developing measures of you as a scientist in the new world of alternative metrics. Participating offers a great opportunity to develop a scientific profile within the community and may ultimately be very beneficial, especially to scientists early in their careers.

Published in: Science

  1. 1. Chemical Information in the Big Data Era: Data Quality, Data Integration and Building a Profile for Yourself as an Online Scientist • Antony Williams • ORCID ID: 0000-0002-2668-4821
  2. 2. My background… • From 1985-present day • PhD’ed in the UK • Canadian Government lab as postdoc • Academia as NMR Facility Manager • Fortune 500 Company as Technology Leader • Start-up – product manager and CSO • Consultant – chemistry informatics industry • Entrepreneur – Created “ChemSpider” • Publisher - Royal Society of Chemistry • EPA-NCCT as cheminformatics expert
  3. 3. Of interest to faculty?
  4. 4. CASE Systems – Natural Products • 13C shifts (ppm): CH3 14.40 (fb), CH3 16.80 (fb), CH2 19.30 (fb), CH2 21.60 (fb), CH3 21.70 (fb), CH2 24.40 (fb), CH3 26.40, C 33.50 (fb), CH3 33.50 (fb), CH2 38.30 (fb), CH2 38.30 (fb), CH2 39.10 (fb), C 39.60, CH2 40.20, CH2 42.10 (fb), CH 55.50 (fb), CH 56.20 (fb), C 106.10, CH2 106.20 (fb), CH 120.80, C 141.90, C 146.00, C 148.40, C 148.50, CH 151.30, C 153.00 • [Figure: three ranked candidate structures] • Candidate 1: dA(13C) 1.719, dN(13C) 2.016, dI(13C) 2.313, max_dA(13C) 8.580 • Candidate 2: dA(13C) 3.534, dN(13C) 4.812, dI(13C) 3.684, max_dA(13C) 13.280 • Candidate 3: dA(13C) 4.010, dN(13C) 4.662, dI(13C) 3.610, max_dA(13C) 12.230
  5. 5. Maybe you know this???
  6. 6. Computational Analysis at NCCT
  7. 7. Public Access and Systems
  8. 8. My Hopes for Today • Encourage you in the “era of participation” • Provide an overview of tools available • Share some stories, statistics and strategies • Encourage you to “share for the sake of science” OUTCOMES • You will claim an ORCiD • You take responsibility for your online profile • You will invest >1 hour per week
  9. 9. I would tell a chemistry joke… But all of the good ones…
  10. 10. An ambitious idea…. • Let’s map together all online chemistry data and build systems to integrate it • Heck, let’s integrate chemistry and biology data and add in disease data too if we can • Let’s extract property data and model it and see if we can extract new relationships – quantitative and qualitative • Let’s make it all available on the web…for free
  11. 11. What about this…. • We’re going to map the world • We’re going to take photos of as many places as we can and link them together • We’ll let people annotate and curate the map • Then let’s make it available free on the web • We’ll make it available for decision making • Put it on Mobile Devices, give it away…
  12. 12. Where is chemistry online? • Encyclopedic articles (Wikipedia) • Chemical vendor databases • Metabolic pathway databases • Property databases • Patents with chemical structures • Drug Discovery data • Scientific publications • Compound aggregators • Blogs/Wikis and Open Notebook Science
  13. 13. • ~35 million chemicals and growing • Data sourced from >500 different sources • Crowd sourced curation and annotation • Ongoing deposition of data from our journals and our collaborators • Structure centric hub for web-searching • …and a really big dictionary!!!
  14. 14. ChemSpider
  15. 15. ChemSpider
  16. 16. Experimental/Predicted Properties
  17. 17. Literature references
  18. 18. Patents references
  19. 19. RSC Books
  20. 20. Google Books
  21. 21. Organic Chemistry is hard…
  22. 22. …it has alkynes of trouble
  23. 23. Flavors of Chemistry
  24. 24. Molfiles
         10  9  0  0  1  0  0  0  0  0  1 V2000
           31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
           26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
           31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
           30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
           25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
           28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
           27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
           26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
           32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
           30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
          3  1  2  0  0  0  0
          4  1  1  0  0  0  0
          9  1  1  0  0  0  0
          7  2  1  0  0  0  0
          5  2  2  0  0  0  0
          8  2  1  0  0  0  0
          6  4  1  0  0  0  0
          4 10  1  6  0  0  0
          7  6  1  0  0  0  0
        M  END
  25. 25. Molfiles • Molfiles are the primary exchange format between structure drawing packages • Can be different between different drawing packages • Most commonly carry X,Y coordinates for layout • Can support polymers, organometallics, etc. • Can carry 3D coordinates
  26. 26. Stereo
  27. 27. Tautomeric forms
  28. 28. Chemists are good…
  29. 29. The InChI Identifier
  30. 30. InChI • SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES • InChI Strings can be reversed to structures – same problem as with SMILES – no layout • Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet
  31. 31. Multiple Layers
  32. 32. Tautomers
  33. 33. Stereo
  34. 34. InChIStrings Hash to InChIKeys
  35. 35. Structure search the web
  36. 36. Exact Search
  37. 37. Skeleton Search
  38. 38. Data Quality/Standardization • MANY structures meant to be something online are MISREPRESENTED. • Commonly you will have better success finding information by name searches than structure – with many caveats of course… • Validating chemical structure representations is laborious work – and it’s shocking to review data…
  39. 39. Data Quality Issues • Williams and Ekins, DDT, 16: 747-750 (2011) • Science Translational Medicine 2011
  40. 40. Data quality is a known issue
  41. 41. Data quality is a known issue
  42. 42. Patent data in public databases
  43. 43. Patent data in public databases
  44. 44. You just can’t trust atoms!
  45. 45. Depiction vs Accurate Representation
  46. 46. Depiction vs Accurate Representation
  47. 47. What is the Structure of Vitamin K1?
  48. 48. Data Quality Issues and $$$$
  49. 49. Many Names, One Structure
  50. 50. But big and often noisy
  51. 51. Text Mining on IUPAC Names The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  52. 52. Text Mining on IUPAC Names The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser . The reaction mixture was heated at reflux with stirring , for a period of about one-half hour . After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
  53. 53. Name to Structure Conversion
  54. 54. Name to Structure Conversion
  55. 55. What could we get?
  56. 56. PhysChem first: Melting Points • Melting/sublimation/decomposition points extracted for 287,635 distinct compounds from 1976-2014 USPTO patent applications/grants • Sanity checks used to flag dubious values – probably 130-4°C • Non-melting outcomes recorded e.g. mp 147-150°C. (subl.) • What models could be built?
  57. 57. Modeling “BIG data” • Melting point models developed with ca. 300k compounds • Required 34 GB memory and about 400 MB disk space (zipped) • Matrix with 2×10^11 entries (300k molecules × 700k descriptors) • >12k core-hours (>600 CPU-days) for parameter optimization • Parallelized on >600 cores with up to 24 cores per task • Consensus model as average of individual models • Accuracy of consensus model is ~33.6 °C for drug-like region compounds • Models publicly available at http://ochem.eu
  58. 58. A Recent Talk http://www.slideshare.net/AntonyWilliams/
  59. 59. ESI – Text Spectra
  60. 60. ChemSpider ID 24528095 1H NMR
  61. 61. We want to find text spectra? • We can find and index text spectra: 13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC) • What would be better are spectral figures – and include assignments where possible!
  62. 62. 1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
  63. 63. NMR Spectra • 2,316,005 distinct spectra in 2001-2015 USPTO
        Nucleus   Count
        H         1,993,384
        C         173,970
        Unknown   107,439
        F         22,158
        P         16,333
        B         980
        Si        715
        Pt        275
        N         170
        V         101
  64. 64. ESI Data also contains figures
  65. 65. “Where is the real data please?” FIGURE DATA
  66. 66. Data added to ChemSpider
  67. 67. Visibility Means Discoverability • Q: Does a Social Profile as a scientist matter? • You are visible when you share your skills, experience and research activities by: • Establishing a public profile • Getting on the record • Collaborative Science • Demonstrating a skill set • Measured using “alternative metrics” • Contributing to the public peer review process • There are many ways to become “visible”
  68. 68. Scientists measured by Impact
  69. 69. How to Measure Impact
  70. 70. Your Research Outputs? • Research datasets • Scientific software • Publications – peer-reviewed and many others • Posters and presentations at conferences • Electronic theses and dissertations • Performances in film and audio • Lectures, online classes and teaching activities • What else??? • The possibilities to share are endless
  71. 71. Open Researcher & Contributor ID
  72. 72. Here’s why they are useful…
  73. 73. Wonderful Profile…
  74. 74. CONTRIBUTE to the community • Share your expertise in the new world of open • Share your Figures, share your data • Contribute to Wikis – Wikipedia and others • Participate in Open Notebook Science • Build tools and platforms to support chemists • Curate, use and comment on data • Get engaged on blogs and discussions
  75. 75. Oxidation by Sodium Hydride?
  76. 76. The Blogosphere Analyzes…
  77. 77. The Blogosphere Analyzes…
  78. 78. The new world of micropublishing
  79. 79. ChemSpider SyntheticPages
  80. 80. Micropublishing with Peer Review (a chemical synthesis blog?)
  81. 81. Multi-Step Synthesis
  82. 82. Interactive Data
  83. 83. You should be LinkedIn • LinkedIn for “professionals” • Expose work history, skills, your professional interests, your memberships – your profile WILL be watched! • Who you are linked to says a lot about who you are. Get Linked to people in your domain. • Professional relationships rather than just friendships. FaceBook-it for friends
  84. 84. LinkedIn http://www.linkedin.com/in/AntonyWilliams
  85. 85. My Career Captured…
  86. 86. And “Endorsements”
  87. 87. Highlight “Projects”
  88. 88. Manage Articles Here Too.
  89. 89. …and presentations
  90. 90. My Google Scholar Profile http://scholar.google.com/citations?user=O2L8nh4AAAAJ
  91. 91. “I don’t have any publications” • This is YOUR choice! Conference Abstracts.. • You produce reports, presentations and posters during your studies – share them !
  92. 92. Slideshare – Highly Accessed
  93. 93. Slideshare – EXPANDED Audience
  94. 94. Fast Network Communication
  95. 95. Slideshare – NOT Just Slides
  96. 96. ResearchGate https://www.researchgate.net/profile/Antony_Williams
  97. 97. ResearchGate
  98. 98. ResearchGate
  99. 99. I have a set of statistics & profiles • My Blog: www.chemconnector.com • Twitter: http://twitter.com/ChemConnector • ORCID: http://orcid.org/0000-0002-2668-4821 • Amazon Author Page: Follow Link to Author Page • My Klout: http://www.klout.com/#/ChemConnector • LinkedIn: http://www.linkedin.com/in/antonywilliams • SlideShare: http://www.slideshare.net/AntonyWilliams • Google Scholar Citations Profile: Antony Williams Citations • Wikipedia : http://en.wikipedia.org/wiki/Antony_John_Williams
  100. 100. The Power of Social Media
  101. 101. I recommend… • Register for an ORCID ID – then use it • Develop your LinkedIn profile • Publish to Slideshare • Track Google Scholar Citations (for now) • Choose: ResearchGate or Academia.edu • Set up an About.ME page to link everything • Participate in building your profile
  102. 102. Thank you Email: tony27587@gmail.com ORCID: 0000-0002-2668-4821 Twitter: @ChemConnector Personal Blog: www.chemconnector.com SLIDES: www.slideshare.net/AntonyWilliams
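
Slides 24 and 25 describe molfiles as the main exchange format between drawing packages. As a rough illustration of what such a file carries, the sketch below reads the connection table shown on slide 24 and tallies atoms and bonds. It is a minimal sketch with assumptions: the names `parse_ctab`, `Atom` and `Bond` are made up here, the three molfile header lines (absent from the slide) are skipped, and fields are split on whitespace even though real V2000 files use fixed-width columns, so this would break on files where fields run together.

```python
from collections import Counter
from typing import NamedTuple

# Connection table from slide 24. The three header lines of a full molfile
# are not on the slide, so parsing starts at the counts line.
CTAB = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


class Atom(NamedTuple):
    x: float
    y: float
    z: float
    symbol: str


class Bond(NamedTuple):
    begin: int   # 1-based index of the first atom
    end: int     # 1-based index of the second atom
    order: int   # 1 = single, 2 = double, 3 = triple
    stereo: int  # 0 = none; non-zero marks a wedge or hash bond


def parse_ctab(text):
    """Return (atoms, bonds) from a whitespace-separated V2000 connection table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        f = ln.split()
        atoms.append(Atom(float(f[0]), float(f[1]), float(f[2]), f[3]))

    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        f = ln.split()
        bonds.append(Bond(int(f[0]), int(f[1]), int(f[2]), int(f[3])))
    return atoms, bonds


atoms, bonds = parse_ctab(CTAB)
heavy_atoms = Counter(a.symbol for a in atoms)
double_bonds = [b for b in bonds if b.order == 2]
stereo_bonds = [b for b in bonds if b.stereo != 0]
```

For the slide's table this gives five carbons, two nitrogens and three oxygens, two double bonds, and one stereo bond from atom 4 to atom 10. Hydrogens are implicit, and the z coordinates are all zero, which matches the slide's point that molfiles most commonly carry X,Y layout coordinates.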
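
Slides 34 to 37 move from InChIKeys to exact and skeleton searching. A standard InChIKey has three hyphen-separated blocks: 14 letters hashed from the connectivity, 8 letters hashed from the remaining layers (such as stereo) followed by a standard/non-standard flag and a version letter, and one protonation letter. The sketch below splits a key into those parts and compares keys either in full or on the first block only. Treating "skeleton search" as a first-block match is an assumption about what the slides show, the function names are made up for illustration, and the two example keys (believed to be L-alanine and alanine without stereo) are quoted from memory and should be checked before being relied on.

```python
import re

# 14 letters, 8 letters + flag + version, 1 protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its named parts, or raise ValueError."""
    match = INCHIKEY_RE.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # hash of the connectivity layers
        "detail": detail,            # hash of stereo, isotope and other layers
        "standard": flag == "S",     # S = standard InChI, N = non-standard
        "version": version,          # A = InChI version 1
        "protonation": protonation,  # N = neutral
    }


def exact_match(key_a, key_b):
    """Exact search: the whole key must agree."""
    return split_inchikey(key_a) == split_inchikey(key_b)


def skeleton_match(key_a, key_b):
    """Skeleton search (assumed meaning): only the first block must agree."""
    return split_inchikey(key_a)["skeleton"] == split_inchikey(key_b)["skeleton"]


# Illustrative keys, quoted from memory (assumption): L-alanine and
# alanine with no stereo layer.
L_ALANINE = "QNAYBMKLOCPYGJ-REOHSZHFSA-N"
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

same_exact = exact_match(L_ALANINE, ALANINE_NO_STEREO)
same_skeleton = skeleton_match(L_ALANINE, ALANINE_NO_STEREO)
```

Because the key is a fixed-length string of capital letters, either form can be pasted into an ordinary web search engine, which is what makes it convenient for searching the internet as slide 30 notes.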
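
Slide 56 mentions sanity checks on extracted melting points and the recording of non-melting outcomes such as sublimation. The sketch below shows one way such a check might look. It is not the pipeline used for the patent extraction: the regular expression, the `parse_mp` name, the expansion of abbreviated ranges like "130-4" and the 500 °C plausibility threshold are all assumptions chosen for illustration.

```python
import re
from typing import NamedTuple, Optional

MP_RE = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"                    # lower bound or single value
    r"(?:\s*[-–]\s*(?P<high>\d+(?:\.\d+)?))?"    # optional upper bound
    r"\s*°\s*C\.?"                               # unit, optional trailing period
    r"(?:\s*\((?P<note>subl|dec)\.?\))?"         # optional non-melting outcome
)


class MeltingPoint(NamedTuple):
    low: float
    high: float
    outcome: str   # "melt", "subl" or "dec"
    dubious: bool  # True when the value fails the sanity checks


def parse_mp(text, max_plausible=500.0) -> Optional[MeltingPoint]:
    """Extract the first melting point in text; None if nothing matches."""
    match = MP_RE.search(text)
    if match is None:
        return None
    low_s, high_s = match.group("low"), match.group("high")
    low = float(low_s)
    if high_s is None:
        high = low
    else:
        # Expand abbreviated integer ranges, e.g. "130-4" means 130-134.
        abbreviated = (
            "." not in low_s and "." not in high_s
            and len(high_s) < len(low_s) and float(high_s) < low
        )
        if abbreviated:
            high_s = low_s[: len(low_s) - len(high_s)] + high_s
        high = float(high_s)
    outcome = match.group("note") or "melt"
    # Hypothetical checks: reversed range or implausibly high temperature.
    dubious = high < low or high > max_plausible
    return MeltingPoint(low, high, outcome, dubious)


sublimed = parse_mp("mp 147-150°C. (subl.)")
abbreviated = parse_mp("mp 130-4°C")
suspicious = parse_mp("mp 1304°C")
```

With these assumptions, "mp 147-150°C. (subl.)" is recorded as a sublimation between 147 and 150, "mp 130-4°C" is read as 130 to 134, and "mp 1304°C" is kept but flagged for review rather than silently corrected.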
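
Slide 57 quotes the size of the descriptor matrix and describes the consensus model as an average of individual models. The sketch below checks the arithmetic and shows the averaging step on toy numbers. The function name `consensus` and the toy predictions are made up for illustration; nothing here reproduces the actual models at ochem.eu.

```python
from statistics import fmean

# Matrix size quoted on the slide: 300k molecules x 700k descriptors.
n_molecules = 300_000
n_descriptors = 700_000
entries = n_molecules * n_descriptors  # 2.1e11, i.e. roughly 2 x 10^11

# Stored densely as 8-byte floats this would need about 1.68 TB, far more
# than the 34 GB on the slide, so a sparse representation was presumably
# used (an inference, not something the slide states).
dense_bytes = entries * 8


def consensus(predictions_per_model):
    """Average predictions across models, compound by compound.

    predictions_per_model: one list of predicted values per model,
    all in the same compound order.
    """
    return [fmean(column) for column in zip(*predictions_per_model)]


# Toy example: three hypothetical models predicting melting points (°C)
# for two compounds.
toy_predictions = [
    [100.0, 50.0],
    [110.0, 40.0],
    [90.0, 60.0],
]
averaged = consensus(toy_predictions)
```

Averaging tends to cancel errors that the individual models do not share, which is one common reason to prefer a consensus over any single model.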
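
Slide 61 says that text spectra can be found and indexed. A minimal sketch of what indexing might involve is shown below, using the 13C spectrum from that slide: pull the nucleus, solvent and frequency from the header, then collect the shift values while ignoring the parenthesised assignments. The regular expressions and the `parse_text_spectrum` name are assumptions for illustration, not the method used to build the patent data set.

```python
import re

# Text spectrum quoted on slide 61.
SPECTRUM = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, "
    "120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), "
    "99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)"
)

HEADER_RE = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?) NMR \((?P<solvent>[^,]+),\s*"
    r"(?P<mhz>\d+(?:\.\d+)?)\s*MHz\)"
)


def parse_text_spectrum(text):
    """Return nucleus, solvent, frequency and shift list from a text spectrum."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("no NMR header found")
    # Everything after the delta sign is treated as the peak list.
    _, _, body = text.partition("δ")
    # Remove parenthesised assignments so they cannot contribute numbers.
    body = re.sub(r"\([^()]*\)", "", body)
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": header.group("nucleus"),
        "solvent": header.group("solvent"),
        "mhz": float(header.group("mhz")),
        "shifts": shifts,
    }


parsed = parse_text_spectrum(SPECTRUM)
```

This simple approach suits 13C lists like the one above. A 1H spectrum such as the one on slide 62 also carries multiplicities, integrals, coupling constants and ranges like 7.18–7.94, so it would need a richer parser.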
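
The ORCID iD that appears on slides 1, 99 and 102 ends in a check character computed with the ISO 7064 MOD 11-2 scheme, so a mistyped identifier can often be caught before it is used to link a profile. The sketch below recomputes that character. The function names are chosen here for illustration, and the check only confirms that the digits are internally consistent, not that the iD is registered to anyone.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the iD has 16 characters and a matching check character."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15]


# The iD quoted in the slides, and the same iD with its last digit altered.
slides_orcid_ok = is_valid_orcid("0000-0002-2668-4821")
altered_orcid_ok = is_valid_orcid("0000-0002-2668-4822")
```

A check like this is cheap to run wherever an ORCID iD is typed by hand, for example in a manuscript submission form or a profile page.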
