
Open Notebooks Science


A keynote talk I gave at OSU Research Week on the importance of Open Science, especially Open Notebook Science, illustrated by practical examples. Talk inspired by Jean-Claude Bradley. Slides inspired by Cameron Neylon.

Published in: Science, Technology, Education

  1. Andrew Lang, Professor of Mathematics, Oral Roberts University. February 17, 2014. OSU Research Week.
  2. -Cameron Neylon
  3. Eight committees investigated the allegations and published reports, finding no evidence of fraud or scientific misconduct. However, the reports* called on the scientists to avoid any such allegations in the future by taking steps to regain public confidence in their work, for example by opening up access to their supporting data, processing methods and software, and by promptly honouring freedom of information requests. * Archana Venkatraman, "Data Without the Doubts", Information World Review.
  4. Andrew Wakefield’s study linked the measles, mumps and rubella vaccine to autism. Vaccination rates in the developed world plummeted after the study’s publication and a heated anti-vaccination movement persists today.
  5. http://www.cfr.org/interactives/GH_Vaccine_Map/#map
  6. ?
  7. Science has lost its way, at a big cost to humanity. Researchers are rewarded for splashy findings, not for double-checking accuracy. So many scientists looking for cures to diseases have been building on ideas that aren't even true. A few years ago, scientists at the Thousand Oaks biotech firm Amgen set out to double-check the results of 53 landmark papers in their fields of cancer research and blood biology. The idea was to make sure that research on which Amgen was spending millions of development dollars still held up. They figured that a few of the studies would fail the test – that the original results couldn't be reproduced because the findings were especially novel or described fresh therapeutic approaches. But what they found was startling: Of the 53 landmark papers, only six could be proved valid. (http://www.latimes.com/business/la-fi-hiltzik-20131027,0,1228881.column#axzz2ix1w9zGf)
  8. A special challenge for science writers covering research today arises from science’s growing credibility problem. It stems from the cumulative effect of errors and exaggerations that has fueled a recent rise in retractions, misconduct, and fraud among peer-reviewed researchers. For reporters covering major scientific developments – from the search for alien life and genomics, to particle physics, climate change and cancer – it can be difficult to distinguish error from fraud, sloppiness from deception, eagerness from greed or, increasingly, scientific conviction from partisan passion. Findings in fields from climate change to vaccines can also be deceptively cherry-picked in service of a political cause.
  9. trust evidence
  10. trust documentation
  11. trust confidence
  12. trust reproducibility
  13. Anything produced is released under a CC0 license: Open Data, Open Access, Open Source.
  14. Faster Science failed experiments discoverable unexpected collaborations real-time data and results
  19. no insider information, reusability, reproducibility, transparency
  24. Open Drug Discovery for Neglected Diseases: malaria, schistosomiasis, gram positive bacteria, breast cancer
  25. Drugs for neglected diseases need to be…
  26. cheap and…
  27. easy to make.
  28. The big picture: docking, combinatorial library, synthesis, solvent selection, recrystallization, biological assay, solubility models, solubility data, melting point models, melting point data
  29. Let’s focus: docking, combinatorial library, synthesis, solvent selection, recrystallization, biological assay, solubility models, solubility data, melting point models, melting point data
  30. Early models, before 2005, were…
  31. …specialized: 1979 Martin – disubstituted benzenes; 1987 Hanson – normal alkanes; 1988 Needham – normal and branched alkanes; 1990 Abramowitz – non-hydrogen bonded benzenes; 1991 Dearden – anilines; 1993 Katritzky – aldehydes, amines, and ketones; 1994 Simamora – rigid aromatic; 1996 Charlton – alkanes; 1996 Katritzky – pyridines; 1999 Zhao – aliphatic; 2001 Chickos – homologous series; 2003 Bergstrom – druglike (N = 277, r² = 0.54)
  32. In 2005… …everything changed
  33. MDPI - cheminformatics.org: Karthikeyan 2005, N = 4173, r² = 0.65
  34. PHYSPROP: Clark 2005, N = 6257, r² = 0.61
  35. Recent melting point models use these datasets… …never reproducing r² = 0.65 (0.47 – 0.56)
  36. Even though [a] melting point can be measured accurately, its prediction has been a notoriously difficult problem.
  37. We began measuring, collecting, and curating melting points in the fall of 2010
  38. Jean-Claude Bradley’s Chemical Information Retrieval course at Drexel: 567 curated and referenced measurements from the Fall 2010 Chemical Information Retrieval course
  39. Most popular data sources… …chemical vendors
  40. Alfa Aesar donates ~13,000 melting points to the public domain
  41. ONS melting point workflow: collection, curation, modeling, validation, measurement
  42. Collection: Open Data

      source             data points   curated values   source year   data type
      Bell                      2483             1631          1995   donated-CC0
      Bergstrom                  277              277          2003   open
      MDPI-Karthikeyan          4450             4084          2005   open
      Hughes                     287              262          2008   open
      Oxford-MSDS               3217             1481          2010   open
      Drugbank                   875              875          2011   open
      Griffiths                 3757              278          2011   donated-CC0
      Alfa Aesar               12986             8739          2011   donated-CC0
      PHYSPROP                 11645             9694          2011   donated-CC0
      ONS                        471              471          2012   open

      27792 curated measurements for 19515 compounds
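The totals on the collection slide can be checked with a few lines of arithmetic. This is only a sketch: the per-source counts are copied from the slide, while the `rows` layout and the "fraction retained" figure are illustrative additions, not something the talk reports.

```python
# Per-source counts copied from the slide: (source, data points, curated values).
rows = [
    ("Bell", 2483, 1631),
    ("Bergstrom", 277, 277),
    ("MDPI-Karthikeyan", 4450, 4084),
    ("Hughes", 287, 262),
    ("Oxford-MSDS", 3217, 1481),
    ("Drugbank", 875, 875),
    ("Griffiths", 3757, 278),
    ("Alfa Aesar", 12986, 8739),
    ("PHYSPROP", 11645, 9694),
    ("ONS", 471, 471),
]

total_points = sum(points for _, points, _ in rows)
total_curated = sum(curated for _, _, curated in rows)

# Fraction of collected data points that survived curation
# (an illustrative summary, not a figure from the talk).
fraction_retained = total_curated / total_points

print(total_points, total_curated, round(fraction_retained, 3))
```

The curated column sums to the 27792 measurements quoted on the slide, which is a quick consistency check on the reconstructed table.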
  43. Curation is… …lots of hard, tedious work (Jean-Claude Bradley and Antony Williams). Antony Williams – RSC ChemSpider
  44. Inconsistencies and SMILES problems within the “high trust level” MDPI dataset
  45. PHYSPROP structure errors (incorrect valence): 2315 out of 43543 contained pentavalent nitrogens
  46. PHYSPROP Errors: Structure displayed is for the neutral compound dopamine but the associated CAS Number and chemical name in the file are for the hydrobromide salt.
  47. Unit errors: Kelvin/Celsius, Fahrenheit/Celsius; bad SMILES (non-rendering, hypervalency); salts associated with SMILES for the free base; boiling point used for melting point
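One way the unit errors listed above might be caught automatically is to compare two reported values for the same compound and test whether their difference matches a known conversion. The sketch below is an assumption about how such a check could look, not the curation procedure used in the talk; the function names and the 1-degree tolerance are hypothetical.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def classify_discrepancy(mp_a, mp_b, tol=1.0):
    """Compare two melting points that are both claimed to be in Celsius.

    Returns a label suggesting why they might disagree. The tolerance `tol`
    (in degrees) is an illustrative choice, not a value from the talk.
    """
    low, high = sorted((mp_a, mp_b))
    if high - low <= tol:
        return "agree"
    # A Kelvin value entered as Celsius is offset by 273.15.
    if abs((high - low) - 273.15) <= tol:
        return "possible Kelvin/Celsius mix-up"
    # A Fahrenheit value entered as Celsius; check both directions because
    # Fahrenheit is the smaller number below -40 degrees.
    if (abs(high - celsius_to_fahrenheit(low)) <= tol
            or abs(low - celsius_to_fahrenheit(high)) <= tol):
        return "possible Fahrenheit/Celsius mix-up"
    return "unresolved"


print(classify_discrepancy(80.0, 353.15))
print(classify_discrepancy(100.0, 212.0))
```

A flag from a check like this only marks a record for human review; it does not say which of the two values is the correct one.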
  48. Some melting points can’t be resolved from the literature alone: 4-benzyltoluene
  49. Open lab notebook page measuring the melting point of 4-benzyltoluene
  50. Melting Point Model: CDK (descriptor calculator), R (statistical computing), melting point data
  51. use this model
  52. compounds (doubleplusgood, single); CDK (descriptor calculator); R (statistical computing); Melting Point Model
  53. Comparison of model with double+ validated measurements: straight chain carboxylic acids from 1 to 10 carbons; straight chain alcohols from 1 to 10 carbons
  54. Cyclic primary amines from 3 to 6 carbons: cyclobutylamine flagged for measurement, only single source available
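The split between "double+ validated" compounds and compounds with only a single source can be sketched as a small triage step. The slides do not define the validation criterion, so the rule used here (at least two measurements whose spread is within `max_spread` degrees) and the 5-degree default are assumptions, and the compound names and values are made up for illustration.

```python
from statistics import mean


def triage(measurements, max_spread=5.0):
    """Split compounds into validated values and names flagged for measurement.

    `measurements` is an iterable of (compound, melting_point_celsius) pairs.
    A compound counts as validated here if it has two or more measurements
    that agree within `max_spread` degrees (an assumed criterion).
    """
    by_compound = {}
    for name, value in measurements:
        by_compound.setdefault(name, []).append(value)

    validated, flagged = {}, []
    for name, values in by_compound.items():
        if len(values) >= 2 and max(values) - min(values) <= max_spread:
            validated[name] = mean(values)
        else:
            # Single source, or sources that disagree: needs a new measurement.
            flagged.append(name)
    return validated, flagged


# Hypothetical records, not data from the talk.
records = [
    ("compound A", 101.0), ("compound A", 102.0),
    ("compound B", 55.0),
    ("compound C", 20.0), ("compound C", 48.0),
]
validated, flagged = triage(records)
print(validated, flagged)
```

Under this rule a compound like cyclobutylamine, with only one source available, would land in the flagged list.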
  55. Publication of double+ validated melting point dataset …as a preprint
  56. Publication of double+ validated melting point dataset …as a book
  57. Data and model deployed… …on the web (web service)
  58. …in Google spreadsheets
  59. …as an app
  60. • Can the solvents used to recrystallize compounds in organic teaching labs be improved? • Trans-dibenzalacetone • Aldol condensation between two molecules of benzaldehyde and one molecule of acetone [Matthew McBride: Undergraduate Research Assistant - Drexel]
  61. • First recrystallized in ethyl acetate in 1906: Straus and Ecker, Ber. 39, 2988 (1906) • Recrystallized in ethyl acetate in Organic Syntheses
  62. • Recommended recrystallization solvent: ethyl acetate. (http://classes.kvcc.edu/chm230/mixed%20aldol%20condensation.pdf) (http://www.xula.edu/chemistry/documents/orgleclab/Aldol_notes.pdf)
  63. Enter compound identification and desired parameters
  64. How does it work?
      1. Look up the solvent boiling point
      2. Look up the room temperature solubility or predict it via measured or predicted Abraham descriptors
      3. Look up the solute melting point or predict it via a model
      4. Use the melting point and the solubility at room temperature to predict the solubility at boiling
      5. Calculate the predicted recrystallization yield
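Steps 4 and 5 above can be sketched in code, with heavy caveats. The slides do not state the temperature-dependence model, so this example assumes that ln(solubility) is linear in 1/T between two anchor points: the room-temperature solubility and a hypothetical limiting solubility `s_melt` at the solute's melting point. The yield formula (fraction of dissolved solute that comes back out on cooling) is likewise an assumption, and all names and numbers are illustrative.

```python
import math


def solubility_at(temp_k, room_k, s_room, melt_k, s_melt):
    """Interpolate ln(S) linearly in 1/T through (room_k, s_room) and (melt_k, s_melt).

    This two-point van 't Hoff style line is an assumed model, not necessarily
    the one used by the tool in the talk.
    """
    slope = (math.log(s_melt) - math.log(s_room)) / (1.0 / melt_k - 1.0 / room_k)
    return math.exp(math.log(s_room) + slope * (1.0 / temp_k - 1.0 / room_k))


def predicted_yield(s_cold, s_hot):
    """Fraction of solute recovered when a solution saturated hot is cooled."""
    if s_hot <= 0:
        raise ValueError("hot solubility must be positive")
    return max(0.0, 1.0 - s_cold / s_hot)


def rank_solvents(solvents, room_k, melt_k, s_melt):
    """Rank (name, boiling_point_k, s_room) tuples by predicted yield, best first."""
    ranked = []
    for name, boil_k, s_room in solvents:
        s_hot = solubility_at(boil_k, room_k, s_room, melt_k, s_melt)
        ranked.append((name, predicted_yield(s_room, s_hot)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical solvents and solute properties, for illustration only.
candidates = [("solvent_x", 350.0, 0.1), ("solvent_y", 350.0, 0.6)]
for name, fraction in rank_solvents(candidates, room_k=298.15, melt_k=385.0, s_melt=5.0):
    print(f"{name}: {fraction:.0%}")
```

Under this model, a solvent in which the solute is poorly soluble at room temperature but much more soluble near boiling gives the higher predicted yield, which matches the intuition behind choosing a recrystallization solvent.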
  65. Lists solvents and their predicted recrystallization yield. Predictions are generated from the temperature-dependent solubility curves.
  66. • ethyl acetate (predicted yield of 72%) vs ethanol (predicted yield of 93%) [Plot: solubility curves for ethyl acetate and ethanol, with labeled values 0.09 M, 1.1 M, 0.62 M, 2.06 M]
  67. Dibenzalacetone derivatives docking against tubulin (paclitaxel site)
  68. • Derivatives of dibenzalacetone may be synthesized by altering the aldehyde used • From a library of derivatives, the following compound was the top hit for the docking site of Taxol • Uses phenanthrene-9-carboxaldehyde
  69. • Perform a Reaxys search to determine availability of synthesis procedures • No results [Matthew McBride: Undergraduate Research Assistant - Drexel]
  70. • Used methanol and benzene • Melting Point: 264-265°C (http://usefulchem.wikispaces.com/EXP286) [Matthew McBride: Undergraduate Research Assistant - Drexel]
  71. trust reproducibility open notebook science
  72. Acknowledgements: Jean-Claude Bradley (Drexel); Cameron Neylon (Advocacy Director at PLOS); Antony Williams (RSC ChemSpider); Drexel research assistants: Evan Curtin and Matthew McBride; ORU research assistants: David Bulger, Daryl Charron, Lizzie Clark, Lacey Condron, Samantha Gaines, Alejandro Hernandez, Maria Hernandez, Jesse Patsolic, and Matthew Wilson
