Next Generation
Scientific Publishing:
Challenges and Directions
European Bioinformatics Institute
21 June 2013
Tim Clark
Massachusetts General Hospital
MassGeneral Institute of Neurodegenerative Disease
Harvard Medical School
© 2013 Massachusetts General Hospital
Contents
• Historical background
• What is a scientific article?
• Some problems in scientific communication
• Next generation scientific publishing (NGSP)
• Taking NGSP forward
• Conclusion
Historical background
Linear document format: 1665 to 2012
Origins of linear format
• Linear format originated pre-1665 with
personal correspondence amongst
experimentalists & mathematicians.
• The 1665 scientific paper format was carried over
to the Web and to PDFs.
• Lives in a complex ecosystem
• Incomplete Web exploitation & transition
• Tension between linear & object formats
“Invisible Colleges”
• Circle at Oxford, 1640-59
• Circle at Gresham College, London, 1645-60
• Royal Society, 1660-present
Scientific journals
Royal Society 1660-present
Académie des Sciences 1666-present
Jan 1665
Mar 1665
Then and now
• Print culture: printing (c. 1450), General Post Office (1660),
scientific journal (1665)
• Web culture: IBM S/360 (1964), Internet (1980s), Web (1991)
[Figure: information technology, the Internet and the Web]
Incomplete transition to Web
• Scientific article information model is
limited, because it is mostly narrative.
• Critical information should ideally be
computationally extractable and re-mixable.
• Yet as humans we require narratives.
• We need narratives + computable objects.
What is a scientific
article?
Definition: A scientific article is a defeasible argument
for assertions, based on a detailed narrative of
observations, which are reproducible in principle,
supported by exhibited data and supporting methods,
and contextualized with other relevant findings in the
domain. It exists in a complex ecosystem of
technologies, people and activities.
Defeasible argument
• May be challenged and proven wrong.
• May be “true” today but not tomorrow.
• Inference to best explanation (IBE),
abductive reasoning (Peirce), etc.
• Defeasible reasoning is a big topic in AI.
Exhibited data...
Philos Trans R Soc Lond 1(4):56
Brain. 2010 Nov;133
(Pt 11) 3336-3348.
(at least, enough to be convincing!)
...and reproducible
methods
Boyle’s air pump, from
New Experiments (1660)
Illumina NGS system
Scientific communications
ecosystem
Interlocking systems of activity
Some problems in
scientific
communication c. 2013
Some problems in the ecosystem
• Intractable publication volumes [1]
• Invalid, distorted and copied citations [2,3,4]
• Growing volume of retractions [5,6]
• 2/3 of retractions due to misconduct [7]
• Research non-reproducibility [8]
• Lack of transparency in publication process [9]
• Methods non-re-usability [10]
• Flawed assessment metrics [11-12]
Non-reproducibility
11%
Begley CG and Ellis LM, Nature 2012, 483(7391):531-533
Citation distortion
adapted from supporting data, Greenberg SA, British Medical Journal 2009, 339:b2680
The copied citation
• Citation analysis of one sample of publications (in
ethnobotany) found that “the majority of citing
texts do not consider the theoretical
contributions made by the articles cited”.
• I.e., author of Work A makes statement, cites Work
B, and then copies several references, unread, from
Work B as well, assuming they are relevant too.
• Ramos et al. Scientometrics 2012, 92(3):711-719
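The copying pattern described above (Work A cites Work B and then lifts further references from B's list) suggests a simple screening signal: how much of A's reference list also appears in B's. The sketch below is only an illustration of that idea, not the method used by Ramos et al.; the function name and the sample reference keys are hypothetical, and a high overlap is a prompt for a closer look, not evidence of unread citations.

```python
def shared_reference_fraction(refs_a, refs_b):
    """Fraction of Work A's references that also appear in Work B's reference list."""
    a, b = set(refs_a), set(refs_b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical reference lists, keyed by arbitrary identifiers.
work_a_refs = ["work_b", "r1", "r2", "r3", "x"]
work_b_refs = ["r1", "r2", "r3", "r4"]

overlap = shared_reference_fraction(work_a_refs, work_b_refs)
print(f"{overlap:.0%} of Work A's references also appear in Work B")
```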
Not to mention...
• Closed access publishing model
  • Walled garden systems,
  • Text mining & remixing prohibitions, and
  • Insane rising costs imposed on libraries.
• Open access publishing model
  • Researcher cost burden unaccounted for
  by funding agencies.
Some efforts at coping
• Mandatory open access (US, UK, Universities)
• Data access: archiving and citation, institutional data
policies, “data papers”, etc. (various)
• Methods: cataloging & annotation (NIF, publishers)
• Open annotation (W3C Community) & tools
• Velocity: Alzforum, StemBook, Open Wetware, blogs,
webinars, Wikipedia coordination, etc.
• Velocity: preprint servers (ArXiv, DASH, PMC, etc.)
• Advocacy groups: FORCE11, DELSA, DORA, Amsterdam
Manifesto, etc.
Next Generation
Scientific Publishing
What does NextGen Scientific
Publishing look like?
• There is transparency of all data & methods.
• Big data + small data (the very long tail).
• Articles are deconstructable * text-minable *
remixable * computable.
• Information moves quickly and is verifiable.
• Open annotation for narrative + objects.
• There are no walled gardens: a service-oriented
open-access economy.
Data re-usability
• The main reason to exhibit data is not necessarily
to reuse it...it is (minimally) to prove that
1. you have it and are willing to show it,
2. it is reasonable to think that you derived it as you
say you did, and you openly share these methods.
• Data that is re-usable is special:
• Re-usable data is itself a research method with its
own special requirements.
• See: Data Papers.
Data papers
• Data should be surfaced in a re-usable way.
• Incentivize the extra effort required.
• Concept being developed by a few publishers
with differing implementation ideas.
• Questions: what is reusability? at what level?
Our Data Papers requirements
• Only inherently reusable data is published
as a Data Paper
• Normalize identifiers
• Reverse normal “ratio” of text:data
• Amsterdam data citation principles
• All data is searchable w/ or w/o the paper
• Global metadata catalog in stable archive
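As one illustration of "Normalize identifiers": the same DOI is often written as a bare string, as an info:doi/ URI or as a resolver URL, and a catalog can only match records if these collapse to one form. The sketch below assumes that a bare, lower-cased DOI is the chosen canonical form; the function name, the prefix list and the validation pattern are hypothetical and not part of any Data Papers specification.

```python
import re

# Hypothetical list of prefixes under which a DOI is commonly written.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "info:doi/",
    "doi:",
)

# Loose shape check: "10.", a registrant code, a slash, then a suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw: str) -> str:
    """Return a bare, lower-cased DOI, or raise ValueError if it is not one."""
    value = raw.strip()
    lowered = value.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    value = value.lower()  # DOIs are case-insensitive
    if not _DOI_PATTERN.match(value):
        raise ValueError(f"not a DOI: {raw!r}")
    return value


print(normalize_doi("info:doi/10.1038/nature08221"))
print(normalize_doi("http://dx.doi.org/10.1371/JOURNAL.PONE.0009979"))
```

Other identifier families (strain stock numbers, reagent catalog numbers, and so on) would each need their own rule; the point is only that the canonical form is decided once and applied everywhere.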
Methods re-usability
• Open methods are the basis of science.
• “Standing on the shoulders of giants” =
• reusing maths, software, instruments,
reagents, models, protocols, etc.
• But method citations can be very obscure;
• you cannot reuse a secret.
• See: alchemy, necromancy, divination.
Computational semantics
• Entity-extraction: NIF, Utopia, etc.
• Topic-based: Threads
• Statement-based: SWAN, nanopublications
• Argument-based: micropublications
Open annotation
• Open model
• Annotate any web document
• Transferable, selectively sharable
• Highlights, comments, semantics, video
• Entities, topics, statements, arguments
• W3C Open Annotation Community
• http://www.w3.org/community/openannotation/
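To make "annotate any web document" concrete, here is a minimal sketch of a single comment attached to a quoted passage. The property names (oa:hasBody, oa:hasTarget, oa:SpecificResource, oa:TextQuoteSelector and so on) follow my reading of the Open Annotation Community Group data model and should be checked against the specification; the example URL, the helper name and the plain-dictionary serialization are assumptions made for illustration.

```python
import json


def make_annotation(target_url: str, quoted_text: str, comment: str) -> dict:
    """Build a comment on a quoted passage of a web document."""
    return {
        "@type": "oa:Annotation",
        "oa:motivatedBy": "oa:commenting",
        # The body is what is said about the target.
        "oa:hasBody": {
            "@type": "cnt:ContentAsText",
            "cnt:chars": comment,
        },
        # The target is a selected part of a document, not the whole page.
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": target_url,
            "oa:hasSelector": {
                "@type": "oa:TextQuoteSelector",
                "oa:exact": quoted_text,
            },
        },
    }


annotation = make_annotation(
    "http://example.org/article.html",  # hypothetical document
    "Rapamycin-fed transgenic PDAPP mice showed improved learning",
    "Key result; see Figure 1a.",
)
print(json.dumps(annotation, indent=2))
```

Because the annotation lives outside the annotated document, it can be stored, shared or withheld independently of the publisher's site.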
Open annotation model
Complex annotation
Discussion as annotation
Annotation tools
Creating digital abstracts
in Domeo
Digital article summary
{
  :MP3 rdf:type mp:Micropublication ;
    mp:name "MP(a3)" ;
    mp:description "Digital summary of Spillman et al. 2010" ;
    pav:authoredBy [ a foaf:Person ; foaf:name "Tim Clark" ] ;
    pav:createdBy [ a foaf:Person ; foaf:name "Tim Clark" ] ;
    pav:createdOn "2013-03-06T09:49:12-05:00"^^xsd:dateTime ;
    mp:argues :C3 ;
    mp:supportedBy <info:doi/10.1371/journal.pone.0009979> .
} .
:MP3 = {
  :S1 rdf:type mp:Statement ;
    mp:hasContent "Rapamycin [is] an inhibitor of the mTOR pathway." ;
    mp:supportedBy <info:doi/10.1038/nature08221> .
  :S2 rdf:type mp:Statement ;
    mp:hasContent "PDAPP mice accumulate soluble and deposited Aβ and develop AD-like synaptic deficits as well as cognitive impairment and hippocampal atrophy." ;
    mp:supportedBy <info:doi/10.1073/pnas.96.6.3228> .
  :S3 rdf:type mp:Statement ;
    mp:hasContent "Rapamycin-fed transgenic PDAPP mice showed improved learning (Figure 1a) and memory (Figure 1b). We observed significant deficits in learning and memory in control-fed transgenic PDAPP animals." ;
    mp:supportedBy <http://www.jneurosci.org/content/20/11/4050> .
  :M1 rdf:type mp:Procedure ;
    mp:hasName "Rapamycin-supplemented mouse diet protocol" ;
    mp:hasContent "We fed a rapamycin-supplemented diet... or control chow to groups of PDAPP mice and littermate non-transgenic controls for 13 weeks. At the end of treatment (7 mo), learning and memory were tested using the Morris water maze." .
  :M2 rdf:type mp:Material ;
    mp:hasName "PDAPP J20" ;
    mp:hasDescription "Lennart Mucke's PDAPP J20 transgenic mice, as obtained from JAX, stock#006293" ;
    mp:describedBy <http://jaxmice.jax.org/strain/006293.html> .
  :D1 rdf:type mp:Data ;
    pav:retrievedFrom <http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0009979#pone-0009979-g001> ;
    mp:supportedBy :M1, :M2 .
  :C3 rdf:type mp:Claim ;
    mp:hasContent "Inhibition of mTOR by rapamycin can slow or block AD progression in a transgenic mouse model of the disease." ;
    mp:supportedBy :S1, :S2, :S3, :D1 .
} .
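The summary above is a small support graph: claim C3 rests on statements S1 to S3 and on data D1, and D1 in turn rests on method M1 and material M2. A minimal sketch of how a reader's tool might walk that graph from the claim down to its evidence is given below. The dictionary is hand-copied from the mp:supportedBy relations above, not parsed from the RDF, and the function name is hypothetical.

```python
# Support relations hand-copied from the summary above (mp:supportedBy).
SUPPORTED_BY = {
    "C3": ["S1", "S2", "S3", "D1"],
    "S1": ["info:doi/10.1038/nature08221"],
    "S2": ["info:doi/10.1073/pnas.96.6.3228"],
    "S3": ["http://www.jneurosci.org/content/20/11/4050"],
    "D1": ["M1", "M2"],
}


def support_chain(node, graph=SUPPORTED_BY):
    """Return every element that directly or indirectly supports `node`,
    in depth-first order, visiting each element once."""
    seen = []
    stack = list(reversed(graph.get(node, [])))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        # Push this element's own supports so they are visited next.
        stack.extend(reversed(graph.get(current, [])))
    return seen


for element in support_chain("C3"):
    print(element)
```

Elements with no entry in the dictionary (cited articles, the method and the material) are the leaves of the argument: the points at which a reader has to go and check the source.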
Mixing nano, micro, entities, topics
Navigable citation networks
Figure from Greenberg SA, British Medical Journal 2009, 339:b2680
Taking NGSP forward
The Future of Research
Communications and
eScholarship
• Open community of scholars, librarians, archivists,
publishers and research funders.
• Goal is to facilitate more rapid change &
improvement in scholarly communications through
effective use of information technologies.
• Founded 2011 at a workshop held at Leibniz
Zentrum für Informatik, Schloss Dagstuhl, DE.
• Check it out & join online at http://force11.org
Summary
• Incomplete transition of scientific
publishing to the Web
• Big problems with the current system
• NextGen Scientific Publishing will be:
• open, transparent, remixable, fast
• and we will annotate it on the Web.
Acknowledgements
• Lab: Paolo Ciccarese, Stephane Corlosquet, Sudeshna Das, Patti
Davis, Emily Merrill, Marco Ocana
• Collaborators: Brad Allen, Neil Andrews, Anita Bandrowski, Phil
Bourne, Suzanne Brewerton, Monika Byrne, Merce Crosas, Anita
De Waard, Lisa Girard, Carole Goble, Tudor Grosza, Paul Groth,
Keith Gutfreund, Hamed Hassanzadeh, Ivan Herman, Brad
Hyman, Adrian Ivinson, Derek Marren, Maryann Martone, Pat
McCaffery, Steve Pettifer, Brock Reeve, Rob Sanderson, Holly
Schmidt, Herbert Van de Sompel and Thomas Wilkin; and our
colleagues at the Mass. Alzheimer Disease Research Center
• Funding: Eli Lilly, Elsevier, Harvard Neuro Discovery Center,
Harvard Stem Cell Institute, EMD Serono, NIH (NIA, NIDA), and
two anonymous foundations.
• Very special thanks to: Carole Goble & Brad Hyman
References
1. Hunter L, Cohen KB: Biomedical language processing: what's beyond PubMed? Molecular Cell
2006, 21(5):589-594.
2. Greenberg SA: How citation distortions create unfounded authority: analysis of a citation
network. British Medical Journal 2009, 339:b2680.
3. Greenberg SA: Understanding belief using citation networks. Journal of Evaluation in Clinical Practice
2011, 17(2):389-393.
4. Ramos M, Melo J, Albuquerque U: Citation behavior in popular scientific papers: what is
behind obscure citations? The case of ethnobotany. Scientometrics 2012, 92(3):711-719.
5. Lawless J: The bad science scandal: how fact-fabrication is damaging UK's global name for
research. In: The Independent. 2013.
6. Van Noorden R: Science publishing: The trouble with retractions. Nature 2011, 478:26-28.
7. Fang FC, et al: Misconduct accounts for the majority of retracted scientific publications.
Proceedings of the National Academy of Sciences 2012, 109(42):17028-17033.
8. Begley CG, Ellis LM: Drug development: Raise standards for preclinical cancer research. Nature
2012, 483(7391):531-533.
9. Marcus A, Oransky I: Bring On the Transparency Index. In: The Scientist. Midland, Ontario, CA: LabX
Media Group; 2012.
10. Bandrowski AE, et al: A hybrid human and machine resource curation pipeline for the
Neuroscience Information Framework. Database 2012: bas005.
11. Schekman R, Patterson M: Reforming research assessment. eLife 2013, 2:e00855.
12. Alberts B: Impact Factor Distortions. Science 2013, 340(6134):787.


Editor's Notes

  • #5 Most of us have seen this kind of slide. The scientific document began with the Philosophical Transactions and has continued in a linear document format through the transition to Web publishing. Let’s look at a little more of the historical context.
  • #6 The original linear format was personal correspondence between members of what were called “Invisible Colleges”, interlocking groups in the UK in Oxford (based at Wadham College) and London (at Gresham College); and one centered in France, around Mersenne. The Mersenne circle included Fermat, Huygens, Galileo, Pascal and Torricelli among others.