1st LEARN Workshop. Embedding Research Data as part of the research cycle. 29 Jan 2016. Presentation by Geoffrey Boulton, University of Edinburgh & CODATA
6. A crisis of reproducibility and credibility?
Why such low levels of reproducibility?
• Misconduct/fraud
• Invalid reasoning
• Absent or inadequate data and/or metadata
10. Reinventing reproducibility for the digital age
How do we retain an essential principle?
The data providing the evidence for a published
concept MUST be concurrently published, together
with necessary metadata and computer code.
To do otherwise is scientific MALPRACTICE
12. Ozone Levels
Four key drivers of change for science
• Big data
• Semantically-linked data
• Open data
• Cost reduction
Micro-satellite
Looking at clouds
13. Pillars of the Digital Revolution
Big Data
Volume
Velocity
Variety
Linked
Open
Data
Many
databases
Semantic
Relations
Deeper
meaning
Foundations : Openness
Machine analysis & learning Text and data mining
14. The opportunity: data from “simple” to complex systems
from uncoupled to highly coupled behaviour
Uncoupled
systems
Simulating behaviour of
highly coupled systems
15. Simulating system dynamics Mapping a complex state
Image of brain cells in a rat
Emergent behaviour of a specific
6-component coupled system
• patterns not hitherto seen
• unsuspected relationship
• complex systems
e.g. complexity: dynamic evolution and system state
Scientific opportunities
16. Satellite observation Surface monitoring
The opportunity: data-modelling: iterative integration
Initial conditions
Model forecast
Model-data iteration - forecast correction
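A minimal sketch of the model-data iteration named on this slide, assuming a simple scalar "nudging" correction. The toy model, the gain value and the observations are hypothetical and are not taken from the presentation.

```python
def correct(forecast, observation, gain):
    """Forecast correction: move the forecast toward the observation.

    gain is an assumed weight between 0 (trust the model) and 1 (trust the data).
    """
    return forecast + gain * (observation - forecast)


def run_cycle(initial_state, observations, step, gain=0.5):
    """Alternate model forecast and data-driven correction, one observation at a time."""
    state = initial_state
    history = []
    for obs in observations:
        forecast = step(state)                  # model forecast from the current state
        state = correct(forecast, obs, gain)    # model-data iteration: forecast correction
        history.append((forecast, state))
    return history


def toy_step(x):
    """Hypothetical model: the state decays by 10% per step."""
    return 0.9 * x


# Initial conditions plus three made-up observations.
history = run_cycle(10.0, [9.5, 8.0, 7.5], toy_step, gain=0.5)
for forecast, corrected in history:
    print(f"forecast={forecast:.4f} corrected={corrected:.4f}")
```

Each corrected state becomes the initial condition for the next forecast, which is the loop the slide describes. Operational systems use far more elaborate weighting than a fixed gain.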
17. Linear regression
Cluster analysis
Dynamic/complex behaviour
Complex systems
No mathematical pipeline
Simple relationships
Classical statistics
System characterisations: from simple to complex
Glucose in type II diabetes
Topological analysis
18. A barrier to openness? - Analytic overload.
E.g. - Global Earth Observation System of Systems
• What is the human role?
• Can we analyse & scrutinise what is in the
black box? And who owns the box?
• What does it mean to be a researcher in a
data intensive age?
A disconnect between machine
analysis & human cognition?
19. Mathematics related discussions
Tim Gowers
- crowd-sourced mathematics
An unsolved problem posed on
his blog.
32 days – 27 people – 800
substantive contributions
Emerging contributions rapidly
developed or discarded
Problem solved!
“It's like driving a car whilst
normal research is like pushing
it”
What inhibits such processes?
- The criteria for credit and
promotion
– ALTMETRICS THE ANSWER?
New modes of technology-enabled creativity:
e.g. Crowd-sourcing
20. The Open Data Iceberg
The Technical Challenge
The Consent Challenge
The Ecosystem Challenge
The Funding Challenge
The Support Challenge
The Skills Challenge
The Incentives Challenge
The Mindset Challenge
Processes &
Organisation
People
motivation and ethos.
Developed from: Deetjen, U., E. T. Meyer and R. Schroeder (2015).
A National Infrastructure
Technology
21. The “Science International” Accord:
principles of open data
(www.icsu.org/science-international)
Responsibilities
1-2. Scientists
3. Research institutions & universities
4. Publishers
5. Funding agencies
6. Scholarly societies and academies
7. Libraries & repositories
8. Boundaries of openness
Enabling practices
9. Citation and provenance
10. Interoperability
11. Non-restrictive re-use
12. Linkability
22. Responsibilities
Scientists
i. Publicly funded scientists have a responsibility to contribute to the
public good through the creation and communication of new
knowledge, of which associated data are intrinsic parts. They
should make such data openly available to others as soon as
possible after their production in ways that permit them to be re-
used and re-purposed.
ii. The data that provide evidence for published scientific claims
should be made concurrently and publicly available in an
intelligently open form. This should permit the logic of the link
between data and claim to be rigorously scrutinised and the
validity of the data to be tested by replication of experiments or
observations. To the extent possible, data should be deposited in
well-managed and trusted repositories with low access barriers.
23. CODATA
African Open Data/Open Science Platform
Platform Forum
Coordination
Government
Priority setting
Funders
Funding
Incentives
Capacity Building
Training and Skills
Infrastructure
Roadmaps
Flagship
Co-Designed Data
Intensive Projects
International
Standards
Programmes
Shared infrastructure investment; shared good practice; capacity building;
system development
24. EMBL-EBI services
Labs around the world send us their data and we…
Archive it
Classify it
Share it with other data providers
Analyse, add value and integrate it
…provide tools to help researchers use it
A collaborative enterprise
Disciplinary communities can lead the way
e.g. Elixir programme in life sciences/bio-informatics
25. Regional Platforms for Open Science
African Platform?
Asian Platform?
Australian Platform
S. American Platform?
Shared investment in infrastructure; harvesting and circulating good ideas;
spreading and supporting good practice; capacity building; promoting
applications; linking to international programmes and standards.
26. Inputs Outputs
Open access
Administrative data (held by public authorities e.g. prescription data)
Public Sector Research data (e.g. Met Office weather data)
Research Data (e.g. CERN, generated in universities)
Research publications (i.e. papers in journals)
Open data
Open science
“science as a public enterprise”
Collecting the data
Doing research
Doing science openly
Researchers - Govt & Public sector - Businesses - Citizens - Citizen scientists
(communication/dialogue – joint production of knowledge)
Stakeholders
• Communication/dialogue must be audience-sensitive
• Is it – with all stakeholder groups?
27. Open Science
Data / Publications
Researchers
Mono/Multi/Inter/Transdisciplinary
Stakeholders
Rigour / Innovation / Policy / Solutions
Open Knowledge
Institutional
management and support
National policies
& e-infrastructure
Open
Research
Data
Big Data
Analytics
Knowledge
Output
EXPLOITING THE DATA REVOLUTION
Scientific inference
Institutional
management & support
National policies
& e-infrastructure
A national data-intensive system
29. International Research Data Collaboration
CODATA
Policies & practice
Frontiers of data science
Capacity Building
WDS
• Data stewardship
• Data standards
RDA
• Interoperability
30. Why openness & sharing?
1. Maintaining “self-correction”
2. Open knowledge is creative & productive
“If you have an apple and I have an apple and we
exchange these apples, then you and I will still
each have one apple. But if you have an idea and I
have an idea and we exchange these ideas, then
each of us will have two ideas.”
George Bernard Shaw
3. Open data enables semantic linking
31. Citizen Science
• Openly collected science is already helping policy makers.
• AshTag app allows users to submit photos and
locations of sightings to a team who will refer them on
to the Forestry Commission, which is leading efforts to
stop the disease's spread with the Department for
Environment, Food and Rural Affairs (Defra).
Chalara spread: 1992-2012
Editor's Notes
The material advance of human society has been based on the acquisition and use of knowledge, and science, as it has been practised in the last 300 years, has proved to be the most effective way of gaining reliable knowledge. I want to talk about the processes whereby science is done and how they need to adapt to a novel environment in which we are able to acquire, store, manipulate and communicate data of unprecedented volume and complexity. What challenges does this environment offer to the essential processes of science, how can we exploit the opportunities that it offers, and what barriers inhibit necessary changes? (This is not about openness for its own sake, but about open processes in the doing of science.) Open science is not new. It was the bedrock on which the extraordinary scientific revolutions of the 18th and 19th centuries were built. But we do need to reinvent it for a data-rich era. So let us start with a little history.
This is Henry Oldenburg, the first secretary of the newly formed Royal Society in the early 1660s. Henry was an inveterate correspondent with those we would now call scientists, both in Europe and beyond. Rather than keep this correspondence private, he thought it would be a good idea to publish it, and persuaded the new Society to do so by creating the Philosophical Transactions, which remains a top-flight journal to the present day. But he demanded two things of his correspondents: that they should submit in the vernacular and not Latin; and that evidence (data) that supported a concept must be published together with the concept. This permitted others to scrutinize the logic of the concept and the extent to which it was supported by the data, and it permitted replication and re-use. Open publication of concept and evidence is the basis of “scientific self-correction”, which historians of science argue was the crucial building block on which the scientific revolution of the 18th and 19th centuries was built and which remains fundamental to the progress of science. Openness to scrutiny by scientific peers is the most powerful form of peer review.
The fundamental challenge is to scientific self-correction. Journals can no longer contain the data, and neither scientists nor journals have taken the obvious step of making data relevant to a publication concurrently available in an electronic database (for example, last year’s Nature paper revealed that only 11% of results in 50 benchmark papers in pre-clinical oncology were replicable). If lack of Oldenburg’s rigour in presenting evidence is widespread, a failure of replicability risks undermining science as a reliable way of acquiring knowledge and can therefore undermine its credibility.
Lots of interchangeable and fluid terms but many shared principles.
The word “science” is used to mean the systematic organisation of knowledge that can be rationally explained and reliably applied. It is not exclusively restricted to “natural science”.
Human and technical requirements for a sustainable data infrastructure.
Network of world data centres.
Data policies and data science: bringing data experts together with research scientists.
Ash dieback, caused by the fungus Chalara fraxinea, was found in the UK in October outside of plantations and nurseries in East Anglia, raising fears of a repeat of Dutch elm disease, which killed 25 million mature elms in the 1970s and 80s. In an attempt to map and help prevent the spread of the disease across the country, a team of developers and academics worked through the weekend to create an app that smartphone owners can use to report suspected cases of infection. Infected ash trees are recognisable by lesions on their bark, dieback of leaves at the tree's crown, and leaves turning brown, though experts say the arrival of autumn makes the latter harder to spot accurately. The AshTag app for iOS and Android devices allows users to submit photos and locations of sightings to a team who will refer them on to the Forestry Commission, which is leading efforts to stop the disease's spread with the Department for Environment, Food and Rural Affairs (Defra).