Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

A coordinated framework for open data open science in Botswana/Simon Hodson


Published on

Presentation during the Botswana National Forum to discuss a national Open Science Open Data policy, 30-31 October 2017.

Published in: Data & Analytics
  • Be the first to comment

A coordinated framework for open data open science in Botswana/Simon Hodson

  1. 1. A Coordinated Framework for Open Data Open Science in Botswana Simon Hodson, Executive Director, CODATA National Forum: Open Data Open Science University of Botswana Conference Centre Gaborone, Botswana 30 October 2017
  2. 2. CODATA Prospectus: Principles, Policies and Practice Capacity Building Frontiers of Data Science Data Science Journal CODATA 2017, Saint Petersburg 8-13 Oct 2017
  3. 3. INTERNATIONAL DATA WEEK IDW 2018 Gaborone, Botswana: 22–26 October 2018
  4. 4. Digital Frontiers of Global Science Frontier issues for research in a global and digital age. Applications, progress and challenges of data intensive research. Data infrastructure and enabling practices for international and collaborative research. Data, development and innovation: data as an interface between research, industry, government, society and development.
  5. 5. Why Open Science / FAIR Data? • Good scientific practice depends on communicating the evidence. • Open research data are essential for reproducibility, self-correction. • Academic publishing has not kept up with age of digital data. • Danger of an replication / evidence / credibility gap. • Boulton: to fail to communicate the data that supports scientific assertions is malpractice. • Vetterli: publishing results without also publishing the data and the code through which they were created is advertising, not science (Claerbout and Karrenbach, 1992).
  6. 6. 80% of ecology data irretrievable after 20 years (516 studies) Vines TH et al. (2013) Current Biology DOI: 10.1016/j.cub.2013.11.014 Slide credit: Dryad
  7. 7. Why Open Science / FAIR Data? • Open data practices have transformed certain areas of research. • Genomics and related biomedical sciences; crystallography; astronomy; areas of earth systems science; various disciplines using remote sensing data… • Accumulation of results and data can increase the statistical merit of findings (e.g. Cochrane collaboration). • Research data often have considerable potential for reuse, reinterpretation, use in different studies. • Most significant current research questions relating to global challenges require integration of data from many different sources and disciplines. • FAIR data helps use of data at scale, by machines, harnessing technological potential.
  8. 8. Why Open Science / FAIR Data? • Open data foster innovation and accelerate scientific discovery through reuse of data within and outside the academic system. • Research data produced by publicly funded research are a public asset, a public good. • There is a strong interface between many areas of research and application by professionals, practitioners. • A World that Counts: call for a world that knows itself better through Open Data, and uses more diverse data sources to monitor and achieve the SDGs. • Open Data and Agriculture: large number of case studies of interface between research and practitioners. • Building data capacity assists government, enterprise, development and research. • Open Science / FAIR Data accelerates these benefits,
  9. 9. Open Science  What is Open Science:  Open access to research literature.  Data that is as Open as possible, as closed as necessary.  FAIR Data (Findable, Accessible, Interoperable, Reusable).  A shop window and repository of all research outputs.  A culture and methodology of open discussion and enquiry (including methodology, lab notebooks, pre-prints)  Engagement with society and the economy in research activities (citizen science, co-design / transdisciplinary research, interface between research, development and innovation).  Research institutions have an opportunity to build their reputation around research specialisation: and this requires data specialisation and FAIR data collections.  The Open Science ethos and co-design helps build collaboration between research institutions, government agencies and industry.
  10. 10. What are data? When are data? Big Data Biggish Data Small Data Broad Data Monothematic Axis Polythematic/ Interdisciplinary Axis Complex pa erns In nature & society Patterns in space & time Low volumes Batch velocities Structured varieties Petabyte volumes High velocities Multistructured Global byte/year Yo abytes1024 Zettabytes1021 Now Exabytes 1018 Petabytes 1015 Terabytes1012 Gigabytes109 Megabytes106 Kilobytes 103  Research data, public data, private data.  Data produced by major data gathering exercises both for research and ‘monitoring’ / ‘operational data’.  Data produced by research activities (e.g. specific experiments, observations, models, surveys etc.  Data produced by government activity which has research (and other) potential.  Private sector data where there is a public good case or mutual benefit in data sharing.  Data is not the new oil. It is far more valuable because it is renewable.  Big Data poses many challenges and opportunities: how do we manage it and develop the data science to extract information?  Small data poses many challenges, because often they require detailed human-crafted metadata.  Often still our challenge is a lack of data, no data.  Data can be very varied: poses challenges of storage, management, analysis.
  11. 11. Boundaries of Open  For data created with public funds or where there is a strong demonstrable public interest, Open should be the default.  As Open as Possible as Closed as Necessary.  Proportionate exceptions for:  Legitimate commercial interests (sectoral variation)  Privacy (‘safe data’ vs Open data – the anonymisation problem)  Public interest (e.g. endangered species, archaeological sites)  Safety, security and dual use (impacts contentious)  All these boundaries are fuzzy and need to be understood better!  A great deal of data is not affected by these issues and can and should be open.  There is a need to evolve policies, practices and ethics around closed, shared, and open data.
  12. 12. Data policies and scope  Current Best Practice for Research Data Management Policies:  Framework for development of data policies; will be updated with additional resources in next 12 months by CODATA Data Policy Committee.  A number of areas requiring clarification: scope; boundaries of open; appraisal and retention…  Typology of things that are in scope of research data policies:  Data underpinning research conclusions presented the literature must be published.  Significant data produced by research projects should be made available where possible.  Data resulting from major data creation projects for monitoring and research.  Data created by public activities where there are research and development opportunities.  Clarify principles for appraisal and retention: which data should we keep?  All data should be FAIR!
  13. 13. FAIR Data • FAIR Data: increasingly widely adopted as summary of attributes that increase value of research data. • Chair of European Commission Expert Group on FAIR Data ‘Making FAIR Data a Reality’: • One of European Experts in ZA-EU Open Science Dialogue. • Member of EOSC Advisory Board. • The value of data lies in reuse. What are the attributes that make data reusable? • Findable: have sufficiently rich metadata and a unique and persistent identifier. • Accessible: retrievable by humans and machines through a standard protocol; open and free by default; authentication and authorization where necessary. • Interoperable: metadata use a ‘formal, accessible, shared, and broadly applicable language for knowledge representation’. • Reusable: metadata provide rich and accurate information; clear usage license; detailed provenance. • FAIR is augmented by the principle that data should be ‘as open as possible, as closed as necessary’, or ‘open by default’.
  14. 14. The Case for Open Data in a Big Data World • Science International Accord on Open Data in a Big Data World: • Supported by four major international science organisations. • Presents a powerful case that the profound transformations mean that data should be: • Open by default • Intelligently open, FAIR data • Lays out a framework of principles, responsibilities and enabling practices for how the vision of Open Data in a Big Data World can be achieved. • Campaign for endorsements: over 150 organisations so far. • Please consider endorsing the Accord:
  15. 15. The Open Data Iceberg The Technical Challenge The Ecosystem Challenge The Funding Challenge The Support Challenge The Skills Challenge The Incentives Challenge The Mindset Challenge Processes & Organisation People Geoffrey Boulton (CODATA) - developed from an idea by Deetjen, U., E. T. Meyer and R. Schroeder (2015). OECD Digital Economy Papers, No. 246, OECD Publishing. A(n Inter)National Infrastructure Technology
  16. 16. Establish African Open Data Forum / Platform Funded Research Data Infrastructure Initiatives Funded, co-designed transdisciplinary research projects Co-design African Open Data Policies Develop Incentives Frameworks Develop Research Data Science Training African Research Data Infrastructure Roadmap Activities require low funding for coordination, secondment, contributions in kind and evaluation. Activities require higher investment for coordination, co-design implemenatation and evaluation. African Open Science Platform Pilot Project Workpackages
  17. 17. Framework for National and Institutional Data Strategies  National / Institutional Open Science and FAIR Data Strategy  Consultative forum, stakeholder engagement.  Open data policies and guidance at national and institutional level.  Clarify the boundaries of open (particularly privacy, IPR).  Develop incentives and reward systems.  Mechanisms (infrastructure and policy) to ensure concurrent publication of data as research output.  Data ‘publication’ and citations of data included in assessment of research contribution.
  18. 18. Framework for National and Institutional Data Strategies  Promotion of data skills:  Essential data skills for researchers.  Develop skills and competencies for data stewards, data scientists.  Scope, roadmap and implement data infrastructure.  Key components of national and regional infrastructure (network / NREN, economies of scale for storage and compute).  Development of institutional infrastructure for research collaboration and data stewardship/RDM.  Collaborative infrastructures for certain research disciplines, nationally, regionally to pool expertise and lower costs.  International infrastructure / data ecosystem components: permanent identifiers, metadata standards, trusted digital repositories.
  19. 19. Data is difficult: benefits and challenges  Open and FAIR data is essential for transparency and reproducibility; to take advantage of analysis at scale; to tackle major interdisciplinary challenges that require integration of data from many resources; has significant economic and other societal benefits, including encouraging partnerships between research, government, innovation and development.  But…  Research funders and research performing institutions will have to invest in data infrastructure.  Essential to consider the cost of data stewardship and dissemination as part of the total cost of doing research.  Data description, definitions and ontologies, data management require significant effort.  Requires new data skills…  Requires a change in culture, new processes and activities…
  20. 20. Open Science and FAIR Data: Benefits for Stakeholders  Government and Innovation / Development  Increased impact from investment in activities relating to data; economic, innovation and research benefits.  Partnerships for research, development and innovation around co-design, Open Science and FAIR data.  Research Institutions:  Development of data capacity and data skills;  Not losing valuable data (stored on hard drives, not annotated or reusable);  Shop window of research activities and expertise (Open Access, Open Data / FAIR Data)  Capacity to build research schools around data assets and skills, attract international collaboration and investment.  Build case for ‘data sovereignty’, data (re-)patriation.  Researchers:  Increased data skills, expertise in FAIR data builds competitive edge.  Citation advantage of Open Access / Open Data.  Culture of certain research disciplines is already strongly in favour of Open Data / Open Science.
  21. 21. Göttingen-CODATA Symposium 18-20 March 2018  The critical role of university RDM infrastructure in transforming data to knowledge   An opportunity to share experiences, research and insights in the development implementation of RDM services in research institutions.  Special collection of Data Science Journal.  Themes: services and solutions; strategy; measuring success; skills and support; sustainability; shared services and outsourcing / consortiums; service level, trust and FAIR; champions and engaging with researchers.  Announcement of call for papers in the next week or so.  Deadline for abstracts, 15 December.
  22. 22. DataTrieste Film on Vimeo
  23. 23. INTERNATIONAL DATA WEEK IDW 2018 Gaborone, Botswana: 22–26 October 2018
  24. 24. Simon Hodson Executive Director CODATA Email: Twitter: @simonhodson99 Tel (Office): +33 1 45 25 04 96 | Tel (Cell): +33 6 86 30 42 59 CODATA (ICSU Committee on Data for Science and Technology), 5 rue Auguste Vacquerie, 75016 Paris, Thank you for your attention!
  25. 25. CODATA Strategy: Mobilising the Data Revolution Three priority areas essential to a coordinated international response to the data revolution.  Promoting implementation of open data principles, policies and practices;  Advancing the frontiers of data science and adaptation to scientific research;  Building capacity by improving data skills and the functions of science systems needed to support open data (particularly in LMICs)  A global organisation, NFP, founded by the International Council of Science to address data issues.  Member countries and participation in TGs, WGs and other initiatives comprises all continents.  Executive Committee has wide representation, includes members from Kenya and South Africa.
  26. 26. Role of CODATA National Committees  Join CODATA and form a National Committee.  CODATA membership dues are aligned with GDP.  CODATA National Committees are composed of national stakeholders and data experts.  What are the benefits of having a CODATA National Committee?  Engage: point of contact with CODATA;  Influence: contribute to CODATA strategy;  Coordinate: forum by which national stakeholders may advance data agenda in step with international developments;  Collaborate: propose Task Groups, host or participate in international workshop series, engage with Early Career Data Professionals Group;  Partner: undertake activities with other National Committees, bilaterally or in groups.
  27. 27. The Value of Open Data Sharing  Report by CODATA for GEO, the Group on Earth Observation.  Provides a concise, accessible, high level synthesis of key arguments and evidence of the benefits and value of open data sharing.  Particular, but not exclusive, reference to Earth Observation data.  Benefits in the areas of:  Economic Benefits  Social Welfare Benefits  Research and Innovation Opportunities  Education  Governance  Available at  GEO DSWG is building on this work with further examples: would be valuable to work with this community.
  28. 28. Economic Benefits of Data Sharing: LandSat  2006 Study estimated the loss in case of a data gap as equivalent to US$935 M.  2011 Study estimated benefits of landsat-sourced information for agriculture as US$858 M just for the state of Iowa.  2015 Study estimated worldwide economic benefit of US$2.19 BN.  Estimated benefit in US of US$1.8 BN.  Valuing Geospatial Information: Using the Contingent Valuation Method to Estimate the Economic Benefits of Landsat Satellite Imagery: (Paywall… Irony…)  Open data and open data infrastructure has a significant economic benefit.
  29. 29. The Value and Impact of Data Sharing and Curation: A synthesis of three recent studies of UK research data centres (Neil Beagrie and John Houghton)
  30. 30. Resources: Current Best Practice for Research Data Management Policies  Expert report commissioned by CODATA member:  Provides comprehensive summary of best practice in funder data policies.  Identifies key elements to be addressed: 1. Summary of policy drivers 2. Intelligent openness 3. Limits of openness 4. Definition of research data 5. Define data in scope 6. Criteria for selection 7. Summary of responsibilities 8. Infrastructure and costs 9. DMP requirements 10. Enabling discovery and reuse 11. Recognition and reward 12. Reporting requirements, compliance monitoring
  31. 31. FAIR Guiding Principles (1) • To be Findable: • F1. (meta)data are assigned a globally unique and persistent identifier • F2. data are described with rich metadata (defined by R1 below) • F3. metadata clearly and explicitly include the identifier of the data it describes • F4. (meta)data are registered or indexed in a searchable resource • To be Accessible: • A1. (meta)data are retrievable by their identifier using a standardized communications protocol • A1.1 the protocol is open, free, and universally implementable • A1.2 the protocol allows for an authentication and authorization procedure, where necessary • A2. metadata are accessible, even when the data are no longer available (Mons, B., et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data,
  32. 32. FAIR Guiding Principles (2) • To be Interoperable: • I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. • I2. (meta)data use vocabularies that follow FAIR principles • I3. (meta)data include qualified references to other (meta)data • To be Reusable: • R1. meta(data) are richly described with a plurality of accurate and relevant attributes • R1.1. (meta)data are released with a clear and accessible data usage license • R1.2. (meta)data are associated with detailed provenance • R1.3. (meta)data meet domain-relevant community standards (Mons, B., et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data,
  33. 33. What data are in scope? Criteria for preservation, publication?  Typology of things that are in scope of research data policies:  Data underpinning research conclusions presented the literature must be published.  Significant data produced by research projects should be made available.  Major data collection/creation exercises with evident multiple uses (census, Hubble etc).  Unrepeatable observations, measurements in nature or society? (preserve and publish)  Data created by in vitro experiments that can be reproduced and for which the instruments are being improved (perhaps very limited reasons for preserving)  Data collected for a given purpose (traffic management, customer relations, ships logs) but which could be used for research…  Need for collaborative approach (with researchers) to clarifying the criteria to keeping, publishing and discarding data.
  34. 34. Commission on Data Standards for Science  Major transdisciplinary research issues depend on the integration of data and information from different sources.  Fundamental importance of agreed vocabularies and standards.  Fundamental to integration of social science, geospatial and other data  Essential to effective interface of science and monitoring (e.g. Sendai, SDGs, sustainable cities)  LOD for Disaster Research, Nanomaterials Uniform Description System  Huge opportunities but significant challenges.  The ICSU and ISSC, any merged Council, and international scientific unions could have a major role to play to encourage and accelerate these developments.
  35. 35. Commission on Data Standards for Science  ‘Inter-Union Workshop on 21st Century Scientific and Technical Data Developing a roadmap for data integration’, Paris, 19-20 June:  Representatives of International Scientific Unions: IUCr for CIF; IUPAC for chemical terminologies; IUGS for GeoSciML; etc.  Representatives of Standards Organisations: e.g. Darwin Core, for biology, biodiversity; DDI for social science surveys; OGC for geospatial data; W3C for the web.  Position paper in development.  Directory of activities involving international scientific unions.  Maturity model for vocabularies and standards.  Case studies of applications of vocabularies and standards for transdiciplinary research.  Larger follow-up workshop 13-15 November, Royal Society, London.  Vision of a decadal initiative to advance science through integration of data and information.
  36. 36. CODATA-RDA School of Research Data Science • Contemporary research – particularly when addressing the most significant, transdisciplinary research challenges – increasingly depends on a range of skills relating to data. These skills include the principles and practice of Open Science and research data management and curation, the development of a range of data platforms and infrastructures, the techniques of large scale analysis, statistics, visualisation and modelling techniques, software development and data annotation. The ensemble of these skills, relating to data in research, can usefully be called ‘Research Data Science’.
  37. 37. Foundational Curriculum Seven components: open science, data management and curation; software carpentry; data carpentry; data infrastructures; statistics and machine learning; visualisation. Builds on much existing courses to create something more than the sum of its parts:  Open Science – reflection on ethos and requirements of sharing/openness  Open Research Data – Basics of data management, DMPs, RDM life-cycle, data publishing, metadata and annotation  Author Carpentry – Improving research efficiency with command line and OS tools.  Software Carpentry – Introduction the Unix shell and Git (sharing software and data)  Data Carpentry – Introduction to programming in R, and to SQL databases  Visualisation – Tools, Critical Analysis of Visualisation  Analysis – Statistics and Machine Learning (clustering, supervised and unsupervised learning)  Computational Infrastructures – Introduction to cloud computing, launching a Virtual Machine on an IaaS cloud Building international network of short courses Programme and materials: ;
  38. 38. CODATA-RDA Data Science Training Initiative • Annual foundational school hosted at ICTP, Trieste (with the objective to build a network of partners, train-the- trainers). • Advanced workshops, ICTP, Trieste, following the foundational school. • National or regional schools, organised with local partners. • Planning at least two pilot schools as part of the African Open Science Platform project: ICTP, Kigale? Gaborone for IDW? • Next #DataTrieste Summer School, 6-17 August 2018. • Next #DataTrieste Advanced Workshops 20-24 August 2018. • Next regional foundational school ‘CODATA-RDA School of Research Data Science’, São Paulo, 4-15 December 2017: