Thursday 10 May 2012
                                                 Eduserv Symposium: Big Data




JISC and the Big (Research) Data Challenge

Simon Hodson
JISC Programme Manager, Managing Research Data
Why is managing research data important?



JISC considers it a priority to support universities in improving the way
research data is managed and, where appropriate, made available for reuse.
Research funder policies, legislative frameworks, good practice, open data
agenda
 – The outputs of publicly funded research should be publicly available.
 – The evidence underpinning research findings should be available for
   validation
Good data management is good for research
 – More efficient research process, avoidance of data loss, benefits of data reuse

Alignment with university missions.
 – Universities want to provide excellent research infrastructure.
 – Universities want to have better oversight of research outputs.
Estimated Research Data Requirements


Two Russell Group Universities
  Estimated current data holdings of c.2PB (managed and unmanaged)
  Currently provide 800TB/300TB in a central storage facility, not all of which is
  used (but will be full in 12-18 months)…
  Significant amount of data in temporary storage, external drives etc…
  ‘the more groups we go to talk to, the more we're hearing of significant
  data holdings on external hard drives and small RAID systems’
1994 Group University
  No central research data provision.
  Faculties (medicine, business, humanities) have 20-30TB each.
  Engineering currently has 170TB faculty system, urgent need to expand.
  But… one group, recently interviewed, currently has 250TB, only half in
  ‘managed storage’; will reach PB levels in the next few years.
DUDs
  The data centre under the desk (or in a back pack) is not adequate.
Why manage research data?




Not just about storage or avoiding data loss…!
It’s about knowing what to keep and what to throw away…
Important to extract maximum return on investment from publicly
funded research.
Access to underlying data is essential for verification and therefore
research integrity.
Opportunities to extract more knowledge from existing data, new
analysis.
It’s about making the most out of data created!
Making Data Meaningful and Reusable
JISC and Research Data




1. Understanding the problem (pre-2007-2009)
2. Prototyping solutions (2009-11)
3. Hardening solutions and building institutional capacity (2011-13)
4. Developing elements of national infrastructure (2013+)
1: Understanding the Problem


Key JISC reports:
    Dealing with Data:
    http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/reports/dealing_with_data_report-final.pdf
    Keeping Research Data Safe:
    http://www.jisc.ac.uk/media/documents/publications/keepingresearchdatasafe0408.pdf
    Skills, Role, Career Structure of Data Scientists and Curators:
    http://www.jisc.ac.uk/media/documents/programmes/digitalrepositories/dataskillscareersfinalreport.pdf
Other:
    UKRDS Scoping Study:
    http://www.ukrds.ac.uk/resources/
Prototyping Solutions:
                                         First MRD Programme, 2009-11



RDM Infrastructure (guidance/support, systems)



RDM Planning (DMPs, best practice, disciplinary challenges)



               RDM Training (targeted at disciplinary needs)



               Challenges of data citation and publication



First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
Building Institutional Capacity:
                                              Second MRD Programme, 2011-13


RDM Infrastructure (policy, guidance/support, systems)
17 large projects




RDM Planning (DMPs, best practice, disciplinary challenges)



                     RDM Training (disciplines and libraries/research
                     support)

                     Innovative data publication


Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
Projects shortly to be announced for research data publication and developing RDM
training materials: http://bit.ly/jiscmrd-2012-Call
A holistic approach…



Leadership and Policy Development
Guidance and Training
Support for Data Management Planning
RDM Systems and Infrastructure
Publication, Citation and Discovery Mechanisms
How to develop RDM services
In development!
Why develop services?
Roles and responsibilities
Process of service development
The components / building blocks
•      Policy
•      Data Management Planning
•      Storage
•      Data registry.....
Getting started
Examples and case studies to develop into toolkit
Slide Credit: Sarah Jones and Martin Donnelly, DCC
Next steps? Elements of a national infrastructure




Journals are increasingly implementing policies requiring availability
of underlying data.
   Registry of Journal Data Policies to help researchers and research
   administrators understand the implications and changing landscape.
Universities are developing catalogues of research data holdings.
   National registry of research data to facilitate discovery, reuse; better
   understanding of impact and research landscape.
Thank You!




First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
Programme Blog: http://researchdata.jiscinvolve.org/
MRD Project Blogs: http://tiny.cc/MRDblogs
Twitter: #jiscmrd
E-mail: s.hodson@jisc.ac.uk
Acknowledgements for slides, content: Carol Goble, Liz Lyon, Peter Murray-Rust,
David Shotton, Martin Donnelly, Sarah Jones.
From prototype to platform…




 DataFlow Project: http://www.dataflow.ox.ac.uk/




UMF Programme SaaS for RDM Projects: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx
The JISC UMF DataFlow Project

DataStage is a file management system.
A DataStage data package consists of selected data files accompanied by an
RDF metadata manifest, with a SWORD v2 wrapper.
DataBank is a generic repository, and can be used to store things other than
research datasets, for example data management plans (DMPs).

Diagram: Researchers -> DataStage file system -> SWORD deposit ->
DataBank repository -> Researchers, other users
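
The package structure described on this slide (selected data files plus an RDF metadata manifest, wrapped for SWORD v2 deposit) can be illustrated with a minimal Python sketch. This is not DataStage's actual code or manifest layout, which the slides do not specify. The file name `manifest.rdf`, the choice of Dublin Core terms, the function names, and the example identifier are all assumptions made for illustration. The HTTP header names follow the SWORD v2 profile as commonly documented, and the sketch only builds them; it does not contact any repository.

```python
import hashlib
import io
import zipfile
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_manifest(package_id, title, creator, file_names):
    """Return a small RDF/XML manifest (bytes) listing the packaged files.

    Hypothetical layout: one rdf:Description for the package, with a
    dcterms:hasPart entry per data file.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": package_id}
    )
    ET.SubElement(desc, f"{{{DCTERMS_NS}}}title").text = title
    ET.SubElement(desc, f"{{{DCTERMS_NS}}}creator").text = creator
    for name in file_names:
        ET.SubElement(
            desc, f"{{{DCTERMS_NS}}}hasPart", {f"{{{RDF_NS}}}resource": name}
        )
    return ET.tostring(root, encoding="utf-8")


def build_package(files, manifest):
    """Zip the selected data files together with the manifest (in memory)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.rdf", manifest)  # assumed manifest file name
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def sword_headers(package_bytes, filename):
    """Build the HTTP headers a SWORD v2 binary deposit would typically carry."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(package_bytes).hexdigest(),
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "In-Progress": "false",
    }


if __name__ == "__main__":
    # Illustrative data only; not taken from any real project.
    files = {"results.csv": b"sample,value\n1,0.42\n"}
    manifest = build_manifest(
        "urn:example:package-1", "Example dataset", "A. Researcher", list(files)
    )
    package = build_package(files, manifest)
    for header, value in sword_headers(package, "package-1.zip").items():
        print(f"{header}: {value}")
```

A real deposit would POST the package bytes with these headers to a repository's SWORD collection URL, which is deployment-specific and therefore left out here.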
