If Big Data is data that exceeds the processing capacity of conventional systems, thereby necessitating alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address.
The DCC is currently working with 18 HEIs to support and develop their capabilities in the management of research data and, whilst the aforementioned challenge is not usually core to their expressed concerns, are there particular issues of curation inherent to Big Data that might force a different perspective?
We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but the inability to provide sufficient processing capacity is not one of their more frequent complaints.
That’s not to say that Big Science and its Big Data are free of challenges in data curation; only that they are shared with their lesser cousins, where one might say that the real challenge is less one of size than diversity and complexity.
This brief presentation explores those aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.
An update on BeSTGRID activity and plans, in particular in preparation for the planned future developments of a unified approach to high performance and distributed computing in NZ.
Presentation on the work we've done within BeSTGRID as it relates to bioinformatics in NZ, for the 2010 Bioinformatics Symposium https://www.bestgrid.org/NZ-Bioinformatics-Symposium-2010
Presentation from the 2013 Bio-IT World conference. It describes the design and implementation of data and compute infrastructure for the New York Genome Center.
A description of software as infrastructure at NSF, and how Apache projects may be similar. What lessons can be shared from one organization to the other? How does science software compare with more general software?
A brief overview of the development and current workflows for Research Data Management at Imperial College London, presented to colleagues at the University of Copenhagen and Roskilde University in Denmark.
OpenData Public Research
Open Access Events: The Case for Open Data, Why you should Care
Map & Data Library - 5th Floor Robarts Library, University of Toronto
Thursday, Oct. 25 from 10:00-12:00
Organized by Data and Map Librarians, Marcel Fortin and Berenica Vejvoda
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup (Edward Curry)
Data management efforts such as Master Data Management and Data Curation are popular approaches to high-quality enterprise data. However, Data Curation can be heavily centralised and labour intensive, and the cost and effort can become prohibitively high. The concentration of data management and stewardship onto a few highly skilled individuals, like developers and data experts, can be a significant bottleneck. This talk explores how to effectively involve a wider community of users in big data management activities. The bottom-up approach of involving crowds in the creation and management of data has been demonstrated by projects like Freebase, Wikipedia, and DBpedia. The talk discusses how crowdsourcing data management techniques can be applied within an enterprise context.
Topics covered include:
- Data Quality And Data Curation
- Crowdsourcing
- Case Studies on Crowdsourced Data Curation
- Setting up a Crowdsourced Data Curation Process
- Linked Open Data Example
- Future Research Challenges
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts (Beth Plale)
Invited talk at the TRUST Women's Institute for Summer Enrichment (WISE), Cornell, NY, Jun 16, 2014. Infrastructure support for text mining research of a big data repository like HathiTrust raises challenges in access and security when the bulk of the repository is protected by copyright.
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021 (dkNET)
Abstract
Good data stewardship is the cornerstone of knowledge, discovery, and innovation in research. The FAIR Data Principles address data creators, stewards, software engineers, publishers, and others to promote maximum use of research data. The principles can be used as a framework for fostering and extending research data services.
This talk will provide an overview of the FAIR principles and the drivers behind their development by a broad community of international stakeholders. We will explore a range of topics related to putting FAIR data into practice, including how and where data can be described, stored, and made discoverable (e.g., data repositories, metadata); methods for identifying and citing data; interoperability of (meta)data; best-practice examples; and tips for enabling data reuse (e.g., data licensing). Practical examples of how FAIR is applied will be provided along the way.
Presenter: Christopher Erdmann, Engagement, support, and training expert on the NHLBI BioData Catalyst project at University of North Carolina Renaissance Computing Institute
dkNET Webinars Information: https://dknet.org/about/webinar
DEVELOPING A KNOWLEDGE MANAGEMENT SPIRAL FOR THE LONG-TERM PRESERVATION SYSTE... (cscpconf)
The goal of long-term preservation (LTP) is to sustain archives for the foreseeable future. These efforts are hampered primarily by the lack of standards, formal methodology, and workflow models for archiving. This research aims to explore the LTP of various kinds of documents, independent of the passage of time and of changes in technique within digital environments. Basic requirements arise from the integration of storage management and information management: securing the preservation of data, metadata, indexes, etc. This paper presents the evolutionary development of an LTP process for governmental archive and knowledge management. Further tasks include effective search across resources, efficient storage of and access to data, recovery drawing on co-located backups, and dynamic regulation of authentication and security management. A pilot Semantic Data Grid and service matching mechanisms are then described, in which the ontology plays a crucial role.
Introduction to research data management; Lecture 01 for GRAD521 (Amanda Whitmire)
Lesson 1: Introduction to research data management. From a series of lectures from a 10-week, 2-credit graduate-level course in research data management (GRAD521, offered at Oregon State University).
The course description is: "Careful examination of all aspects of research data management best practices. Designed to prepare students to exceed funder mandates for performance in data planning, documentation, preservation and sharing in an increasingly complex digital research environment. Open to students of all disciplines."
Major course content includes: Overview of research data management, definitions and best practices; Types, formats and stages of research data; Metadata (data documentation); Data storage, backup and security; Legal and ethical considerations of research data; Data sharing and reuse; Archiving and preservation.
See also, "Whitmire, Amanda (2014): GRAD 521 Research Data Management Lectures. figshare. http://dx.doi.org/10.6084/m9.figshare.1003835. Retrieved 23:25, Jan 07, 2015 (GMT)"
This presentation was delivered at the Elsevier Library Connect Seminar on 6 October 2014 in Johannesburg, 7 October 2014 in Durban and 9 October 2014 in Cape Town and gives an overview of the potential role that librarians can play in research data management
Building the FAIR Research Commons: A Data Driven Society of Scientists (Carole Goble)
Science is knowledge work. The scientific method and scholarly communication are about facilitating “knowledge turns” – that is, the turning of observation and hypothesis through experimentation, comparison, and analysis into new, pooled knowledge. Turns depend on the FAIR flow and availability of data, methods for automated processing, reproducible results and on a society of scientists coordinating and collaborating. We need to build a new form of Research Commons and I will present my steps towards this.
Presented at Symposium: The Future of a Data-Driven Society, Maastricht University, 25 Jan 2018 that accompanied the 42nd Dies Natalis where I was awarded an honorary doctorate
Personal video:
https://www.youtube.com/watch?v=k5WN6KDDatU&index=4&list=PLzi-FBaZlOOagma5dCW7WSA5lv22tmNMD
Video of the symposium:
https://www.youtube.com/watch?v=JN9eMMtCHf8&t=19s&index=6&list=PLzi-FBaZlOOagma5dCW7WSA5lv22tmNMD
Data accessibility and the role of informatics in predicting the biosphere (Alex Hardisty)
The variety, distinctiveness and complexity of life – biodiversity in other words and by implication the ecosystems in which it is situated – is our life support system. It is absolutely essential and more important than almost everything else but it is typically taken for granted. Today’s big societal challenges – food and water security, coping with environmental change and aspects of human health – are beyond the abilities of any one individual or research group to solve. Solving them depends not only on collaboration to deliver the appropriate scientific evidence but increasingly on vast amounts of data from multiple sources (environmental, taxonomic, genomic and ecological) gathered by manual observation and automated sensors, digitisation, remote sensing, and genetic sequencing. In April 2012 we called the biodiversity and ecosystems research communities to arms to formulate a consensus view on establishing an infrastructure to improve the accessibility of the ever-increasing volumes of biological data. We published the whitepaper: “A decadal view of biodiversity informatics: challenges and priorities” that has since been viewed more than 24,000 times. We envisage a shared and maintained multi-purpose network of computationally-based processing services sitting on top of an open data domain. By open data domain we mean data that is accessible i.e., published, registered and linked. BioVeL, pro-iBiosphere, ViBRANT and other FP7 funded projects have all explored aspects of this vision.
Introduction to research data management (Michael Day)
Slides from a presentation given at the JIBS User Group / RLUK joint event "Demystifying research data: don't be scared, be prepared" held at the SOAS Brunei Gallery, London, 17 July 2012.
How the University of Waterloo Centre for Education in Mathematics and Computing is using Maplesoft’s experience and technologies, Maple, Maple T.A. and Maple.net, to bring STEM Courses Online in an environment that includes: natural maths notation, visualizations and assessment. Presented to Eduserv's Maths and Stats Software Group December 2014
A talk delivered by Ivan Harris at the London G-Cloud meet-up, January 2014.
Topics covered:
• Government security classifications
• PSN connectivity
• Hybrid clouds
• Application development
An introduction to Eduserv's UMF Cloud Pilot scheme for higher education. Eduserv has created the Education Cloud as part of the University Modernisation Fund (UMF) cloud projects.
This presentation reviews the cloud infrastructure and pricing created for the cloud.
This presentation was delivered by Andy Powell at our shared services in HE event on 24 November 2011.
A talk delivered by Dr Tim Cockle at Public Sector Enterprise 2013.
The presentation looks at how various agile techniques can be applied in public sector organisations.
Topics covered:
• Understanding what agile means for you
• What are the core principles and implications
• How to find the balance when moving towards agile
A presentation given by Steve Warburton of KCL at the Where Next for Digital Identity event organised by Eduserv and held at the British Library in January 2010.
Case study: Building a business case for cloud, migration in practice and spr... (Eduserv)
A talk delivered by Rocco Labellarte, Head of Technology and Change Delivery, Corporate Services at Royal Borough of Windsor and Maidenhead. This presentation was given at Cloud Control: Implementing Cloud Computing, a seminar hosted by Civil Service World and Eduserv.
Topics covered include the process of building a case for cloud, the benefits and lessons learnt.
Supporting Libraries in Leading the Way in Research Data Management (Marieke Guy)
Marieke Guy, Institutional Support Officer, Digital Curation Centre, UKOLN, University of Bath, UK, presents on Supporting Libraries in Leading the Way in Research Data Management at Online Information, London, 20th-21st November 2012.
High Performance Data Analytics and a Java Grande Run Time (Geoffrey Fox)
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high-performance Java (Grande) runtime that supports both simulations and big data.
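The clustering mentioned among the examples above is one of the canonical data-analytics kernels such benchmarks exercise. A minimal k-means sketch (a generic illustration in pure Python for clarity, not code from the talk, which targets high-performance runtimes) shows the iterative assign-then-update structure that these runtimes must parallelise:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: bucket points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to its cluster mean
        # (keep the old centroid if a cluster is empty).
        centroids = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
cents, cls = kmeans(pts, 2)
```

At scale, both the assignment and update steps become data-parallel reductions, which is exactly where a high-performance runtime pays off.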
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro... (CINECAProject)
We live in an era of cloud computing. Many of the services in the life sciences are keenly planning cloud transformations, seeking to create globally distributed ecosystems of harmonised data based on standards from organisations like GA4GH. CINECA faces similar challenges, gathering cohort datasets from all over the globe, many of which are pinned in place, due to their size, legal restrictions, or other considerations. But is “bringing compute to the data” always the right choice? In this webinar, based on experiences from the Human Cell Atlas Data Coordination Platform and other projects from EMBL-EBI, we will explore the concept of “data gravity”: The idea that whilst there are forces that may hold data in one place, there are others that require it to be mobile. We’ll consider how effectively planning a cloud strategy requires consideration of the gravity of datasets, and the impact it may have on team skills required, incentives for good practice, and storage and compute costs.
The CINECA webinar series aims to discuss ways to address common challenges and share best practices in the field of cohort data analysis, as well as to disseminate CINECA project results. All CINECA webinars include an audience Q&A session during which attendees can ask questions and make suggestions. Please note that all webinars are recorded and available for later viewing.
This webinar took place on 12th November 2020 and is part of the CINECA webinar series.
For previous and upcoming CINECA webinars see:
https://www.cineca-project.eu/webinars
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen... (Sirris)
This lecture highlights current trends, challenges and opportunities related to the emergence of large amounts of data. It also presents Sirris’s recent research activities in this domain.
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...) (datacite)
2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
Meeting Federal Research Requirements for Data Management Plans, Public Acces... (ICPSR)
These slides cover evolving federal research requirements for sharing scientific data. Provided are updates on federal agency responses to the 2013 OSTP memo, guidance on data management plans, resources for data management and curation training for staff/researchers, and tips for evaluating public data-sharing services. ICPSR's public data-sharing service, openICPSR, is also presented. Recording of this presentation is here: https://www.youtube.com/watch?v=2_erMkASSv4&feature=youtu.be
To foster greater and more consistent use of the new 100 Gbps connections being deployed in the national RNP backbone, the e-Cyber project aims to deliver high-performing services to the most infrastructure-demanding research centers in Brazil. To do this, the project draws inspiration from the “superfacility” concept adopted by initiatives like GRP (Global Research Platform) and EOSC (European Open Science Cloud). However, one of our biggest challenges is to engage the client institutions and bring them to co-create solutions and participate in the project governance.
Stuart Macdonald steps through the process of creating a robust data management plan for researchers. Presented at the European Association for Health Information and Libraries (EAHIL) 2015 workshop, Edinburgh, 11 June 2015.
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving... (Sarah Anna Stewart)
Presentation given at the M25 Consortium of Academic Libraries, CPD25 Event on 'The Role of the Library in Supporting Research'. Provides an introduction to data, software and PIDs and a brief look at how libraries can enable researchers to gain impact and credit for their research data and software.
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT (Tony Ross-Hellauer)
OpenAIRE and EUDAT co-present this webinar which aims to introduce researchers and others to the concept of research data management (RDM). As well as presenting the benefits of taking an active approach to research data management – including increased speed and ease of access, efficiency (fund once, reuse many times), and improved quality and transparency of research – the webinar will advise on strategies for successful RDM, resources to help manage data effectively, choosing where to store and deposit data, the EC H2020 Open Data Pilot and the basics of data management, stewardship and archiving.
Webinar recording available: http://www.instantpresenter.com/eifl/EB57D6888147
Phase two of OpenAthens SP evolution including OpenID Connect option (Eduserv)
David Orrell, System Architect and Phil Leahy, Service Relationship Manager, talk about Phase II of the OpenAthens Cloud Service Provider project, and also about how OpenAthens is being used as an identity provider service in the corporate sector.
Tim Lull, Vice President of Sales, and Gar Sydnor, Vice President of Discovery Innovation, showcase EBSCO and how this product benefits the identity and access management community.
Phil Leahy, Service Relationship Manager covers our commitment to the publishing community as part of our Publisher Manifesto. David Orrell, System Architect, runs through phase one of our new service provider product.
Neil Scully, Head of Development and Service Delivery, shares the AGILE SCRUM and SPRINT process used in our product development methodology and the benefits this brings.
Tracy Gardner from Simon Inger Consulting presents the results of their 12 month research project, which included a survey of how over 40,000 readers discover scholarly content. The findings are pertinent to publishers and information professionals alike across sectors.
Jon Bentley, Commercial Director, shares the vision for our products, explains our brand evolution and presents key milestones in the development of our identity and access management (IAM) solutions. He also highlights the range of applications that work with OpenAthens.
Mike Brooksbank, Executive Director of OpenAthens, runs through the schedule of the day, plus an overview of OpenAthens and Eduserv, our last FY year and the year ahead.
Eduserv's Marketing Manager, Alex Bacon, presented at the B2B Network about his experience of content marketing and how to deliver valuable and engaging content to your audiences whilst generating leads at the same time.
This presentation by Jonathan Watkins of Maplesoft and the University of Birmingham was given to the Eduserv Maths and Stats Software Focus Group in June 2016. Möbius is a comprehensive online courseware environment that focuses on science, technology, engineering, and mathematics (STEM). Students can explore important concepts using engaging, interactive applications, visualize problems and solutions, and test their understanding by answering questions that are graded instantly.
This presentation was given to the Eduserv Maths and Stats Software Focus Group in June 2016. It focuses on updates to NVivo 11 for Windows and Mac, the new QSR Certification Programme and how QSR and the academic community might work more closely together.
Nick Wallace, Government Analyst, Public Sector Ovum
Momentum for the adoption of cloud services continues to grow in the public sector as services mature and agencies' experience in buying and using cloud services grows. As agencies steadily incorporate various cloud components into their environment, it is clear that public sector organisations are starting to realise the benefits of cloud. In fact, if one were creating a “greenfield” service, “in the cloud” would be the default approach. However, the reality is that most institutions are not in this position. Most have to manage a legacy environment that comprises aging technology and duplicate, inefficient and inconsistent business processes. Developing and implementing a staged migration to cloud will be pivotal in determining whether the “as-a-service” promise facilitates innovation or undermines organisational integrity.
Planning your cloud strategy: Adur and Worthing Councils – Eduserv
Paul Brewer, Director for Digital & Resources at Adur & Worthing Council.
How do you assess your organisation's readiness to move to the cloud and adopt new platforms to drive business change? Paul Brewer from Adur and Worthing Councils will share how they evaluated whether cloud was right for them. The talk will cover how they assessed the benefits, costs and risks of moving to the cloud, and how they used this assessment to support and build their cloud strategy.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 – Tobias Schneck
As AI technology pushes into IT, I asked myself, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply AI to our own infrastructure and get it working from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... – DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
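The workshop above centres on PowSyBl's power-flow tooling. As a purely illustrative sketch (this is not PowSyBl's API; the three-bus network, reactances and injections are invented for the example), the following pure-Python snippet shows the kind of DC power-flow calculation such tools perform:

```python
# Toy DC power flow for a hypothetical 3-bus network (illustrative only;
# real tools such as PowSyBl solve full AC/DC flows on much larger grids).
# Bus 0 is the slack bus; injections are in per-unit (+ = generation).

lines = {(0, 1): 0.1, (1, 2): 0.1, (0, 2): 0.1}  # per-unit reactance x per line
injections = {1: 0.5, 2: -1.0}                   # bus 1 generates, bus 2 consumes

# DC approximation: P = B * theta, with theta at the slack bus fixed to 0.
# Build the 2x2 susceptance matrix for the non-slack buses 1 and 2.
b01, b12, b02 = (1 / lines[k] for k in [(0, 1), (1, 2), (0, 2)])
B = [[b01 + b12, -b12],
     [-b12, b12 + b02]]
P = [injections[1], injections[2]]

# Solve the 2x2 linear system by Cramer's rule.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
theta1 = (P[0] * B[1][1] - B[0][1] * P[1]) / det
theta2 = (B[0][0] * P[1] - P[0] * B[1][0]) / det
theta = {0: 0.0, 1: theta1, 2: theta2}

# Line flows follow from the voltage-angle differences.
flows = {(i, j): (theta[i] - theta[j]) / x for (i, j), x in lines.items()}
print(flows)
```

In a real PowSyBl session the same idea is exposed through its Python binding at a much higher level, alongside the network editing, visualization and security-analysis features listed above.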
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... – Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes work: it takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview – Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GraphRAG is All You need? LLM & Knowledge Graph – Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI support – Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 4 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
"Impact of front-end architecture on development cost", Viktor Turskyi – Fwdays
I have heard many times that architecture is not important for the front end. I have also often seen developers implement front-end features by simply following a framework's standard rules, believing that this is enough to launch the project successfully, and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyse which approaches have worked for me and which have not.
Search and Society: Reimagining Information Access for Radical Futures – Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Graham Pryor
1. Because good research needs good data
Big data
– no big deal for curation?
Graham Pryor, Associate Director, UK Digital Curation Centre
Eduserv Symposium 2012: Big Data, Big Deal?
This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
2. Big data – big deal or same deal?
“What need the bridge much broader than the flood?
The fairest grant is the necessity.
Look, what will serve is fit…”
Much Ado About Nothing, Act 1 Scene 1
3. Eduserv Symposium 2012 –
Speakers’ Research Areas
• Operating Systems & Networking
• Computer and Network Security
• Distributed Systems
• Mobile Computing
• Wireless Networking
• Software Engineering
• High performance compute clusters
• Cloud and grid technologies
• Effective management of large clusters and
cluster file-systems
• Very large database systems (architecture,
management and application optimization)
4. The Digital Curation Centre
• a consortium comprising units from the Universities of Bath
(UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)
• launched 1st March 2004 as a national centre for solving
challenges in digital curation that could not be tackled by
any single institution or discipline
• funded by JISC to build capacity, capability and skills in
research data management across the UK HEI community
• awarded additional HEFCE funding 2011/13 for
• the provision of support to national cloud services
• targeted institutional development
5. Three perspectives
Scale and complexity
– Volume and pace
– Infrastructure
– Open science
Policy
– Funders
– Institutions
– Ethics & IP
Management
– Storage
– Incentives
– Costs & Sustainability
http://www.nonsolotigullio.com/effettiottici/images/escher.jpg/
6. Challenges of scale and complexity
• Globally, >100,000 neuroscientists study the CNS, generating massive, intricate and highly interrelated datasets
• Analysts require access to these data to develop algorithms, models and schemata that characterise the underlying system
• Resources and actors are rarely collocated and are therefore difficult to combine.
• The virtual laboratory is a federation of server nodes that allows distributed data to be stored local to acquisition
• Analysis codes can be uploaded and executed on the nodes so that derived datasets need not be transported over low-bandwidth connections
• Data and analysis codes are described by structured metadata, providing an index for search, annotation and audit over workflows leading to scientific outcomes
• Users access the distributed resources through a web portal emulating a PC desktop
But this is only talking terabytes…
http://www.carmen.org.uk/
7. Big data? – The Large Hadron Collider
Searching for the Higgs Boson
• Predicted annual generation of around 15
petabytes (15 million gigabytes) of data
• Would need >1,700,000 dual layer DVDs
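The DVD comparison on this slide is easy to verify. A quick sanity check, assuming decimal petabytes and the nominal 8.5 GB capacity of a dual-layer DVD:

```python
# Sanity-check the slide's claim: 15 PB of annual LHC data as dual-layer DVDs.
PETABYTE_IN_GB = 1_000_000   # decimal units, as the slide appears to use
annual_data_gb = 15 * PETABYTE_IN_GB
dvd_capacity_gb = 8.5        # nominal dual-layer DVD capacity (assumed)

dvds_needed = annual_data_gb / dvd_capacity_gb
print(f"{dvds_needed:,.0f} dual-layer DVDs per year")
```

This comes out at roughly 1.76 million discs, consistent with the ">1,700,000" figure above.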
8. Big data – the GridPP solution
Crowd sourcing for the LHC
“With GridPP you need never have those data processing blues again…”
Home and office computer users can sign up to the LHC at home project (based at Queen Mary, University of London), which makes use of idle CPU time. So far, 40,000 users in more than 100 countries have contributed the equivalent of 3000 years on a single computer to the project.
http://www.gridpp.ac.uk/about
With the Large Hadron Collider running at CERN the grid is
being used to process the accompanying data deluge. The UK
grid is contributing more than the equivalent of 20,000 PCs to
this worldwide effort.
9. Yet… Data Preservation in High Energy Physics?
Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are in many cases unique. At the same time, HEP has no coherent strategy for data preservation and re-use, and many important and complex data sets are simply lost.
David M. South, on behalf of the ICFA DPHEP Study Group
arXiv:1101.3186v1 [hep-ex]
10. Big data in genomics
These studies are generating
valuable datasets which, due to
their size and complexity, need to
be skilfully managed…
11. There’s a bigger deal than big data…
Socio-technical, information systems and research practice perspectives on three stages of activity:
1.
• Identify drivers and champions
• Analyse stakeholders, issues
• Identify capability gaps
2.
• Inventory data assets
• Profile norms, roles, values
• Analyse current workflows
• Identify capability gaps
• Assess costs, benefits, risks
3.
• Produce feasible, desirable changes
• Evaluate fitness for purpose
Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
12. The DCC - building capacity and capability
through targeted institutional development
• 18 institutional engagements, 14 roadshows
• advice and assistance in strategy and policy
• use of curation tools for audit and planning
• training and skills transfer
13. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
14. http://www.flickr.com/photos/mattimattila/3003324844/
“Departments don’t have guidelines or
norms for personal back-up and researcher
procedure, knowledge and diligence varies
tremendously. Many have experienced
moderate to catastrophic data loss”
Incremental Project Report, June 2010
15. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
16. …researchers are
reluctant to adopt new tools and
services unless they know
someone who can recommend
or share knowledge about
them. Support needs to be
based on a close understanding
of the researchers’ work, its
patterns and timetables.
17. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
3. Many institutions are unprepared to meet
the increasingly prescriptive demands of
funders
18. EPSRC expects all those institutions it funds
• to have developed a roadmap aligning their policies
and processes with EPSRC’s nine expectations by
1st May 2012
• to be fully compliant with each of those expectations
by 1st May 2015
• to recognise that compliance will be monitored and
non-compliance investigated and that
• failure to share research data could result in the
imposition of sanctions
19. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
3. Many institutions are unprepared to meet
the increasingly prescriptive demands of
funders
4. …and legislators
20. Rules and regulations…
Compliance
• Data Protection Act 1998 – Rights, Exemptions, Enforcement
• Freedom of Information Act 2000 – Climategate, Tree Rings, Tobacco and…(what’s next?)
• Computer Misuse Act 1990
• etc. etc. etc………..
21. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
3. Many institutions are unprepared to meet
the increasingly prescriptive demands of
funders
4. …and legislators
5. The advantages from planning, openness
and sharing are not understood
22. Open to all? Case studies of openness
in research
Choices are made according to context, with
degrees of openness reached according to:
• The kinds of data to be made available
• The stage in the research process
• The groups to whom data will be made
available
• On what terms and conditions it will be
provided
Default position of most:
• YES to protocols, software, analysis tools,
methods and techniques
• NO to making research data content freely
available to everyone
After all, where is the incentive?
Angus Whyte, RIN/NESTA, 2010
24. Main institutional concerns
– Compliance
– Asset management
– Cost benefits
– Incentivisation
– Complexity of the data environment
And big data? There has been no mention yet of any specific challenge from big data, but… Institutions are providing resources to work on big data, both equipment and people, and more importantly… the issues central to effective data management are common across the data spectrum, irrespective of size.
25. Some current institutional engagements
• Assessing needs
• RDM roadmaps
• Policy development
• Policy implementation
• Piloting tools, e.g. DataFlow
26. Support offered by the DCC
The DCC support team and services offer:
• Assess needs: DAF & CARDIO assessments, workflow assessment
• Pilot RDM tools: institutional data catalogues
• Develop RDM policy
• Customised Data Management Plans
• Guidance and training
• Advocacy to senior management: make the case
• …and support policy implementation
28. Your Data as Assets: DAF
• What are the characteristics of your
research data assets?
– Number?
– Scale?
– Complexity?
– Dependencies?
– Liabilities?
• Why do researchers act the way they do
with respect to data?
• Which data do they need to undertake
productive research?
29. DMP Online is a web-based data management
planning tool that allows you to build and edit plans
according to the requirements of the major UK
funders.
The tool also contains helpful guidance and links for
researchers and other data professionals.
http://www.dcc.ac.uk/dmponline
30. An online tool for departments or research groups to
identify their current data management capabilities
and identify coordinated pathways to future
enhancement via a dedicated knowledge base.
CARDIO emphasises a collaborative, consensus-
driven approach, and enables benchmarking with
other groups and institutions.
http://cardio.dcc.ac.uk/
31. DRAMBORA is an audit methodology and tool for
identifying and planning for the management of risks
which may threaten the availability and/or usability of
content in a digital repository or archive.
http://www.repositoryaudit.eu
32. So, big data
– no big deal for curation?
• Yes, it’s big
• It’s also very complex
• There is no single technology solution
• Issues of human infrastructure are
possibly a bigger challenge
• But for big data aficionados the
technology challenges are big enough
33. Data Management – infrastructure
and data storage challenges...
Scalability
Cost-effectiveness
Security (privacy and IPR)
Robustness and resilience
Low entry barrier
Ease of use
Data-handling / transfer / analysis capabilities
The case for cloud computing in genome informatics.
Lincoln D Stein, May 2010