2. Data life cycle
The life cycle of data depends on project aims and purpose.
Planning / project design
Finding / creating the data
Extracting, Transforming and Loading
Processing
Analyzing data – information – publication
Data associated with a study can be reused
4. Data access and data sharing
• What do you expect when we access data?
• What do you expect when we share data?
• These are two sides of the same coin
5. Open access data policy
• Data created from research are valuable resources that can be used and reused for future scientific and educational purposes. Sharing data facilitates new scientific inquiry, avoids duplicate data collection and provides real-life resources for education and training.
OR
• Publicly funded research data should be, as far as possible, openly available to the scientific community.
6. What does this achieve?
• Encourages scientific enquiry and debate
• Promotes innovation and potential new data uses
• New collaborations between users and creators of data
• Maximises transparency and accountability
• Enables scrutiny of research findings
• Encourages improvement and validation of research findings
• Reduces cost of duplicating data collection
• Increases visibility of research
• Provides direct credit to researcher
• Research outcome for education and training
7. Encouraged by
• Research funders: under guidance from the OECD, funders have developed data sharing policies that allow researchers exclusive use of data for a limited time, with a mandate to publish at the end of the agreed period. This can be done via repositories or data centers. The funders also require a data management and sharing plan.
• Journals: data that forms the basis of a publication needs to be shared or deposited within an accessible database or repository.
• Initiatives like the DataCite registry assign unique digital object identifiers (DOIs) to research data, helping scientists make data discoverable, citable and traceable, so that research data, as well as publications based on those data, form part of the scientific output.
• Use of metadata-dependent URIs to identify and share data
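To make the DOI point concrete, here is a minimal sketch of turning a dataset DOI into a resolvable link and a citation line. It relies on the fact that DOIs resolve through https://doi.org/. The pattern used to check the DOI is a pragmatic approximation, and the function names, citation layout and example values (creator, title, repository, DOI) are hypothetical, not taken from the slides.

```python
import re
from urllib.parse import quote

# Common shape of a DOI: "10.", a registrant code, "/", then a suffix.
# This is a pragmatic check, not the complete DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Turn a DOI string into an https://doi.org/ link."""
    cleaned = doi.strip()
    # Accept the common "doi:" prefix and drop it.
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a DOI-like string: {doi!r}")
    # Percent-encode unusual characters but keep "/" between prefix and suffix.
    return "https://doi.org/" + quote(cleaned, safe="/")


def cite_dataset(creator: str, year: int, title: str, repository: str, doi: str) -> str:
    """Build a simple citation line (hypothetical layout) ending in the DOI link."""
    return f"{creator} ({year}). {title}. {repository}. {doi_to_url(doi)}"


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(cite_dataset("Example, A.", 2015, "Example survey data",
                       "Example Repository", "doi:10.1234/example.dataset.1"))
```

Because the citation ends in a persistent link rather than a project web address, the dataset stays citable even if it later moves between servers.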
8. How to share / access data
• Specialist data centers, archives or data banks
• Journal, to support a publication
• Institutional repository
• Online via project or institutional website
• Informally between researchers on a peer-to-peer basis
URI identifies data
9. Advantages of depositing data with a data center or repository
• Assurance that data meets set standards
• Long-term preservation of data in standardised, accessible formats, with format conversion when software is upgraded
• Safekeeping with attribution in a secure environment
• Regular data backup
• Online resource discovery through catalogues
• Access in popular formats
• Licensing arrangements to acknowledge data rights
• Standardised citation mechanism to acknowledge data ownership
• Promotion of data to many users
• Monitoring secondary usage of data
• Management of access to data and user queries on behalf of the data owner
10. So we need to share data & shared data is available to us
11. What affects sharing / accessing data
Size of data and compute
Community-developed data standards
Existing repositories or storage facilities
Nature of data
Appropriate data tracking and governance
Key management points
Metadata
12. Size of data
Decides what kind of storage / archival is used
Cloud storage
OK for data that does not go into terabytes or does not have restrictions
Cost implications
Available as DaaS, SaaS, PaaS, IaaS
Static storage: cluster-based computing / storage
Geographical restrictions
Provides compute for analysis, since big data does not move
Good access control?
13. Compute for analysis
• Once there is data access, a decision needs to be made on how much compute is required for analysis.
• Cloud-based solutions are available for small-scale data
• Data centers like Aimes allow for compute on clusters
• Institute / repository may provide HPC as well as software for analysis
14. Community-developed data standards
An active collaborative community is essential for development of community standards
The standards are required for
– format(s) for data storage / exchange
– vocabulary for data representation
Absence of community standards?
Catalogues can be found at:
http://www.ebi.ac.uk/ols/index
http://bioportal.bioontology.org/
15. Existing data repositories / storage
• Topic-specific repositories will give maximum exposure to the data / access to relevant data
• Issue with multiple repositories – collaborative approaches to repositories, e.g. RCSB for structure data
• Absence of repositories?
• http://datacite.org/repolist
• http://databib.org
16. Nature of data
• This decides whether the data can be open access or controlled access.
• There may be further geographical restrictions on the data.
• If controlled access is required, there is a need for development of Data Access Agreements & Application Forms.
• Management of the access control
17. Approaches to secure access
• DAC-controlled access, with or without monitoring
• Highly controlled access where only analysis results can be taken away – DataSHIELD
18. Roles and responsibilities
Particularly important where sensitive data, personal data or patent data are involved.
Appropriate consents and ethics need to be in place
Sometimes only processed, anonymized data can be used.
• Requires the establishment of DAC and MC
– Manages applications
– Approves applications
– Manages access
– Manages destruction of data if required
20. Data management planning
• Plan ahead to create high-quality and sustainable data that can be shared
• This will need checking periodically to see that the plan still meets requirements
Available resources:
https://dmponline.dcc.ac.uk
http://www.mrc.ac.uk/documents/doc/data-management-plan-template/
22. Metadata
• What is metadata?
– Documentation and description associated with data
– Required to make sense of the data, e.g. description of variables, classification scheme, dates and project.
There are metadata standards, e.g. Dublin Core, Darwin Core, OECD minimal data set, AGROVOC
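As an illustration of what a metadata record can look like, the sketch below describes a dataset using element names from the Dublin Core Metadata Element Set and checks it before deposit. The list of required elements is a hypothetical project rule, not part of Dublin Core, and every value in the example record is made up.

```python
import json

# The 15 element names of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical project rule: elements every record must fill in.
REQUIRED = ("title", "creator", "date", "description", "identifier")


def check_record(record):
    """Return (unknown element names, missing required elements)."""
    unknown = sorted(key for key in record if key not in DUBLIN_CORE_ELEMENTS)
    missing = [key for key in REQUIRED if not record.get(key)]
    return unknown, missing


# Example record; all values are invented for illustration.
record = {
    "title": "Example field survey, site A",
    "creator": "Example, A.",
    "date": "2015-03-01",
    "description": "Variables: plot_id, species, count. Classification: example scheme v1.",
    "identifier": "example-survey-001",
    "format": "text/csv",
    "rights": "Controlled access under a data access agreement",
}

if __name__ == "__main__":
    unknown, missing = check_record(record)
    print("unknown elements:", unknown)
    print("missing required elements:", missing)
    # Store the record next to the data file so the two travel together.
    print(json.dumps(record, indent=2))
```

Keeping the description of variables and the classification scheme inside the record is what lets a later user make sense of the data without contacting the creator.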
24. Formatting your data
• Different formats are good for different purposes
• Open formats adopted by the community are more sustainable, e.g. rtf, tif, wav, xml, csv
• Proprietary and/or compressed formats that have widespread use, e.g. doc, jpg, mp3, gzip
• Organising files and folders
• Quality assurance
• Version control and authenticity
• Transcription
Available resources
26. Storing your data
• Keep your digital data safe, secure and recoverable
• Making backups: at least 2
• Institutional back-up policies
• Manage backups: snapshots, integrity, recoverability
• Data storage strategy
• Data security
• Security of personal data
• Data destruction / disposal
• Data transmission and encryption
• File sharing and collaborative environments: email, Dropbox, ftp, encrypted media, file store, VREs ...
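The backup points above (at least two copies, integrity, recoverability) can be sketched as follows: copy a file to several backup locations and use SHA-256 checksums to confirm that each copy still matches the original. The function names and directory layout are assumptions for illustration only; an institutional backup service would normally handle this.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, backup_dirs):
    """Copy source into each backup directory and check every copy."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)  # copies content and timestamps
        if sha256_of(target) != expected:
            raise IOError(f"backup copy does not match original: {target}")
        copies.append(target)
    return expected, copies


def verify(copies, expected):
    """Return the copies that are missing or no longer match the checksum."""
    return [path for path in copies
            if not Path(path).is_file() or sha256_of(path) != expected]
```

Recording the checksum returned by `back_up` alongside the data lets `verify` be rerun later, for example before old copies are destroyed, to confirm that at least two good copies still exist.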
29. Resources for archiving data
• Dryad – Dryad is an international repository of data underlying peer-reviewed articles in the basic and applied biosciences.
• The Dataverse Network – The Dataverse Network is an open source application to publish, share, reference, extract and analyze research data. (Harvard)
30. Destroy data
• Physical destruction
• Overwriting
• Demagnetising the storage
• Disc destruction
• Purging the printers and other devices
31. Best Practices
• Make a DMP
• Use standard vocabulary
• Standardised format
• Check institutional policy for data storage and exchange
• Check funder's policy for data exchange
• Check legal constraints and requirements
• Make data available under a DAA
• Written policy for retention and disposal of data
• Safe and secure sharing of data
32. Strategies for centers
• Provide management framework for researchers
Some sources are:
UK Data Archive
Boston University
Melbourne
Data Curation Center