Developing Data Services to Support eScience/eResearch

School of Information Studies
Syracuse University

2012 Priscilla M. Mayden Lecture
eScience and the Evolution of Library Services

Jian Qin
School of Information Studies
Syracuse University
http://eslib.ischool.syr.edu/

February 22, 2012

The morning ahead

An environmental scan
•  E-Science, cyberinfrastructure, and data
•  What do all these have to do with me?

Case study: The gravitational wave research data management

Group work: Role play in developing data management initiatives

Priscilla M. Mayden Lecture 2012, Utah

An environmental scan
•  E-Science, cyberinfrastructure, and data
•  What do all these have to do with me?

Overview of E-Science and Data
Characteristics of e-science
Data sets, data collections, and data repositories
Why does it matter to libraries?

E-Science

“In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.”

National e-Science Center. (2008). Defining e-Science.
http://www.nesc.ac.uk/nesc/define.html


Characteristics of e-science

•  Digital data driven
•  Distributed
•  Collaborative
•  Trans-disciplinary
•  Fuses pillars of science
   –  Experiment
   –  Theory
   –  Model/simulation
   –  Observation/correlation

Greer, Chris. (2008). E-Science: Trends, Transformations & Responses. In: Reinventing Science Librarianship: Models for the Future, October 2008. http://www.arl.org/bm~doc/ff08greer.pps


Shift in Science Paradigms

•  Thousand years ago: Science was empirical, describing natural phenomena
•  A few hundred years ago: Theoretical branch, using models and generalizations
•  A few decades ago: A computational approach, simulating complex phenomena
•  Today: Data exploration (eScience), unifying theory, experiment, and simulation
   –  Data captured by instruments or generated by simulator
   –  Processed by software
   –  Information/knowledge stored in computer
   –  Scientist analyzes database/files using data management and statistics

Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt


X-Informatics

•  The evolution of X-Informatics and Computational-X for each discipline X
•  How to codify and represent our knowledge

[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations; facts; questions; answers]

The Generic Problems
•  Data ingest
•  Managing a petabyte
•  Common schema
•  How to organize it
•  How to reorganize it
•  How to share with others
•  Query and Visualization tools
•  Building and executing models
•  Integrating data and Literature
•  Documenting experiments
•  Curation and long-term preservation

Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt


Useful resources

http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Part 2: Health and Wellbeing

•  The healthcare singularity and the age of semantic medicine
•  Healthcare delivery in developing countries: challenges and potential solutions
•  Discovering the wiring diagram of the brain
•  Toward a computational microscope for neurobiology
•  A unified modeling approach to data-intensive healthcare
•  Visualization in process algebra models of biological systems


FUNDAMENTALS OF DATA

What are data?
What are some of the major data formats?
Why data formats?


What are data? (1)

[Figure] An artist’s conception depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems.
http://www.nsf.gov/pubs/2007/nsf0728/nsf0728_4.pdf


What are data? (2)


Medical and health data

Standardization
Compliance
Security

http://www.weforum.org/issues/charter-health-data

The multi-dimensions of data

Research orientation
Data types
Data formats
Levels of processing


Scientific data formats

Common data format
Image formats
Matrix formats
Microarray file formats
Communication protocols


Scientific & medical data formats

•  Medical and Physiological Data Formats
   –  BDF: BioSemi data format (.bdf)
   –  EDF: European data format (.edf)
•  Molecular Biology data Formats
   –  PDB: Protein Data Bank format (.pdb)
   –  MMCIF: MMCIF 3D molecular model format (.cif)
•  Medical Imaging
   –  DICOM: DICOM annotated medical images (.dcm, .dic)
•  Chemical Formats
   –  XYZ: XYZ molecule geometry file (.xyz)
   –  MOL: MDL MOL format (.mol)
   –  MOL2: Tripos MOL2 format (.mol2)
   –  SDF: MDL SDF format (.sdf)
   –  SMILES: SMILES chemical format (.smi)
•  Bioinformatics Formats
   –  GenBank: NCBI GenBank sequence format (.gb, .gbk)
   –  FASTA: bioinformatics sequence format (.fasta, .fa, .fsa, .mpfa)
   –  NEXUS: NEXUS phylogenetic data format (.nex, .ndk)
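
As a rough illustration of how a data service might use the list above, the following Python sketch looks up a file's format family from its extension. The table is copied from the slide; the function name `identify_format` and the family labels are assumptions made for this example, and real files would usually need content inspection as well, since extensions can be missing or wrong.

```python
from pathlib import Path

# Extension-to-format table taken from the formats listed on the slide.
# Each value is (format family, format name).
FORMATS_BY_EXTENSION = {
    ".bdf": ("Medical and physiological", "BioSemi data format"),
    ".edf": ("Medical and physiological", "European data format"),
    ".pdb": ("Molecular biology", "Protein Data Bank format"),
    ".cif": ("Molecular biology", "MMCIF 3D molecular model format"),
    ".dcm": ("Medical imaging", "DICOM annotated medical images"),
    ".dic": ("Medical imaging", "DICOM annotated medical images"),
    ".xyz": ("Chemical", "XYZ molecule geometry file"),
    ".mol": ("Chemical", "MDL MOL format"),
    ".mol2": ("Chemical", "Tripos MOL2 format"),
    ".sdf": ("Chemical", "MDL SDF format"),
    ".smi": ("Chemical", "SMILES chemical format"),
    ".gb": ("Bioinformatics", "NCBI GenBank sequence format"),
    ".gbk": ("Bioinformatics", "NCBI GenBank sequence format"),
    ".fasta": ("Bioinformatics", "FASTA sequence format"),
    ".fa": ("Bioinformatics", "FASTA sequence format"),
    ".fsa": ("Bioinformatics", "FASTA sequence format"),
    ".mpfa": ("Bioinformatics", "FASTA sequence format"),
    ".nex": ("Bioinformatics", "NEXUS phylogenetic data format"),
    ".ndk": ("Bioinformatics", "NEXUS phylogenetic data format"),
}


def identify_format(filename):
    """Return (family, format name) for a file name, or None if unknown."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    return FORMATS_BY_EXTENSION.get(suffix)
```

A lookup like this is only a first pass for sorting deposited files; it says nothing about whether the file content is valid.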


Why data formats?

•  Archiving
   –  Preservation for posterity
•  Storage
   –  Availability for “arbitrary” access
•  Analysis
   –  Availability for processing
•  Transmission
   –  Delivery across hardware, software, and administrative system boundaries


Summary

•  Scientific data formats are closely tied to scientific computing
   –  Data structure, model, and attributes
   –  Self-descriptive with header/metadata
   –  API for manipulating the data
   –  Interoperability: conversion between different formats
•  No one-format-fits-all standard
•  Each standard has one or more tools for creating, editing, and annotating datasets
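
To make "self-descriptive with header/metadata" and "conversion between different formats" concrete, here is a minimal Python sketch that reads FASTA text, where each record starts with a `>` header line, and re-expresses it as JSON. The function names and the sample records are invented for illustration; production work would normally rely on an established bioinformatics library rather than a hand-written parser.

```python
import json


def parse_fasta(text):
    """Parse FASTA text into a list of {"id", "description", "sequence"} dicts."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first token is the identifier, the rest is free text.
            parts = line[1:].strip().split(None, 1)
            current = {
                "id": parts[0] if parts else "",
                "description": parts[1] if len(parts) > 1 else "",
                "sequence": "",
            }
            records.append(current)
        elif current is not None:
            # Sequence lines may wrap, so they are concatenated.
            current["sequence"] += line
    return records


def fasta_to_json(text):
    """Convert FASTA text to a JSON string (a simple format conversion)."""
    return json.dumps(parse_fasta(text), indent=2)


# Hypothetical sample data, not taken from any real database.
SAMPLE = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
TTGA
"""
```

The header line carries the descriptive metadata and travels with the data, which is what lets a different tool interpret the file without outside documentation.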


DATASETS, METADATA, AND DATA MANAGEMENT

What is a dataset?
What are some of the metadata standards for describing datasets?
What is data management?


Dataset classification

Volume
Large-volume
Small-volume



Ecological data example: Instantaneous streamflow by watershed
http://www.hubbardbrook.org/data/dataset.php?id=1


Diabetes data and trends, country-level estimates:
http://apps.nccd.cdc.gov/DDT_STRS2/NationalDiabetesPrevalenceEstimates.aspx?mode=PHY

Diabetes Data & Trends home page:
http://apps.nccd.cdc.gov/ddtstrs/default.aspx

Clinical trials data management:
http://www.clinicaltrials.gov/ct2/show/NCT00006286?term=TADS+NIMH&rank=1


Common in the examples

•  Attributes of a dataset tell users/managers:
   –  What the dataset is about
   –  How the data were collected
   –  To which project the data are related
   –  Who was responsible for data collection
   –  Whom you may contact to obtain the data
   –  What publications the data have generated
   –  ??
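
The attributes above can be sketched as a simple record type. This is one possible shape, not a standard: the class name `DatasetDescription`, its field names, and the `missing_fields` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescription:
    """Descriptive attributes of a dataset, one field per question on the slide."""
    about: str                  # what the dataset is about
    collection_method: str      # how the data were collected
    project: str                # project the data are related to
    collectors: list = field(default_factory=list)    # who collected the data
    contact: str = ""           # whom to contact to obtain the data
    publications: list = field(default_factory=list)  # publications generated

    def missing_fields(self):
        """Return the names of attributes that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Placeholder values; only the title echoes the streamflow example shown earlier.
example = DatasetDescription(
    about="Instantaneous streamflow by watershed",
    collection_method="(to be documented)",
    project="(to be documented)",
)
```

A check such as `example.missing_fields()` gives a curator a quick list of what still needs to be asked of the data producer.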


Metadata standards in medical & health sciences

[Diagram: metadata standards arranged by domain (medical images, bioinformatics, healthcare) and by focus (structure, semantics). Standards shown: DICOM, HL7, GenBank, NCBI Taxonomy, NCBO Bioportal, UMLS, MeSH (Medical Subject Headings), SNOMED CT (Systematized Nomenclature of Medicine--Clinical Terms)]


Research data collections

Size                        Metadata standards         Management
Larger, discipline-based    Multiple, comprehensive    Organized, institutionalized
Smaller, team-based         None or random             Heroic individual inside the team


Research collections

•  Limited processing or long-term management
•  Do not conform to any data standards
•  Varying sizes and formats of data files
•  Low level of processing, lack of plan for data products
•  Low awareness of metadata standards and data management issues


Resource collections

•  Authored by a community of investigators, within a domain of science or engineering
•  Developed with community-level standards
•  Lifetime is between mid- and long-term


Reference collection

•  Example: Global Biodiversity Information Facility
   –  Created by large segments of science community
   –  Conform to robust, well-established and comprehensive standards, e.g.
      •  ABCD (Access to Biological Collection Data)
      •  Darwin Core
      •  DiGIR (Distributed Generic Information Retrieval)
      •  Dublin Core Metadata standard
      •  GGF (Global Grid Forum)
      •  Invasive Alien Species Profile
      •  LSID (Life Sciences Identifier)
      •  OGC (Open Geospatial Consortium)


Datasets, data collections, and data repositories

•  Data collections are built for larger segments of science and engineering
•  Datasets
   –  typically centered around an event or a study
   –  contain a single file or multiple files in various formats
   –  coupled with documentation about the background of data collection and processing

Data repository: system for storing, managing, preserving, and providing access to datasets
•  A repository may contain one or more data collections
•  A data collection may contain one or more datasets
•  A dataset may contain one or more data files
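
The containment relationships on this slide (repository, collection, dataset, file) can be sketched as nested record types. The class and field names below are assumptions chosen for this example and do not describe any particular repository system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    name: str
    file_format: str


@dataclass
class Dataset:
    title: str
    documentation: str = ""  # background on data collection and processing
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class DataRepository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def count_files(self):
        """Count data files across every collection and dataset."""
        return sum(
            len(dataset.files)
            for collection in self.collections
            for dataset in collection.datasets
        )


# Hypothetical content, used only to show the nesting.
repo = DataRepository(
    name="Example repository",
    collections=[
        DataCollection(
            name="Example collection",
            datasets=[
                Dataset(
                    title="Example study",
                    documentation="README describing collection and processing",
                    files=[DataFile("obs.csv", "CSV"), DataFile("site.xml", "XML")],
                )
            ],
        )
    ],
)
```

Keeping the documentation on the dataset rather than on individual files mirrors the slide's point that a dataset couples its files with background information.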

Data management for science research

•  Definition from Wikipedia: http://en.wikipedia.org/wiki/Data_management
•  Key concepts in data management:
   –  Data ownership
   –  Data collection
   –  Data storage
   –  Data protection
   –  Data retention
   –  Data analysis
   –  Data sharing
   –  Data reporting

How do they relate to responsible conduct of research?
http://ori.hhs.gov/images/ddblock/data.pdf


An attempt to define DM

•  In the context of libraries:
   –  Data management is a process in which librarians plan, design, and implement data services to support eScience/eResearch.
   –  Data services that libraries may provide:
      •  Institutional or community data repositories
      •  Data management plans for the pre- and post-award stages of grants
      •  Metadata creation, linking, and discovery
      •  Data archiving, preservation, and curation
      •  Consultation for research groups’ data management projects
      •  Data management and data literacy training for graduate students and faculty


Initiatives in research libraries

Data support and services in institutions: 45%
Libraries involved in supporting eScience: 73%

•  Pressure points:
   –  Lack of resources
   –  Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
   –  Lack of a unifying direction on campus

Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://www.arl.org/bm~doc/escience_report2010.pdf

Data preservation challenges

•  Data formats
   –  Vary in data types, e.g. vector and raster data types
   –  Format conversions, e.g. from an old version to a newer one
•  Data relations
   –  e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
•  Semantic issues
   –  Naming datasets and attributes


Data access challenges

•  Reliability
•  Authenticity
•  Leverage technology to make data access easier and more effective
   –  Cross-database search
   –  Integration applications
   –  “Science-ready” datasets


Supporting digital research data

•  Lifecycle of research data
   –  Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
   –  Edit: organize, annotate, clean, filter…
   –  Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
   –  Publish: disseminate, create portals/databases, associate with literature
   –  Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
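
The lifecycle stages can be written down as an enumeration so that a dataset's current stage is recorded explicitly. The sketch below assumes the stages follow one another in slide order, which is a simplification (real data often loops back from use to edit); the names `Stage` and `next_stage` are invented for this example.

```python
from enum import Enum


class Stage(Enum):
    """Research data lifecycle stages, in the order listed on the slide."""
    CREATE = "create"
    EDIT = "edit"
    USE_REUSE = "use/reuse"
    PUBLISH = "publish"
    PRESERVE_DESTROY = "preserve/destroy"


def next_stage(stage):
    """Return the stage after `stage` in slide order, or None after the last.

    Assumption: a strictly linear lifecycle, which is a simplification.
    """
    stages = list(Stage)  # Enum members iterate in definition order
    index = stages.index(stage)
    return stages[index + 1] if index + 1 < len(stages) else None
```

Recording the stage alongside each dataset makes it easier to see which services (annotation help, repository deposit, retention decisions) apply at a given moment.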


Supporting data management

The data deluge:
•  Numerical, image, video
•  Models, simulations, bit streams
•  XML, CSV, DB, HTML

Researchers need:
•  Specialized search engines to discover the data they need
•  Powerful data mining tools to use and analyze the data


Research data management

[Diagram labels: Community; Institution; eScience librarian; Financial and policy support; Science domain; Data content idiosyncrasies; User requirements]

Evolving and interconnecting: institutional repository, community repository, national repository, international repository


Implications for the scholarly communication process

Publishing
Data publishing; new scholarly publishing models: open access, institutional and community repositories, self-publishing, library publishing, ....

Curation
Maintaining, preserving and adding value to digital research data throughout its lifecycle.

Archiving
The long-term storage, retrieval, and use of scientific data and methods.


Summary

•  E-Science development has raised expectations for research libraries
   –  Working knowledge and skills in e-Science
   –  Focus on process (data and team science) rather than product (reference services)
   –  Proactive, collaborative, integrative, and interdisciplinary


Case Study:
Learning Data Management Needs from Scientists

Gravitational Wave (GW) Research


What is the problem?

•  Tracking data output and workflows is difficult due to lack of provenance data
•  Search of datasets is limited due to lack of specific options
•  Within the LIGO community, data sharing and reuse is difficult without provenance metadata
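
To show what provenance metadata buys in practice, here is a minimal Python sketch in which each record links an output dataset to the inputs and process that produced it, so that the full upstream lineage can be traced. The record fields, the function `trace_lineage`, and the dataset names are all hypothetical; they are not the LIGO community's actual metadata model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProvenanceRecord:
    output: str        # dataset that was produced
    inputs: List[str]  # datasets it was derived from
    process: str       # workflow step that produced it
    agent: str         # person or program that ran the step


def trace_lineage(output, records):
    """Return every upstream dataset that `output` was derived from."""
    by_output = {record.output: record for record in records}
    lineage = []
    seen = set()
    stack = [output]
    while stack:
        current = stack.pop()
        record = by_output.get(current)
        if record is None:
            continue  # no provenance recorded: treat as an original source
        for parent in record.inputs:
            if parent not in seen:
                seen.add(parent)
                lineage.append(parent)
                stack.append(parent)
    return lineage


# Hypothetical workflow: raw data is filtered, then searched against a template set.
records = [
    ProvenanceRecord("filtered", ["raw"], "filter", "analyst A"),
    ProvenanceRecord("candidates", ["filtered", "templates"], "search", "analyst B"),
]
```

With records like these, the question "which inputs and steps produced this result?" becomes a lookup rather than an interview, which is the gap the three bullets above describe.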

Understand the research workflow

•  Interview the scientist
   –  Listening (good listening skills)
   –  Asking questions (don’t be afraid of asking questions)
   –  Use your librarian brain to ingest the conversation:
      •  How does the research flow from one point to the next?
      •  What do the research input and output consist of at each stage of research in terms of data?


Mapping out the knowledge v0.1

Mapping out the knowledge v0.2

Mapping out the knowledge v1.0


Lessons learned

•  Science is learnable even if you don’t have a subject background
   –  Learn enough to understand the research process and workflow
•  Scientists are eager to get help
•  Librarians need to be technical-minded
   –  Data, metadata, database
   –  Structures, models, workflows
•  Librarians need to be good listeners while staying good conversation leaders
   –  Know when and how to lead the conversation to get what you need for data management planning and implementation
   –  Do your homework on the subject so that you can be an intelligent listener


Case Discussions

Case Study #1: To build or not to build a data repository?

A university library has developed an institutional repository for preserving and
providing access to the scholarly output by the researchers in this institution.
A new challenge now arises from e-science research: funding agencies demand data
management plans, and authors and users want publications linked to data. You
already know that some faculty use their disciplinary data repository for
submitting their datasets (e.g., GenBank for microbiology research data). The
problem you face now is whether an institutional data repository should be built
for those who do “small science” and have neither the funding nor the expertise
to manage their data.

Questions to be addressed:
•  What are the strategies you will use to approach the problem?
•  What are the possible solutions for the problem?
•  What are some of the tradeoffs for the solutions you will adopt?


Case study #2: Developing a data taxonomy

The concept of research data management is unfamiliar to many faculty as well as
to your library staff. What is data? What is a data set? These seemingly simple
terms can be very confusing and have different interpretations in different
contexts and disciplines. As part of the data management strategies, you decide
to develop an authoritative data taxonomy for the campus research community.
This data taxonomy will benefit the creation and use of institutional data
policies, data repository or repositories, and data management plans required by
funding agencies.

•  What should the data taxonomy include?
•  What form should it take, a database-driven website or a static HTML page?
•  Who should be the constituencies in this process?
•  Who will be the maintainer once the taxonomy is released?


Case study #3: Developing a data policy

Data policies play an important role in governing how data will be managed,
shared, and accessed. They are also an instrument for fending off potential legal
problems. There are several types of data policy: data access and use, data
publishing, and data management. Your university’s Office of Sponsored Research
has some existing policy on data, but it is neither systematic nor complete.
Many of the terms were defined years ago and do not cover newer areas such as
the embargo period of data. As the university has decided to build a data
repository for managing and preserving datasets, a data policy has become one of
the top priorities for both the institution and the data repository.

•  What should the data policy include?
•  Who should be the constituencies in this process?
•  Who will be the interpretation authority for the data policy?


Case study #4: Cataloging datasets

Describing datasets is the process of creating metadata for datasets. In
scientific disciplines, several metadata standards have been developed, e.g.,
the Content Standard for Digital Geospatial Metadata (CSDGM), Darwin Core,
and Ecological Metadata Language (EML). Each of these metadata standards
contains hundreds of elements and requires both metadata and subject
knowledge training in order to use it. Besides, creating one record using
any of these standards requires a tremendous time investment. But your
library has neither such specialized personnel nor the funds to hire new
staff for the job. The existing staff has some general metadata skills, such as
Dublin Core. In deciding the metadata schema for your data repository, you
need to address these questions:

•  Should I adopt a scientific metadata standard or develop one tailored to our
need?
•  How can I learn what metadata elements are critical to dataset submitters and
searchers?
•  What are some of the benefits and disadvantages for adopting a standard or
developing a local schema?
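
One middle path between a full scientific standard and a purely local schema is a small profile built on Dublin Core, which the case says existing staff already know. The Python sketch below checks a record against the fifteen Dublin Core elements and a short list of required ones. The choice of required elements and the function name `validate_record` are assumptions for illustration, a local policy decision rather than part of any standard.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Assumption: a hypothetical local policy on which elements are mandatory.
REQUIRED_ELEMENTS = {"title", "creator", "date", "identifier"}


def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for name in sorted(REQUIRED_ELEMENTS):
        if not record.get(name):
            problems.append(f"missing required element: {name}")
    for name in sorted(record):
        if name not in DUBLIN_CORE_ELEMENTS:
            problems.append(f"unknown element: {name}")
    return problems
```

Elements flagged as unknown are a useful signal in their own right: if submitters keep adding the same extra field, that field is probably critical to them and a candidate for a local extension.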


Case study #5: Evaluating data repository tools

Research data as a driving force for e-science is inherently a tool-intensive
field. Tools related to data management can be divided into two broad
categories: those for creating metadata records and those for data repository
management. An academic institution decided to build its own data repository
as part of the supporting service for researchers to meet the data management
plan requirement of funding agencies. This data repository development task
was handed down to the library. You, the library director, have to decide whether
to develop an in-house system or use an off-the-shelf software system. As
usual, you put together a taskforce to find a solution to this challenge. The
questions to be addressed by the taskforce include:

•  What are the options available to us?
•  What evaluation criteria are the most important to our goal?
•  What are the limitations for us to adopt one option or the other?
•  How will this option interoperate with the existing institutional repository
system? Or, can the existing repository system be used for data repository
purposes?

