Building Communities and Services in
Support of Data-Intensive Research
Stephen Abrams
University of California Curation Center
California Digital Library
August 20, 2013
Topics
 Data curation
 UC3 services
 DMPTool
 DataUp
 EZID
 Merritt
 WAS
 Collaborative initiatives
 DataShare
 Research Hub
 Conclusions
Why is data curation important?
 Integrity
 Enabling appropriate scrutiny, debate, reproduction, and
verification of results
 Efficiency
 Avoiding needless duplication of effort
 Policy
 Complying with institutional policies, publication requirements,
and funder mandates
“[Data] is a valuable national asset whose value is multiplied when it is made
easily accessible to the public”
– Office of Science and Technology Policy
Why is data curation important?
 Catalyzing
 Promoting progress through new collaborations and creative
(re)use of data
“If I have seen further it is by standing on the shoulders of giants”
– Isaac Newton, 1676
What is the library’s role?
 A continuation of its long-standing mission and practice to
connect patrons with content of interest in meaningful ways
across barriers of space and time
Cf. Tenopir et al. (2012), “Academic librarians and research data services: Preparation and attitudes,” 78th
IFLA General Conference and Assembly, Helsinki, http://conference.ifla.org/past/ifla78/116-tenopir-en.pdf
 Offering solutions that enhance the natural points of
alignment between the scholarly research and information
lifecycles
[Figure: Scholarly lifecycle and Information lifecycle shown side by side. Labels recovered from the figure: Create, Publish, Share, Reuse, Discover, Collect, Preserve, Access, Research, Curation.]
Why is data curation hard?
 Ever increasing number, size, and diversity of content
 Inevitability of disruptive change
 Resources not keeping pace with growth
 Stakeholders outside of traditional cultural heritage domains,
with lots of questions
 Who can give me advice on what I should do?
 How should I describe and package my data?
 How can I cite my data in order to receive
credit for it?
 How can I share my data?
 What can I do with web published data?
…
DMPTool – guidance and resources
Finalist, 2012
DPC Award for
Research and
Innovation
http://dmptool.org/
 Create, edit, and share data
management plans
 Meet funder requirements
 Provide institutional guidance
 Links to local resources
DMPTool – guidance and resources
Two recently
funded projects
 Functional
enhancements
and open source
community
development
Sloan Foundation
 Training and
outreach
IMLS
http://dmptool.org/
 New options for DMP
collaboration and formal
and ad hoc review
 Stronger administrative
control and customization
DataUp – description and packaging
http://dataup.cdlib.org/ http://www.dataup.org/
“It’s easier to augment systems than
it is to change behavior”
Curation for tabular datasets
 Excel add-in
 Azure cloud service
 Best practices check
 Data description
 Identifier and citation generation
 Repository submission to
ONEShare
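As a rough illustration of what a "best practices check" on a tabular dataset might involve, here is a minimal Python sketch. It does not reproduce DataUp's actual rules, which the slides do not list; the function name `check_table` and the two rules (unique, non-empty column headers; complete rows with no blank cells) are assumptions chosen for illustration.

```python
import csv
import io


def check_table(csv_text):
    """Return a list of human-readable warnings for a small CSV table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    warnings = []

    # Illustrative rule 1: column names should be non-empty and unique.
    seen = set()
    for index, name in enumerate(header, start=1):
        if not name.strip():
            warnings.append(f"column {index}: missing header")
        elif name in seen:
            warnings.append(f"column {index}: duplicate header '{name}'")
        seen.add(name)

    # Illustrative rule 2: every row matches the header width and has no blank cells.
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            warnings.append(
                f"line {line_no}: expected {len(header)} cells, found {len(row)}"
            )
        for index, cell in enumerate(row, start=1):
            if not cell.strip():
                warnings.append(f"line {line_no}, column {index}: blank cell")

    return warnings


# A made-up table with a duplicated header, a blank cell, and a short row.
sample = "site,depth_m,depth_m\nA1,3.5,\nA2,4.1\n"
report = check_table(sample)
```

The point of a check like this is that it runs where the researcher already works (here, on a spreadsheet export), so problems surface before the data reaches a repository.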
What researchers don’t need to know
 Schema definition and XML syntax
 Identifier registration procedures
 Citation format
 Repository packaging and submission
 Harvesting for aggregation
2013 Innovation Award winner
Recently funded project
 Functional enhancements and open
source community development
NSF
EZID – identification and citation
http://n2t.net/ezid/
UC3 is a founding
member of the
DataCite consortium
 Mint DOI and
ARK
 Add descriptive
metadata
 Receive QR code
 Global resolution
 Aggregated
discovery
 Updatable
resolution URLs
 Establish and maintain persistent two-way
linkages between the literature and the data
that underlies its results
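The value of minting an identifier with an updatable resolution URL can be shown with a small Python sketch. This is a toy in-memory registry, not the EZID service or its API: the class `TinyRegistry`, the shoulder string `ark:/99999/fk4`, and the example URLs are all assumptions made for illustration.

```python
import itertools


class TinyRegistry:
    """Toy stand-in for an identifier service (hypothetical; not the EZID API)."""

    def __init__(self, shoulder):
        self.shoulder = shoulder            # prefix under which identifiers are minted
        self._counter = itertools.count(1)
        self._records = {}                  # identifier -> metadata dict

    def mint(self, target, **metadata):
        """Create a new identifier that resolves to `target`."""
        identifier = f"{self.shoulder}{next(self._counter):04d}"
        self._records[identifier] = {"target": target, **metadata}
        return identifier

    def update_target(self, identifier, new_target):
        """Repoint an existing identifier; the identifier string never changes."""
        self._records[identifier]["target"] = new_target

    def resolve(self, identifier):
        """Return the URL the identifier currently points to."""
        return self._records[identifier]["target"]


registry = TinyRegistry("ark:/99999/fk4")
dataset_id = registry.mint(
    "http://repo.example.org/datasets/42", title="Example dataset"
)
first_target = registry.resolve(dataset_id)

# The dataset moves to a new home; citations keep using the same identifier.
registry.update_target(dataset_id, "http://archive.example.org/objects/42")
second_target = registry.resolve(dataset_id)
```

Because an article cites the identifier rather than the location, the link from the literature to the data survives a repository migration.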
 Link to dataset in repository
 Link from dataset landing page to article
citing the data
 Link from article back to dataset
 Aggregated discovery via DataShare and Ex Libris Primo
 Later this year, aggregation via T-R Data Citation Index
 SEI for public visibility in leading search engines
Merritt – preservation and access
 Content agnostic,
model free
 Micro-service
architecture
 UI and RESTful API
 26 curatorial units
 271 collections
 325,000 objects
 450,000 versions
 4,500,000 files
 13 TB
http://merritt.cdlib.org/
 Enforceable Data Use Agreements (DUAs) in
response to concerns over potential loss of
control over dissemination and reuse
Open to the UC
community and
external partners
 Dark archive for
long-term
assurance
 Bright archive for
sharing
 Integration with
preservation grids
 Integration with
public access
portals
 Integration with
CMS
 For curatorially-designated collections and
objects, a download request triggers …
 Click-through DUA; acceptance of terms of
use triggers …
From: no-reply-merritt@ucop.edu
Subject: Merritt DUA acceptance
Name: Stephen Abrams
Affiliation: California Digital Library
Collection: UCSF DataShare
Object: Frontotemporal Lobar Degeneration (FTLD)
Date: 2013-05-31 09:50:34 PDT
Terms of use: As part of this agreement, Consumer submits to the following
statements:
(1) I will receive access to de-identified data and will not attempt to establish the
identity of any of the study subjects.
(2) I will share these data only with my immediate co-workers, and I will not transfer
these data to other research groups. I understand that these data are available to
other research groups through the process by which I obtain them.
(3) I will require anyone in my group who utilizes these data, or anyone with whom I
share these data to comply with this data use agreement
...
 Email notification to consumer and curator
 Delivery of requested content
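The click-through sequence above (download request, acceptance of terms, notification, delivery) can be sketched as a small gate in Python. This is not Merritt's implementation: the class `DownloadRequest`, its field names, and the example addresses are hypothetical, and "sending" email is simulated by appending to a list.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadRequest:
    """One consumer's request for one object (all field names are hypothetical)."""
    consumer_email: str
    curator_email: str
    object_title: str
    requires_dua: bool
    accepted_terms: bool = False
    outbox: list = field(default_factory=list)   # (recipient, subject) pairs "sent"

    def accept_terms(self):
        """Record the click-through acceptance of the terms of use."""
        self.accepted_terms = True

    def deliver(self):
        """Return True if content may be delivered, notifying both parties first."""
        if self.requires_dua and not self.accepted_terms:
            return False                         # blocked until the DUA is accepted
        if self.requires_dua:
            subject = f"DUA acceptance: {self.object_title}"
            for recipient in (self.consumer_email, self.curator_email):
                self.outbox.append((recipient, subject))
        return True


request = DownloadRequest(
    "consumer@example.org", "curator@example.org",
    "Example object", requires_dua=True,
)
blocked = request.deliver()      # terms not yet accepted, so nothing is delivered
request.accept_terms()
delivered = request.deliver()    # terms accepted: notify both parties, then deliver
```

Objects that the curator has not designated as needing a DUA pass straight through `deliver()` with no notification, which matches the idea that the agreement applies only to curatorially designated content.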
Web Archiving Service
http://was.cdlib.org/
 Collect, describe,
manage, preserve,
and provide access
to web sites
 Analysis tools
 Full-text search
 27 curatorial units
 185 collections
 10,772 web sites
 97,121 captures
 64 TB
“You can’t study life
in our time without
the Internet, so we
must preserve it”
– René Voorburg, KB
 Initially developed
as part of the
NDIIPP-funded Web
at Risk project
 The web has become the publication platform
of choice
 Source of important primary and secondary
research data
 For example, California water district web sites
supplement UC Davis source water assessment
and protection (SWAP) Merritt collections
Connecting to communities of practice
 Engage with new user communities where and how they
already work
 Shifting user roles, shifting expectations
 Institutional → individual researcher
 Behavioral expectations set by the commercial/mobile web
DataShare – catalyzing science
 UCSF Clinical and Translational Science Institute
http://ctsi.ucsf.edu/
 UCSF Library
http://www.library.ucsf.edu/
 UCSF Center for Imaging of Neurodegenerative Disease
http://www.radiology.ucsf.edu/cind/
http://datashare.ucsf.edu/
“Making data transparent
and available is going to
accelerate all of science;
it's a relatively
inexpensive way to get
more value out of all of
the work that we do”
– Michael Weiner, UCSF
 Pilot project in
biomedical imaging
“The goal is to
catalyze widespread
sharing of scientific
research data”
 Prepare
 Describe
 Upload
 Curate
 Discover
 Share
DataShare – catalyzing science
 UCSF-developed submission client, supporting intuitive
drag & drop operation and metadata entry
 EZID for DOIs; Merritt for preservation
 XTF-based faceted search/browse portal
http://xtf.cdlib.org/
Research Hub – content mgmt and collaboration
 3,900 users
 770 projects
 Alfresco CMS
 Desktop sync
 Mobile apps
 Adobe Creative
Suite
 Personal file
management
 Project
collaboration
 Departmental
resource pooling
 Research data
sharing
“Powerful tools for
management and
collaboration”
 Create
 Organize and enrich
 Keep safe
 Share
http://hub.berkeley.edu/
 UC Berkeley Information Services & Technologies
http://ist.berkeley.edu/
 Primary discovery and access via Research Hub
 EZID for DOIs; Merritt for preservation
 Merritt access called for in succession plans
Data curation
“Access to and sharing of data are essential for the conduct and
advancement of science”
— Arzberger et al. (2004), “Promoting access to public research data for scientific, economic,
and social development,” Data Science Journal 3: 135-52, doi:10.2481/dsj.3.135
 Pro-active curation of research outputs is necessary to ensure
their ongoing viability and use
 Good for research; good for researchers
 Quicker, more innovative science; higher impact factor
 Increasingly necessary for conformance to institutional
policies, publication requirements, and funder mandates
Data curation
 Widespread adoption is dependent on outreach, education,
and minimal intrusion into existing disciplinary workflows and
common community practices
 The most effective – and sustainable – curation services are
composed from best-of-breed components
 Libraries are a natural curation partner for the research
community
For more information
 UC Curation Center
http://www.cdlib.org/uc3/
uc3@ucop.edu
Stephen Abrams David Loy
Patricia Cruse Mark Reyes
Shirin Faenza Joan Starr
Scott Fisher Carly Strasser
Erik Hetzner Marisa Strong
Joshua Hubbard Bhavitavya Vedula
Greg Janée Kenneth Weiss
John Kunze Perry Willett
Rosalie Lack
 DataShare
http://datashare.ucsf.edu/
Geoffrey Boushey Megan Laurance
Anirvan Chatterjee Angela Rizk-Jackson
Maninder Kahlon Michael Weiner
Julia Kochi
 Research Hub
http://hub.berkeley.edu/
Ian Crew Patrick McGrath
Michael McCarthy Noah Wittman

Uc3 pasig-asis&t-2013-08-20-support-of-data-intensive-research


Editor's Notes

  • #2 Copyright © 2013 by The Regents of the University of California. This work is made available under the terms of the Creative Commons Attribution-ShareAlike 3.0 license.