Supporting Data-Rich Research on Many Fronts
21 May 2012
University of California Curation Center
California Digital Library
  
California Digital Library
Serving the University of California
•  10 campuses
•  360K students, faculty, and staff
•  100s of museums, art galleries, observatories, marine centers, botanical gardens
•  5 medical centers
•  5 law schools
•  3 National Laboratories
CDL supports the research lifecycle
•  Collections
•  Digital Special Collections
•  Discovery & Delivery
•  Publishing Group
•  UC Curation Center (UC3)
  
California Digital Library (CDL)
  
Our environment circa 2002-2008
Focus on preservation
For memory organizations
Infrastructure: static
Services: hosted
Content: museum & library
Sustainability: ?
  
Our environment since 2008
Focus on preservation          curation (lifecycle)
For memory organizations       and now data producers
Infrastructure: static         + cloud, VM, bitbucket
Services: hosted               + partnered, self-serve
Content: museum & library      + research, web crawls
Sustainability: ?              cost recovery, pay once
  
Today's journey
Data service basics at CDL
•  Stable storage (Merritt)
•  Stable identifiers (EZID)
•  Data citation (DataCite)
•  Management (DMPTool)
•  Preservation cost modeling
... that enable
•  Federation (DataONE)
•  Data papers
•  Capture (WAS web archiving)
•  Excel add-in (DCXL)
  
The scientific record is at risk
Data dissemination is rare, risky, expensive, labor-intensive, domain-specific, and receives little credit as research output
[Images: Global Change; Galactic Change]
  
The changing landscape
•  Ever increasing number, size, and diversity of content
•  Ever increasing diversity of partners and stakeholders
•  Decreasing resources
•  Inevitability of disruptive change
     – Technology
     – Institutional mission
[Chart: RESOURCES plotted against TIME]
  
Stable storage: Merritt repository
•  Curation repository open to the UC community and beyond
•  Discipline / content agnostic
•  Micro-services architecture
•  Easy-to-use UI or API
•  Hosted or locally deployed
Primary Functions
1. Deposit
2. Manage (metadata, versions, etc.)
3. Access (expose)
4. Share (with other researchers)
5. Preserve
  
EZID: Long term identifiers made easy
•  Precise identification of a dataset (DOI or ARK)
•  Credit to data producers and data publishers
•  A link from the traditional literature to the data (DataCite)
•  Exposure and research metrics for datasets (Web of Knowledge, Google)
Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation
Primary Functions
1. Create persistent identifiers
2. Manage identifiers (and associated metadata) over time
3. Resolve identifiers
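
The three primary functions above (create, manage, resolve) can be illustrated with a toy in-memory registry. This is a minimal sketch and not the EZID service or its API: the class `IdentifierRegistry`, its method names, the identifier string, and the example.org URLs are all hypothetical placeholders.

```python
class IdentifierRegistry:
    """Toy identifier registry (hypothetical; not the EZID API)."""

    def __init__(self):
        # identifier -> {"target": URL, "metadata": {...}}
        self._records = {}

    def create(self, identifier, target, **metadata):
        """Register a new persistent identifier pointing at a target URL."""
        if identifier in self._records:
            raise ValueError(f"{identifier} already exists")
        self._records[identifier] = {"target": target, "metadata": dict(metadata)}
        return identifier

    def manage(self, identifier, target=None, **metadata):
        """Update the target and/or metadata; the identifier itself never changes."""
        record = self._records[identifier]
        if target is not None:
            record["target"] = target
        record["metadata"].update(metadata)

    def resolve(self, identifier):
        """Return the current location of the identified object."""
        return self._records[identifier]["target"]


registry = IdentifierRegistry()
# Placeholder identifier and URLs, for illustration only.
ark = registry.create("ark:/99999/fk4example", "https://example.org/data/v1",
                      title="Sample dataset")
registry.manage(ark, target="https://example.org/data/v2")
print(registry.resolve(ark))
```

The point of the sketch is that a citation keeps using the same identifier while the location behind it is updated over time.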
  
Discovery: DataCite consortium
•  Technische Informationsbibliothek (TIB), Germany
•  Australian National Data Service (ANDS)
•  The British Library
•  California Digital Library, USA
•  Canada Institute for Scientific and Technical Information (CISTI)
•  L'Institut de l'Information Scientifique et Technique (INIST), France
•  Library of the ETH Zürich
•  Library of TU Delft, The Netherlands
•  Office of Scientific and Technical Information, US Department of Energy
•  Purdue University, USA
•  Technical Information Center of Denmark
  
DMPTool
Meeting funding agencies' data management plan requirements
•  Connect researchers to resources to create a data management plan
•  NSF and directorates, NIH, NEH, IMLS, foundations plus
•  Customizable
Primary Functions
1. Step-by-step "wizard"
2. Templates and examples
3. Links to institutional resources and agency information
4. Plan publication and sharing
  
Number of Plans Created
Oct 2011 – Feb 2012
  
Cost Model 1: Pay as you go
•  Billed/paid annually
      { P   if year = 0
      { 0   if year > 0
   –  Costs for archival System (A), Workflows (W), Content Types (C), Monitoring (M), and Interventions (V) are considered common goods, and are apportioned equally across all n Producers (P)
        •  Model components are represented by two terms: the number of units and the per-unit cost, e.g., k·S
   –  Storage cost (S) accounted on a per-Producer basis
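
A minimal sketch of the pay-as-you-go arithmetic. The slide does not spell out the full formula, so this assumes the annual bill is an equal share of the common-good costs (A, W, C, M, V), plus per-Producer storage k·S, plus a Producer cost P charged only in year 0. The function name, parameter names, and sample figures are hypothetical.

```python
def pay_as_you_go(year, common_costs, n_producers,
                  storage_units, storage_unit_cost, producer_cost):
    """Annual bill for one Producer under the assumed pay-as-you-go form."""
    # Common goods (A, W, C, M, V) are apportioned equally across n Producers.
    shared = sum(common_costs) / n_producers
    # Storage is accounted per Producer: number of units k times unit cost S.
    storage = storage_units * storage_unit_cost
    # Producer cost P applies in year 0 only, and is 0 in later years.
    onboarding = producer_cost if year == 0 else 0.0
    return shared + storage + onboarding


# Made-up figures for A, W, C, M, V, shared by 10 Producers.
common = [100.0, 50.0, 30.0, 10.0, 10.0]
first_year = pay_as_you_go(0, common, 10, 4, 5.0, 25.0)
later_year = pay_as_you_go(1, common, 10, 4, 5.0, 25.0)
print(first_year, later_year)
```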
  
Model 2: Pay once, preserve for "T" years
•  Paid-up price for fixed term T
     –  A function of r, the annual investment return, and d, the annual decrease in unit cost of preservation
     –  G is the cost of providing a year's preservation service; G0 includes the added first year expense of Producer engagement and registration
     –  Setting T = ∞ calculates the price for "forever"
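
A minimal sketch of a paid-up price calculation. The slide's own formula is not shown, so this assumes one plausible form: year 0 costs G0, and each later year t costs G scaled by ((1 - d) / (1 + r))^t, so that falling unit costs and investment returns both shrink the money that must be set aside today. The function name and the sample values are hypothetical.

```python
import math


def paid_up_price(g0, g, r, d, term_years):
    """One-time price covering term_years of preservation (assumed form)."""
    # Combined yearly factor: unit cost falls by d, money grows by r.
    ratio = (1.0 - d) / (1.0 + r)
    if math.isinf(term_years):
        # "Forever" is a geometric series; it only converges if ratio < 1.
        if ratio >= 1.0:
            raise ValueError("price diverges unless (1 - d) / (1 + r) < 1")
        return g0 + g * ratio / (1.0 - ratio)
    # Finite term: year 0 costs g0, years 1 .. T-1 cost the discounted g.
    return g0 + sum(g * ratio ** t for t in range(1, int(term_years)))


# Made-up values: G0 = 100, G = 50, no investment return, unit cost halves yearly.
three_years = paid_up_price(100.0, 50.0, 0.0, 0.5, 3)
forever = paid_up_price(100.0, 50.0, 0.0, 0.5, math.inf)
print(three_years, forever)
```

Under this assumed form, the "forever" price stays finite whenever unit costs fall or investments earn a return, which is what makes a pay-once offer computable at all.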
  
New distributed framework
Flexible, scalable, sustainable network
Member Nodes
•  diverse institutions
•  serve local community
•  provide resources for managing their data
Coordinating Nodes
•  retain complete metadata catalog
•  subset of all data
•  perform basic indexing
•  provide network-wide services
•  ensure data availability (preservation)
•  provide replication services
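
The division of labor between the two node types can be illustrated with a toy model. This is a sketch under stated assumptions and not DataONE software or its API: the classes `MemberNode` and `CoordinatingNode`, their methods, and the sample identifier are hypothetical.

```python
class MemberNode:
    """Hypothetical member node: holds the data its local community deposits."""

    def __init__(self, name):
        self.name = name
        self.objects = {}  # pid -> data bytes


class CoordinatingNode:
    """Hypothetical coordinating node: full metadata catalog plus replication."""

    def __init__(self, members):
        self.members = list(members)
        self.catalog = {}  # pid -> metadata for every object in the network

    def register(self, member, pid, data, metadata):
        """Store data at a member node and record its metadata centrally."""
        member.objects[pid] = data
        self.catalog[pid] = dict(metadata)

    def holders(self, pid):
        """Names of the member nodes that currently hold a copy."""
        return [m.name for m in self.members if pid in m.objects]

    def replicate(self, pid, copies):
        """Copy an object to further member nodes until `copies` exist."""
        source = next(m for m in self.members if pid in m.objects)
        for member in self.members:
            if len(self.holders(pid)) >= copies:
                break
            if pid not in member.objects:
                member.objects[pid] = source.objects[pid]
        return self.holders(pid)


a, b, c = MemberNode("A"), MemberNode("B"), MemberNode("C")
cn = CoordinatingNode([a, b, c])
# Placeholder identifier and metadata, for illustration only.
cn.register(a, "doi:10.5555/example", b"raw bytes", {"title": "Sample dataset"})
print(cn.replicate("doi:10.5555/example", copies=2))
```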
  
Traditional articles vs data papers
  
The collective data product
  
Need to save data + processing
Algorithms + Data Structures = Programs
  
Vision for a "data paper"
•  Wrap the unfamiliar in a familiar façade
•  A "data paper" is minimally a cover sheet and a set of links to archived artifacts
•  Cover sheet contains familiar elements: title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)
•  Just enough to permit basic exposure and discovery
     –  Building a basic data citation
     –  Indexing by services such as Web of Science, Google Scholar
     –  Instilling confidence in the identifier's stability
  
43 public archives
120+ archives total
58K crawls
7,500+ sites
600 million+ URLs
40+ TB
24 institutions
Developed with LoC support by CDL, UNT, and others

What are people using WAS for?
Archiving at-risk government websites and publications
Archiving their own university domains
Building web archives to complement library collections
Documenting web coverage of significant events
  
Data curation for Excel
•  Excel is the database of choice for many researchers
•  Make it easy to share, archive, and publish data
•  Keep up to date at dcxl.cdlib.org
Surveyed users and found:
•  Most researchers are unaware of preservation options
•  Documentation practices are poor
•  Excel is just one tool in workflows
Primary Functions
1. An Excel add-in and web application
2. Metadata description (through extraction and augmentation)
3. Check for good data practices
4. Transfer to repository
  
A data curation approach at CDL
•  New "data paper" publishing model [GBMF]
•  DataCite consortium and citation standards
•  Other fronts:
   •  DataONE global data network [NSF]
   •  Merritt: general-purpose data repository
   •  EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids
   •  Data management plan generator
   •  Web archiving service [Library of Congress]
   •  Open-source Excel add-in [MS Research & GBMF]
  
Questions?
John.Kunze@ucop.edu
California Digital Library
http://www.cdlib.org/
  
