Supporting Data-Rich Research on Many Fronts


Published in: Business, Education

  1. Supporting Data-Rich Research on Many Fronts. 21 May 2012. University of California Curation Center, California Digital Library.
  2. California Digital Library
     Serving the University of California:
     • 10 campuses
     • 360K students, faculty, and staff
     • 100's of museums, art galleries, observatories, marine centers, botanical gardens
     • 5 medical centers
     • 5 law schools
     • 3 National Laboratories
     CDL supports the research lifecycle:
     • Collections
     • Digital Special Collections
     • Discovery & Delivery
     • Publishing Group
     • UC Curation Center (UC3)
  3. California Digital Library (CDL)
  4. Our environment circa 2002-2008
     • Focus on preservation
     • For memory organizations
     • Infrastructure: static
     • Services: hosted
     • Content: museum & library
     • Sustainability: ?
  5. Our environment since 2008
     • Focus on preservation, now curation (lifecycle)
     • For memory organizations, and now data producers
     • Infrastructure: static + cloud, VM, bitbucket
     • Services: hosted + partnered, self-serve
     • Content: museum & library + research, web crawls
     • Sustainability: ? cost recovery, pay once
  6. Today's journey
     Data service basics at CDL:
     • Stable storage (Merritt)
     • Stable identifiers (EZID)
     • Data citation (DataCite)
     • Management (DMPTool)
     • Preservation cost modeling
     ... that enable:
     • Federation (DataONE)
     • Data papers
     • Capture (WAS web archiving)
     • Excel add-in (DCXL)
  7. The scientific record is at risk
     Data dissemination is rare, risky, expensive, labor-intensive, domain-specific, and receives little credit as research output.
     [Figure labels: Global Change; Galactic Change]
  8. The changing landscape
     • Ever increasing number, size, and diversity of content
     • Ever increasing diversity of partners and stakeholders
     • Decreasing resources
     • Inevitability of disruptive change
        – Technology
        – Institutional mission
     [Chart axes: Resources vs. Time]
  9. Stable storage: Merritt repository
     • Curation repository open to the UC community and beyond
     • Discipline / content agnostic
     • Micro-services architecture
     • Easy-to-use UI or API
     • Hosted or locally deployed
     Primary Functions:
     1. Deposit
     2. Manage (metadata, versions, etc.)
     3. Access (expose)
     4. Share (with other researchers)
     5. Preserve
  10. EZID: Long term identifiers made easy
     • Precise identification of a dataset (DOI or ARK)
     • Credit to data producers and data publishers
     • A link from the traditional literature to the data (DataCite)
     • Exposure and research metrics for datasets (Web of Knowledge, Google)
     Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation.
     Primary Functions:
     1. Create persistent identifiers
     2. Manage identifiers (and associated metadata) over time
     3. Resolve identifiers
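The three primary functions listed for EZID (create, manage, resolve) can be illustrated with a toy in-memory registry. This is only a sketch of the idea and not the EZID API: the class `ToyIdentifierRegistry`, its method names, and the ARK-style prefix `ark:/99999/fk4` used as a placeholder are all assumptions made for illustration.

```python
from itertools import count


class ToyIdentifierRegistry:
    """Hypothetical in-memory stand-in for a persistent identifier service."""

    def __init__(self, prefix="ark:/99999/fk4"):
        # The prefix is a placeholder, not a real production namespace.
        self._prefix = prefix
        self._counter = count(1)
        self._records = {}

    def create(self, target, **metadata):
        """Mint a new identifier pointing at `target` and store its metadata."""
        identifier = f"{self._prefix}{next(self._counter):06d}"
        self._records[identifier] = {"target": target, **metadata}
        return identifier

    def manage(self, identifier, **changes):
        """Update the target or metadata; the identifier string never changes."""
        self._records[identifier].update(changes)

    def resolve(self, identifier):
        """Return the current location recorded for the identifier."""
        return self._records[identifier]["target"]


registry = ToyIdentifierRegistry()
dataset_id = registry.create("http://example.org/data/v1", title="Example dataset")
registry.manage(dataset_id, target="http://example.org/data/v2")
print(dataset_id, "->", registry.resolve(dataset_id))
```

The point of the sketch is the decoupling: the dataset can move (the target changes) while the identifier cited in the literature stays stable.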
  11. Discovery: DataCite consortium
     • Technische Informationsbibliothek (TIB), Germany
     • Canada Institute for Scientific and Technical Information (CISTI)
     • L'Institut de l'Information Scientifique et Technique (INIST), France
     • Australian National Data Service (ANDS)
     • The British Library
     • Library of the ETH Zürich
     • California Digital Library, USA
     • Library of TU Delft, The Netherlands
     • Office of Scientific and Technical Information, US Department of Energy
     • Purdue University, USA
     • Technical Information Center of Denmark
  12. DMPTool: Meeting funding agencies' data management plan requirements
     • Connect researchers to resources to create a data management plan
     • NSF and directorates, NIH, NEH, IMLS, foundations plus
     • Customizable
     Primary Functions:
     1. Step-by-step "wizard"
     2. Templates and examples
     3. Links to institutional resources and agency information
     4. Plan publication and sharing
  13. Number of Plans Created, Oct 2011 – Feb 2012
  14. Cost Model 1: Pay as you go
     • Billed/paid annually
        [Equation residue from the slide: { P if year = 0; 0 if year > 0 }]
        – Costs for archival System (A), Workflows (W), Content Types (C), Monitoring (M), and Interventions (V) are considered common goods, and are apportioned equally across all n Producers (P)
        – Storage cost (S) is accounted on a per-Producer basis
     • Model components are represented by two terms: the number of units and the per-unit cost, e.g., k·S
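The slide's equation did not survive extraction, so the following is a minimal sketch of how the description could be turned into an annual charge, assuming the shared costs A, W, C, M, and V are simply summed and divided by n, and that storage is billed as k units times a per-unit cost S. The names `SharedCosts` and `annual_charge` are hypothetical, and the exact form of the original formula may differ.

```python
from dataclasses import dataclass


@dataclass
class SharedCosts:
    """Annual common-good costs, apportioned equally across all Producers."""
    system: float         # A: archival system
    workflows: float      # W
    content_types: float  # C
    monitoring: float     # M
    interventions: float  # V

    def total(self) -> float:
        return (self.system + self.workflows + self.content_types
                + self.monitoring + self.interventions)


def annual_charge(shared: SharedCosts, n_producers: int,
                  storage_units: float, storage_unit_cost: float) -> float:
    """Assumed form: (A + W + C + M + V) / n + k * S for one Producer."""
    if n_producers <= 0:
        raise ValueError("n_producers must be positive")
    return shared.total() / n_producers + storage_units * storage_unit_cost


shared = SharedCosts(system=100.0, workflows=50.0, content_types=20.0,
                     monitoring=20.0, interventions=10.0)
# Four Producers share the common costs; this one stores 10 units at 2.5 each.
print(annual_charge(shared, n_producers=4, storage_units=10, storage_unit_cost=2.5))
```

The figures are made up for illustration only; they are not CDL prices.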
  15. Model 2: Pay once, preserve for "T" years
     • Paid-up price for fixed term T
        – A function of r, the annual investment return, and d, the annual decrease in unit cost of preservation
        – G is the cost of providing a year's preservation service; G0 includes the added first year expense of Producer engagement and registration
        – Setting T = ∞ calculates the price for "forever"
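The paid-up price formula itself is missing from the transcript. As a hedged illustration of how r, d, G, G0, and T could interact, the sketch below uses a standard discounted-sum assumption: the first year costs G0 up front, and each later year t costs G shrunk by (1 - d)^t and discounted by (1 + r)^t. This is an assumed form, not necessarily the formula on the slide, and `paid_up_price` is a hypothetical name.

```python
from typing import Optional


def paid_up_price(g0: float, g: float, r: float, d: float,
                  years: Optional[int] = None) -> float:
    """Price paid once to cover `years` of preservation (None means forever).

    Assumption: year 0 costs g0; year t >= 1 costs g * ((1 - d) / (1 + r)) ** t
    in present-value terms.
    """
    q = (1.0 - d) / (1.0 + r)  # combined yearly shrink factor
    if years is None:
        if q >= 1.0:
            raise ValueError("price for 'forever' is unbounded unless (1 - d) < (1 + r)")
        # Closed form of the geometric series q + q^2 + q^3 + ...
        return g0 + g * q / (1.0 - q)
    if years < 1:
        raise ValueError("years must be at least 1")
    return g0 + sum(g * q ** t for t in range(1, years))


# Illustrative numbers only: costs halve each year, no investment return.
print(paid_up_price(g0=150.0, g=100.0, r=0.0, d=0.5, years=3))
print(paid_up_price(g0=150.0, g=100.0, r=0.0, d=0.5))  # T = infinity
```

Under this assumption the "forever" price is finite whenever costs fall or the endowment earns a return, which is what makes a pay-once offer plausible.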
  16. New distributed framework: flexible, scalable, sustainable network
     Coordinating Nodes:
     • retain complete metadata catalog
     • subset of all data
     • perform basic indexing
     • provide network-wide services
     • ensure data availability (preservation)
     • provide replication services
     Member Nodes:
     • diverse institutions
     • serve local community
     • provide resources for managing their data
  17. Traditional articles vs data papers
  18. The collective data product
  19. Need to save data + processing: Algorithms + Data Structures = Programs
  20. Vision for a "data paper"
     • Wrap the unfamiliar in a familiar façade
     • A "data paper" is minimally a cover sheet and a set of links to archived artifacts
     • Cover sheet contains familiar elements: title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)
     • Just enough to permit basic exposure and discovery
        – Building a basic data citation
        – Indexing by services such as Web of Science, Google Scholar
        – Instilling confidence in the identifier's stability
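The cover-sheet idea maps naturally onto a small record type. The sketch below is an assumption-laden illustration: the `CoverSheet` class, the `basic_citation` helper, and the "Authors (Year): Title. Publisher. Identifier" layout are choices made here, not a format prescribed by the slides, and the DOI shown is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """Minimal 'data paper': familiar elements plus links to archived artifacts."""
    title: str
    date: str                 # publication year, kept as text
    authors: List[str]
    abstract: str
    identifier: str           # persistent identifier (DOI, ARK, etc.)
    links: List[str] = field(default_factory=list)


def basic_citation(sheet: CoverSheet, publisher: str) -> str:
    """Assemble a basic data citation from the cover-sheet elements."""
    creators = "; ".join(sheet.authors)
    return f"{creators} ({sheet.date}): {sheet.title}. {publisher}. {sheet.identifier}"


sheet = CoverSheet(
    title="Example Dataset",
    date="2012",
    authors=["Doe, J.", "Roe, R."],
    abstract="Placeholder abstract.",
    identifier="doi:10.5072/FK2EXAMPLE",  # placeholder identifier
    links=["http://example.org/archive/example-dataset"],
)
print(basic_citation(sheet, publisher="Example Repository"))
```

Everything an indexing service needs is on the cover sheet; the data itself stays behind the links.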
  21. Web Archiving Service by the numbers: 43 public archives; 120+ archives total; 58K crawls; 7,500+ sites; 600 million+ URLs; 40+ TB; 24 institutions. Developed with LoC support by CDL, UNT, and others.
  22. What are people using WAS for?
     • Archiving at-risk government websites and publications
     • Archiving their own university domains
     • Building web archives to complement library collections
     • Documenting web coverage of significant events
  23. Data curation for Excel
     • Excel is the database of choice for many researchers
     • Make it easy to share, archive, and publish data
     • Keep up to date at
     Surveyed users and found:
     • Most researchers are unaware of preservation options
     • Documentation practices are poor
     • Excel is just one tool in workflows
     Primary Functions:
     1. An Excel add-in and web application
     2. Metadata description (through extraction and augmentation)
     3. Check for good data practices
     4. Transfer to repository
  24. A data curation approach at CDL
     • New "data paper" publishing model [GBMF]
     • DataCite consortium and citation standards
     • Other fronts:
        • DataONE global data network [NSF]
        • Merritt: general-purpose data repository
        • EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids
        • Data management plan generator
        • Web archiving service [Library of Congress]
        • Open-source Excel add-in [MS Research & GBMF]
  25. Questions? California Digital Library http://