Supporting Data-Rich Research on Many Fronts


Published in: Business, Education

  1. Supporting Data-Rich Research on Many Fronts. 21 May 2012. University of California Curation Center, California Digital Library
  2. California Digital Library, serving the University of California: • 10 campuses • 360K students, faculty, and staff • 100s of museums, art galleries, observatories, marine centers, botanical gardens • 5 medical centers • 5 law schools • 3 National Laboratories. CDL supports the research lifecycle: • Collections • Digital Special Collections • Discovery & Delivery • Publishing Group • UC Curation Center (UC3)
  3. California Digital Library (CDL)
  4. Our environment circa 2002-2008. Focus on preservation; for memory organizations; Infrastructure: static; Services: hosted; Content: museum & library; Sustainability: ?
  5. Our environment since 2008. Focus on preservation, curation (lifecycle); for memory organizations and now data producers; Infrastructure: static + cloud, VM, bitbucket; Services: hosted + partnered, self-serve; Content: museum & library + research, web crawls; Sustainability: ? cost recovery, pay once
  6. Today's journey. Data service basics at CDL: • Stable storage (Merritt) • Stable identifiers (EZID) • Data citation (DataCite) • Management (DMPTool) • Preservation cost modeling ... that enable: • Federation (DataONE) • Data papers • Capture (WAS web archiving) • Excel add-in (DCXL)
  7. The scientific record is at risk. Data dissemination is rare, risky, expensive, labor-intensive, domain-specific, and receives little credit as research output. Global Change • Galactic Change
  8. The changing landscape • Ever increasing number, size, and diversity of content • Ever increasing diversity of partners and stakeholders • Decreasing resources • Inevitability of disruptive change – Technology – Institutional mission (Chart axes: Resources, Time)
  9. Stable storage: Merritt repository • Curation repository open to the UC community and beyond • Discipline / content agnostic • Micro-services architecture • Easy-to-use UI or API • Hosted or locally deployed. Primary Functions: 1. Deposit 2. Manage (metadata, versions, etc.) 3. Access (expose) 4. Share (with other researchers) 5. Preserve
  10. EZID: Long term identifiers made easy • Precise identification of a dataset (DOI or ARK) • Credit to data producers and data publishers • A link from the traditional literature to the data (DataCite) • Exposure and research metrics for datasets (Web of Knowledge, Google). Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation. Primary Functions: 1. Create persistent identifiers 2. Manage identifiers (and associated metadata) over time 3. Resolve identifiers
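
The "resolve identifiers" function on slide 10 can be illustrated with a minimal sketch that turns a DOI or ARK string into a resolver URL. It assumes the public resolvers doi.org and n2t.net, which the slide does not name, and `resolver_url` is a hypothetical helper rather than part of EZID's own API.

```python
def resolver_url(identifier: str) -> str:
    """Map a 'doi:' or 'ark:' identifier to a resolver URL.

    Assumption: DOIs resolve through https://doi.org/ and ARKs through
    https://n2t.net/. This helper is illustrative only; it is not EZID's API.
    """
    ident = identifier.strip()
    scheme = ident[:4].lower()
    if scheme == "doi:":
        # The DOI resolver expects the bare DOI name, without the scheme label.
        return "https://doi.org/" + ident[4:]
    if scheme == "ark:":
        # ARKs keep their 'ark:' label when appended to the resolver address.
        return "https://n2t.net/" + ident
    raise ValueError(f"unsupported identifier scheme: {identifier!r}")


if __name__ == "__main__":
    # Placeholder identifiers, not real datasets.
    print(resolver_url("doi:10.5072/EXAMPLE"))
    print(resolver_url("ark:/99999/fk4example"))
```

Keeping the scheme check separate from the URL construction reflects the slide's "scheme-agnostic" framing: supporting another identifier type would mean adding one more branch.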
  11. Discovery: DataCite consortium • Technische Informationsbibliothek (TIB), Germany • Canada Institute for Scientific and Technical Information (CISTI) • L'Institut de l'Information Scientifique et Technique (INIST), France • Australian National Data Service (ANDS) • The British Library • Library of the ETH Zürich • California Digital Library, USA • Library of TU Delft, The Netherlands • Office of Scientific and Technical Information, US Department of Energy • Purdue University, USA • Technical Information Center of Denmark
  12. DMPTool: Meeting funding agencies' data management plan requirements • Connect researchers to resources to create a data management plan • NSF and directorates, NIH, NEH, IMLS, foundations plus • Customizable. Primary Functions: 1. Step-by-step "wizard" 2. Templates and examples 3. Links to institutional resources and agency information 4. Plan publication and sharing
  13. Number of Plans Created, Oct 2011 – Feb 2012
  14. Cost Model 1: Pay as you go • Billed/paid annually (formula not preserved in the transcript; its surviving fragment is a term equal to P if year = 0 and 0 if year > 0) – Costs for archival System (A), Workflows (W), Content Types (C), Monitoring (M), and Interventions (V) are considered common goods, and are apportioned equally across all n Producers (P) – Storage cost (S) accounted on a per-Producer basis • Model components are represented by two terms: the number of units and the per-unit cost, e.g., k·S
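
The pay-as-you-go description on slide 14 can be sketched in code. The slide's own formula did not survive in the transcript, so the way the terms are combined below is an assumption: the common-good costs are split equally across n Producers, storage is charged as units times unit cost (k·S), and the P term is added only in year 0. The function name and parameters are hypothetical.

```python
from typing import Sequence


def pay_as_you_go_charge(
    year: int,
    n_producers: int,
    common_costs: Sequence[float],
    storage_units: float,
    unit_storage_cost: float,
    first_year_cost: float,
) -> float:
    """Annual charge to one Producer under an ASSUMED reading of Model 1.

    common_costs:      annual totals for A, W, C, M and V (shared equally)
    storage_units:     k, the Producer's own number of storage units
    unit_storage_cost: S, the per-unit storage cost
    first_year_cost:   P, applied only when year == 0
    """
    if n_producers <= 0:
        raise ValueError("n_producers must be positive")
    if year < 0:
        raise ValueError("year must be non-negative")
    shared = sum(common_costs) / n_producers      # common goods, split n ways
    storage = storage_units * unit_storage_cost   # k * S, per Producer
    one_time = first_year_cost if year == 0 else 0.0
    return shared + storage + one_time


if __name__ == "__main__":
    # Illustrative numbers only; none of them come from the slides.
    costs = [1000.0, 200.0, 300.0, 400.0, 100.0]  # A, W, C, M, V
    for y in range(3):
        print(y, pay_as_you_go_charge(y, 10, costs, 5, 20.0, 50.0))
```

Under this reading, a Producer's bill falls after year 0 and then moves only when the number of Producers, the shared costs, or its own storage use changes.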
  15. Model 2: Pay once, preserve for "T" years • Paid-up price for fixed term T – A function of r, the annual investment return, and d, the annual decrease in unit cost of preservation – G is the cost of providing a year's preservation service; G0 includes the added first year expense of Producer engagement and registration – Setting T = ∞ calculates the price for "forever"
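
One way to read the pay-once model on slide 15 is as a discounted sum of yearly costs. The slide's formula is not in the transcript, so the form below is an assumption: year 0 costs G0, and each later year t costs G scaled by ((1 - d) / (1 + r))^t, which combines the falling unit cost with the return earned on money paid up front. The function name is hypothetical.

```python
import math


def pay_once_price(g0: float, g: float, r: float, d: float,
                   term_years: float = math.inf) -> float:
    """Paid-up price for term T under an ASSUMED discounted-sum model.

    g0: first-year cost (G0), including Producer engagement and registration
    g:  cost of one year of preservation service (G)
    r:  annual investment return
    d:  annual decrease in the unit cost of preservation
    term_years: a whole number of years >= 1, or math.inf for "forever"
    """
    q = (1.0 - d) / (1.0 + r)  # per-year factor applied to G
    if q < 0:
        raise ValueError("need d <= 1 and r > -1")
    if term_years == math.inf:
        if q >= 1:
            raise ValueError("an infinite term needs (1 - d) / (1 + r) < 1")
        # Closed form of the geometric series sum_{t >= 1} g * q**t.
        return g0 + g * q / (1.0 - q)
    if term_years < 1 or term_years != int(term_years):
        raise ValueError("term_years must be a whole number >= 1")
    return g0 + sum(g * q ** t for t in range(1, int(term_years)))


if __name__ == "__main__":
    # Illustrative numbers only; none of them come from the slides.
    for term in (1, 10, 50, math.inf):
        print(term, round(pay_once_price(150.0, 100.0, 0.05, 0.05, term), 2))
```

The infinite-term price is finite only when the per-year factor is below 1, which is why a price for "forever" can be quoted at all under these assumptions.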
  16. New distributed framework: flexible, scalable, sustainable network. Member Nodes: • diverse institutions • serve local community • provide resources for managing their data. Coordinating Nodes: • retain complete metadata catalog • subset of all data • perform basic indexing • provide network-wide services • ensure data availability (preservation) • provide replication services
  17. Traditional articles vs data papers
  18. The collective data product
  19. Need to save data + processing. Algorithms + Data Structures = Programs
  20. Vision for a "data paper" • Wrap the unfamiliar in a familiar façade • A "data paper" is minimally a cover sheet and a set of links to archived artifacts • Cover sheet contains familiar elements: title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.) • Just enough to permit basic exposure and discovery – Building a basic data citation – Indexing by services such as Web of Science, Google Scholar – Instilling confidence in the identifier's stability
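
The cover sheet described on slide 20 maps naturally onto a small record type. The sketch below is illustrative only: the `CoverSheet` class, its field names, and the citation layout are assumptions, and the date is reduced to a publication year for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """Minimal 'data paper' cover sheet (hypothetical structure)."""
    title: str
    year: int                 # the slide lists "date"; a year is used here
    authors: List[str]
    abstract: str
    identifier: str           # persistent identifier, e.g. a DOI or ARK
    artifact_links: List[str] = field(default_factory=list)


def basic_citation(sheet: CoverSheet) -> str:
    """Build a basic data citation from the cover sheet elements.

    The 'Authors (Year): Title. Identifier' layout is an assumption,
    not a format prescribed by the slides.
    """
    if not sheet.authors:
        raise ValueError("a citation needs at least one author")
    authors = "; ".join(sheet.authors)
    return f"{authors} ({sheet.year}): {sheet.title}. {sheet.identifier}"


if __name__ == "__main__":
    # Placeholder values, not a real dataset.
    sheet = CoverSheet(
        title="Example dataset",
        year=2012,
        authors=["Doe, J.", "Roe, R."],
        abstract="Placeholder abstract.",
        identifier="doi:10.5072/EXAMPLE",
        artifact_links=["https://example.org/archive/data.csv"],
    )
    print(basic_citation(sheet))
```

Keeping the artifact links separate from the citable fields mirrors the slide's split between the familiar cover sheet and the archived artifacts it points to.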
  21. 43 public archives • 120+ archives total • 58K crawls • 7,500+ sites • 600 million+ URLs • 40+ TB • 24 institutions. Developed with LoC support by CDL, UNT, and others
  22. What are people using WAS for? • Archiving at-risk government websites and publications • Archiving their own university domains • Building web archives to complement library collections • Documenting web coverage of significant events
  23. Data curation for Excel • Excel is the database of choice for many researchers • Make it easy to share, archive, and publish data • Keep up to date at (link not captured in the transcript). Surveyed users and found: • Most researchers are unaware of preservation options • Documentation practices are poor • Excel is just one tool in workflows. Primary Functions: 1. An Excel add-in and web application 2. Metadata description (through extraction and augmentation) 3. Check for good data practices 4. Transfer to repository
  24. A data curation approach at CDL • New "data paper" publishing model [GBMF] • DataCite consortium and citation standards • Other fronts: – DataONE global data network [NSF] – Merritt: general-purpose data repository – EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids – Data management plan generator – Web archiving service [Library of Congress] – Open-source Excel add-in [MS Research & GBMF]
  25. Questions? California Digital Library http://