Covers basic concepts of data and data management and a few examples of data services provided by universities.

  • 1. Scientific Data Management
      A tutorial at ICADL 2011
      October 24, 2011
      Jian Qin
      School of Information Studies
      Syracuse University
      http://
  • 2. The morning ahead
      An environmental scan
      • E-Science, cyberinfrastructure, and data
      • What do all these have to do with me?
      Case study: The gravitational wave research data management
      Group work: Role play in developing data management initiatives
      12/18/11 15:51 Overview of E-Science 2
  • 3. An environmental scan
      • E-Science, cyberinfrastructure, and data
      • What do all these have to do with me?
      Overview of E-Science
      Characteristics of e-science
      Data sets, data collections, and data repositories
      Why does it matter to libraries?
  • 4. E-Science
      “In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet.”
      National e-Science Center. (2008). Defining e-Science. http://
  • 5. E-Infrastructure for the research lifecycle
      http:// 3857/science_lifecycle_STFC_poster1.PDF
  • 6. Shift in Science Paradigms
      • Thousand years ago: Science was empirical, describing natural phenomena
      • A few hundred years ago: Theoretical branch, using models, generalizations
      • A few decades ago: A computational approach, simulating complex phenomena
      • Today: Data exploration (eScience), unify theory, experiment, and simulation
          – Data captured by instruments or generated by simulator
          – Processed by software
          – Information/Knowledge stored in computer
          – Scientist analyzes database/files using data management and statistics
      Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  • 8. X-Info
      • The evolution of X-Info and Comp-X for each discipline X
      • How to codify and represent our knowledge
      Diagram labels: Experiments & Instruments; Other Archives; Literature; Simulations; facts; questions; answers; ?
      The Generic Problems
      • Data ingest
      • Managing a petabyte
      • Query and Vis tools
      • Building and executing models
      • Common schema
      • How to organize it
      • Integrating data and Literature
      • Documenting experiments
      • How to reorganize it
      • How to share with others
      • Curation and long-term preservation
      Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  • 9. Useful resources
      • What is eScience?
      • eScience Initiatives
      • Science Research and Data
      • Science Data Management
      • Literature Reviews
      • Data Policy Issues
      • eScience Research Centers
      • http:// option=com_content&view=section&id=9&Itemid=83
      • http://-us/collaboration/fourthparadigm/
  • 10. A FEW IMPORTANT CONCEPTS
  • 11. Data
      Any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
      Figure caption: An artist’s conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems.
      http:// nsf0728_4.pdf
  • 12. Scientific data formats
      Common data format
      Image formats
      Matrix formats
      Microarray file formats
      Communication protocols
  • 13. Scientific datasets
      • The scientific data set, or SDS, is a group of data structures used to store and describe multidimensional arrays of scientific data.
      • The boundaries of datasets vary from discipline to discipline.
      NCSA HDF Development Group. (1998). HDF 4.1r2 Users Guide. http:// SDS_SD.fm1.html#48894
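The SDS definition above (data structures that both store and describe a multidimensional array) can be illustrated with a minimal sketch. This is not the HDF library API; the class name `ScientificDataSet`, its fields, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ScientificDataSet:
    """Hypothetical stand-in for an SDS: array values plus their description."""
    name: str
    dims: List[Tuple[str, int]]   # (dimension name, size), outermost first
    values: List[float]           # array contents, flattened in row-major order
    attributes: Dict[str, str] = field(default_factory=dict)  # descriptive metadata

    def __post_init__(self) -> None:
        # The flat value list must match the product of the dimension sizes.
        expected = 1
        for _, size in self.dims:
            expected *= size
        if len(self.values) != expected:
            raise ValueError(f"expected {expected} values, got {len(self.values)}")

    def get(self, *index: int) -> float:
        """Return the element at a multidimensional index."""
        if len(index) != len(self.dims):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, (_, size) in zip(index, self.dims):
            if not 0 <= i < size:
                raise IndexError("index out of range")
            offset = offset * size + i  # row-major offset computation
        return self.values[offset]


# Illustrative values only: a 2 x 3 grid with a units attribute.
sds = ScientificDataSet(
    name="temperature",
    dims=[("lat", 2), ("lon", 3)],
    values=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    attributes={"units": "degC"},
)
```

Keeping dimension names and attributes next to the values is what lets a dataset "describe" itself; where the dataset boundary sits (one array or many) still varies by discipline.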
  • 14. Scientific workflows
      • Steps in data collection and analysis process
      • Different types of scientific workflows:
          – Data-intensive
          – Compute-intensive
          – Analysis-intensive
          – Visualization-intensive
      Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, E., Lee, E.A., Tao, J., & Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10): 1039-1065.
  • 15. Example: Ecological dataset
      • Floristic diversity data
          – Related links
          – Data attributes
          – Download link
  • 16. Example: Biodiversity dataset
      • Actions for Porcupine Marine Natural History Society - Marine flora and fauna records from the North-east Atlantic
          – Metadata record output in different standard formats
          – URL for dataset download
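A rough sketch of what "metadata record output in different standard formats" can mean in practice: the same record serialized once as JSON and once as XML with Dublin Core element names. The record fields and the `metadata` wrapper element are assumptions made for illustration; only the dataset title comes from the slide, and this is not the actual output of the portal shown.

```python
import json
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record: keys are Dublin Core element names.
RECORD = {
    "title": "Marine flora and fauna records from the North-east Atlantic",
    "type": "Dataset",
    "format": "text/csv",  # assumed format, not stated on the slide
}


def to_json(record: dict) -> str:
    """Serialize the record as a JSON object."""
    return json.dumps(record, indent=2, sort_keys=True)


def to_dublin_core_xml(record: dict) -> str:
    """Serialize the record as XML, one dc:<element> per field."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for key, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{key}")
        child.text = value
    return ET.tostring(root, encoding="unicode")
```

Both functions read from one in-memory record, so adding another output format means adding another serializer rather than maintaining a second copy of the metadata.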
  • 17. Example: The Significant Earthquake Database
      • The Significant Earthquake Database
          – A database containing data about significant earthquake events and the damage caused
          – An interface for extracting a subset of data
          – A link to download the whole dataset
          – Documentation
  • 18. Social Science Data
  • 19. Research data collections
      Dimensions compared: data output, size, metadata standards, management
      • Larger, discipline-based collections
          – Metadata standards: multiple, comprehensive
          – Management: organized, institutionalized
      • Smaller, team-based collections
          – Metadata standards: none or random
          – Management: heroic individual inside the team
  • 20. Research collections
      • Limited processing or long-term management
      • Do not conform to any data standards
      • Varying sizes and formats of data files
      • Low level of processing, lack of plan for data products
      • Low awareness of metadata standards and data management issues
  • 21. Resource collections
      • Authored by a community of investigators, within a domain of science or engineering
      • Developed with community-level standards
      • Lifetime is between mid- and long-term
      • Example: Hubbard Brook Ecosystem Study ( http:// )
          – One of the regional sites in the Long Term Ecological Research Network (LTER)
          – Community of the ecological domain
          – Community of investigators from around the country on ecosystem study
          – Ecological Metadata Language (EML), a community-level standard
          – Cataloged, searchable dataset collections
  • 22. Reference collection
      • Example: Global Biodiversity Information Facility
          – Created by large segments of science community
          – Conform to robust, well-established and comprehensive standards, e.g.
              • ABCD (Access to Biological Collection Data)
              • Darwin Core
              • DiGIR (Distributed Generic Information Retrieval)
              • Dublin Core Metadata standard
              • GGF (Global Grid Forum)
              • Invasive Alien Species Profile
              • LSID (Life Sciences Identifier)
              • OGC (Open Geospatial Consortium)
  • 23. Biodiversity Information Facility
      http:// standards/
      http://-metadata-infrastructure/
  • 24. Datasets, data collections, and data repositories
      • Data repository: system for storing, managing, preserving, and providing access to datasets
      • Data collections are built for larger segments of science and engineering
      • Datasets
          – typically centered around an event or a study
          – contain a single file or multiple files in various formats
          – coupled with documentation about the background of data collection and processing
      A repository may contain one or more data collections
      A data collection may contain one or more datasets
      A dataset may contain one or more data files
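The containment relations on this slide (repository, then collection, then dataset, then data file) map naturally onto nested record types. The following is a minimal sketch under that reading; all class names, field names, and the sample content are hypothetical, not taken from any repository software.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    name: str
    fmt: str  # file format label, e.g. "csv"


@dataclass
class Dataset:
    title: str
    documentation: str  # background on data collection and processing
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class DataRepository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def count_files(self) -> int:
        """Total number of data files across all collections and datasets."""
        return sum(
            len(ds.files)
            for coll in self.collections
            for ds in coll.datasets
        )


def build_example() -> DataRepository:
    """Assemble a small illustrative repository (made-up names)."""
    survey = Dataset(
        title="Plot survey",
        documentation="Field protocol and processing notes",
        files=[DataFile("plots.csv", "csv"), DataFile("readme.txt", "txt")],
    )
    sensors = Dataset(
        title="Sensor readings",
        documentation="Instrument calibration notes",
        files=[DataFile("readings.nc", "netcdf")],
    )
    ecology = DataCollection(name="Ecology", datasets=[survey, sensors])
    return DataRepository(name="Campus repository", collections=[ecology])
```

Each level holds a list of the level below it, which mirrors the "may contain one or more" wording; documentation lives on the dataset because the slide couples it with the dataset rather than with individual files.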
  • 25. An emerging trend in academic libraries
  • 26. Initiatives in research libraries
      • Data support and services in institutions: 73%
      • Libraries involved in supporting eScience: 45%
      • Pressure points:
          – Lack of resources
          – Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
          – Lack of a unifying direction on campus
      Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://
  • 27. Data management challenges
      • No one-size-fits-all solution
      • Requires an in-depth understanding of scientific workflows and research lifecycle
      • Involves not only technical design and planning but also organizational collaboration and institutionalization of data policy
  • 28. Data preservation challenges
      • Data formats
          – Vary in data types, e.g. vector and raster data types
          – Format conversions, e.g. from an old version to a newer one
      • Data relations
          – e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
      • Semantic issues
          – Naming datasets and attributes
  • 29. Data access challenges
      • Reliability
      • Authenticity
      • Leverage technology to make data access easier and more effective
          – Cross-database search
          – Integration applications
  • 30. Supporting digital research data
      • Lifecycle of research data
          – Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
          – Edit: organize, annotate, clean, filter…
          – Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
          – Publish: disseminate data via portals and associate datasets with research publications
          – Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
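The lifecycle stages listed above can be sketched as a small state model. The stage names follow the slide; the transition table (which stage may follow which, including the loop from reuse back to editing) is an assumption added for illustration and is not stated in the source.

```python
from enum import Enum
from typing import Dict, List, Set


class Stage(Enum):
    CREATE = "create"
    EDIT = "edit"
    USE = "use/reuse"
    PUBLISH = "publish"
    PRESERVE = "preserve/destroy"


# Assumed transitions: reuse can send data back to editing,
# and published data can be reused before it is preserved or destroyed.
NEXT: Dict[Stage, Set[Stage]] = {
    Stage.CREATE: {Stage.EDIT},
    Stage.EDIT: {Stage.USE},
    Stage.USE: {Stage.EDIT, Stage.PUBLISH},
    Stage.PUBLISH: {Stage.USE, Stage.PRESERVE},
    Stage.PRESERVE: set(),
}


def is_valid_path(path: List[Stage]) -> bool:
    """Check that every consecutive pair of stages is an allowed transition."""
    return all(later in NEXT[earlier] for earlier, later in zip(path, path[1:]))
```

A data service could use a table like this to decide which support applies at each point, for example offering cleaning help at the edit stage and repository deposit at the publish stage.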
  • 31. Supporting data management
      The data deluge
      • Numerical, image, video
      • Models, simulations, bit streams
      • XML, CSV, DB, HTML
      Researchers need:
      • Specialized search engines to discover the data they need
      • Powerful data mining tools to use and analyze the data
  • 32. Research data management
      Diagram labels:
      • Community
      • Institution
      • eScience librarian
      • Financial and policy support
      • Science domain
      • Data content idiosyncrasies
      • User requirements
      • Evolving and interconnecting: institutional repository, community repository, national repository, international repository
  • 33. Implications for the scholarly communication process
      • Publishing: data publishing; new scholarly publishing models, including open access, institutional and community repositories, self-publishing, library publishing, ....
      • Curation: maintaining, preserving and adding value to digital research data throughout its lifecycle.
      • Archiving: the long-term storage, retrieval, and use of scientific data and methods.
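  • 36. Current state
      Diagram labels: researchers; data and files; college and department servers; campus servers; disciplinary repository; campus institutional repository; journal and conference paper publishing
      Questions on the diagram:
      • Is there a disciplinary repository?
      • Are data and files being deposited?
      • Is the campus repository linked to the disciplinary repository?
      Questions to ask:
      • What file formats are used?
      • How are the files organized?
      • How are they used?
      • Can they be shared with people outside the project team?
      • If so, under what conditions and rules?
      • How are files and data preserved?
      • Which laws and regulations must be followed?
      12/18/11 15:50 Promoting scholarly communication: how to make the first kick? 36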
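  • 37. From current state to goals
      Current state
      • No unified rules and regulations
      • No awareness of file and data management
      • No policies or rules on data use and sharing
      Steps
      • Survey existing institutional data policies
      • Obtain support from university leadership and relevant departments
      • Proof of Concept Project
      Goals
      • Establish unified policies for data acquisition, use, management, and sharing
      • Build an institutional data repository (campus cyberinfrastructure-enabled support)
      • Publicize widely and persuade researchers with facts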
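  • 38. Actions!
      Organization chart: President; VP for Academic Affairs; VP for Research; Research office; Library; IT services; iSchool; College…
      • Survey existing institutional data policies, write a report, and make recommendations to the VP for Research
      • Collaborate with relevant university units
      • Advisory input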
  • 41. http://
  • 42. https://
  • 43. http://-management/
  • 44. Summary
      • Managing research data is motivated by:
          – Government funding agency’s policy
          – Needs for data sharing, cross validation of data and research, credit, and large-scale interdisciplinary discovery
      • Organizational changes:
          – New organizational units within the university library or at the university level
          – Virtual group
          – Collaboration among key units: Libraries, IT services, research administration office
  • 45. Summary
      • Types of services
          – Training faculty and students for data literacy
          – Data curation services (data repositories, digital libraries, archiving data)
          – Consulting services
          – Data management plan
          – Developing data policies