Scientific data management (v2)

Covers basic concepts of data and data management and a few examples of data services provided by universities.

  • 1. Scientific Data Management
    A tutorial at ICADL 2011, October 24, 2011
    Jian Qin, School of Information Studies, Syracuse University
    http://
  • 2. The morning ahead
    • An environmental scan
      – E-Science, cyberinfrastructure, and data
      – What do all these have to do with me?
    • Case study: Gravitational wave research data management
    • Group work: Role play in developing data management initiatives
    12/18/11 15:51 Overview of E-Science 2
  • 3. An environmental scan
    • E-Science, cyberinfrastructure, and data
    • What do all these have to do with me?
    Overview of E-Science
      – Characteristics of e-science
      – Data sets, data collections, and data repositories
      – Why does it matter to libraries?
  • 4. E-Science
    "In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet."
    National e-Science Center. (2008). Defining e-Science. http://
  • 5. E-Infrastructure for the research lifecycle
    http:// 3857/science_lifecycle_STFC_poster1.PDF
  • 6. Shift in Science Paradigms
    • Thousand years ago: Science was empirical, describing natural phenomena
    • A few hundred years ago: Theoretical branch, using models, generalizations
    • A few decades ago: A computational approach, simulating complex phenomena
    • Today: Data exploration (eScience), unify theory, experiment, and simulation
      – Data captured by instruments or generated by simulator
      – Processed by software
      – Information/Knowledge stored in computer
      – Scientist analyzes database/files using data management and statistics
    Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  • 8. X-Info
    • The evolution of X-Info and Comp-X for each discipline X
    • How to codify and represent our knowledge
    [Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations supply facts; questions go in and answers come out.]
    The Generic Problems
    • Data ingest
    • Managing a petabyte
    • Common schema
    • How to organize it
    • How to reorganize it
    • How to share with others
    • Query and Vis tools
    • Building and executing models
    • Integrating data and Literature
    • Documenting experiments
    • Curation and long-term preservation
    Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  • 9. Useful resources
    • What is eScience?
    • eScience Initiatives
    • Science Research and Data
    • Science Data Management
    • Literature Reviews
    • Data Policy Issues
    • eScience Research Centers
    • http:// option=com_content&view=section&id=9&Itemid=83
    • http://-us/collaboration/fourthparadigm/
  • 10. A FEW IMPORTANT CONCEPTS
  • 11. Data
    Any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
    [Figure caption: An artist's conception depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems.]
    http:// nsf0728_4.pdf
  • 12. Scientific data formats
    • Common data format
    • Image formats
    • Matrix formats
    • Microarray file formats
    • Communication protocols
  • 13. Scientific datasets
    • The scientific data set, or SDS, is a group of data structures used to store and describe multidimensional arrays of scientific data.
    • The boundaries of datasets vary from discipline to discipline
    NCSA HDF Development Group. (1998). HDF 4.1r2 Users Guide. http:// SDS_SD.fm1.html#48894
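As a rough illustration of the SDS idea on slide 13, the sketch below models a scientific data set as an array plus the descriptive pieces stored alongside it. It is not the HDF API: the class name `ScientificDataSet`, its fields, and the sample temperature values are all hypothetical, and a nested Python list stands in for the multidimensional array.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ScientificDataSet:
    """Hypothetical stand-in for an SDS: an array plus what describes it."""
    name: str
    dtype: str                      # declared element type, e.g. "float32"
    dim_names: Tuple[str, ...]      # one label per array dimension
    array: List[Any]                # nested lists, outermost dimension first
    attributes: Dict[str, str] = field(default_factory=dict)

    def shape(self) -> Tuple[int, ...]:
        """Follow the first element at each nesting level to get the sizes."""
        sizes = []
        level: Any = self.array
        while isinstance(level, list):
            sizes.append(len(level))
            level = level[0] if level else None
        return tuple(sizes)

    def is_consistent(self) -> bool:
        """True when there is exactly one dimension name per array axis."""
        return len(self.dim_names) == len(self.shape())


# Made-up example: one time step on a 2 x 2 latitude/longitude grid.
sds = ScientificDataSet(
    name="surface_temperature",
    dtype="float32",
    dim_names=("time", "latitude", "longitude"),
    array=[[[14.1, 14.3], [13.8, 13.9]]],
    attributes={"units": "degrees Celsius"},
)
print(sds.shape(), sds.is_consistent())
```

The point of the sketch is only that the description (name, type, dimension labels, attributes) travels with the numbers; where a dataset's boundary is drawn still depends on the discipline.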
  • 14. Scientific workflows
    • Steps in data collection and analysis process
    • Different types of scientific workflows:
      – Data-intensive
      – Compute-intensive
      – Analysis-intensive
      – Visualization-intensive
    Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., & Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10): 1039-1065.
  • 15. Example: Ecological dataset
    • Floristic diversity data
      – Related links
      – Data attributes
      – Download link
  • 16. Example: Biodiversity dataset
    • Actions for Porcupine Marine Natural History Society - Marine flora and fauna records from the North-east Atlantic
      – Metadata record output in different standard formats
      – URL for dataset download
  • 17. Example: The Significant Earthquake Database
    • The Significant Earthquake Database
      – A database containing data about significant earthquake events and the damage caused
      – An interface for extracting a subset of data
      – A link to download the whole dataset
      – Documentation
  • 18. Social Science Data
  • 19. Research data collections
    Dimensions compared: data output, size, metadata standards, management
    • Larger, discipline-based collections: multiple, comprehensive metadata standards; organized, institutionalized management
    • Smaller, team-based collections: no or random metadata standards; management by a heroic individual inside the team
  • 20. Research collections
    • Limited processing or long-term management
    • Do not conform to any data standards
    • Varying sizes and formats of data files
    • Low level of processing, lack of plan for data products
    • Low awareness of metadata standards and data management issues
  • 21. Resource collections
    • Authored by a community of investigators, within a domain of science or engineering
    • Developed with community level standards
    • Lifetime is between mid- and long-term
    • Example: Hubbard Brook Ecosystem Study ( http:// )
      – One of the regional sites in the Long Term Ecological Research Network (LTER)
      – Community of the ecological domain
      – Community of investigators from around the country on ecosystem study
      – Ecological Metadata Language (EML), a community-level standard
      – Cataloged, searchable dataset collections
  • 22. Reference collection
    • Example: Global Biodiversity Information Facility
      – Created by large segments of science community
      – Conform to robust, well-established and comprehensive standards, e.g.
        • ABCD (Access to Biological Collection Data)
        • Darwin Core
        • DiGIR (Distributed Generic Information Retrieval)
        • Dublin Core Metadata standard
        • GGF (Global Grid Forum)
        • Invasive Alien Species Profile
        • LSID (Life Sciences Identifier)
        • OGC (Open Geospatial Consortium)
  • 23. Biodiversity Information Facility
    http:// standards/
    http://-metadata-infrastructure/
  • 24. Datasets, data collections, and data repositories
    • Data repository: System for storing, managing, preserving, and providing access to datasets
    • Data collections are built for larger segments of science and engineering
    • Datasets
      – typically centered around an event or a study
      – contain a single file or multiple files in various formats
      – coupled with documentation about the background of data collection and processing
    • Containment
      – A repository may contain one or more data collections
      – A data collection may contain one or more datasets
      – A dataset may contain one or more data files
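The containment relationships on slide 24 can be sketched as nested record types. This is only a minimal illustration under stated assumptions: the class names, field names, and the sample repository below are hypothetical and do not describe any particular repository system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    path: str
    file_format: str


@dataclass
class Dataset:
    """Centered on an event or study; carries its own documentation."""
    title: str
    documentation: str
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class DataRepository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def count_files(self) -> int:
        """Count data files across every collection and dataset."""
        return sum(
            len(dataset.files)
            for collection in self.collections
            for dataset in collection.datasets
        )


# Made-up example: one repository, one collection, one dataset, two files.
survey = Dataset(
    title="Plot survey, summer season",
    documentation="Sampling protocol and processing notes",
    files=[
        DataFile("survey/plots.csv", "CSV"),
        DataFile("survey/site_map.tif", "GeoTIFF"),
    ],
)
repo = DataRepository(
    name="Example campus repository",
    collections=[DataCollection(name="Ecology", datasets=[survey])],
)
print(repo.count_files())
```

Keeping documentation on the dataset rather than on individual files mirrors the slide's point that a dataset is coupled with a description of how the data were collected and processed.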
  • 25. An emerging trend in academic libraries
  • 26. Initiatives in research libraries
    • Data support and services in institutions: 73%
    • Libraries involved in supporting eScience: 45%
    • Pressure points:
      – Lack of resources
      – Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
      – Lack of a unifying direction on campus
    Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://
  • 27. Data management challenges
    • No one-size-fits-all solution
    • Requires an in-depth understanding of scientific workflows and research lifecycle
    • Involves not only technical design and planning but also organizational collaboration and institutionalization of data policy
  • 28. Data preservation challenges
    • Data formats
      – Vary in data types, e.g. vector and raster data types
      – Format conversions, e.g. from an old version to a newer one
    • Data relations
      – e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
    • Semantic issues
      – Naming datasets and attributes
  • 29. Data access challenges
    • Reliability
    • Authenticity
    • Leverage technology to make data access easier and more effective
      – Cross-database search
      – Integration applications
  • 30. Supporting digital research data
    • Lifecycle of research data
      – Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
      – Edit: organize, annotate, clean, filter…
      – Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
      – Publish: disseminate data via portals and associate datasets with research publications
      – Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
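One way to make the lifecycle on slide 30 concrete is to track which stage a dataset is in and which moves are permitted. The sketch below is an assumption-laden illustration: the stage names come from the slide, but the `ALLOWED` transition table, the `DatasetRecord` class, and the sample title are hypothetical, and a real service would likely allow different or looser transitions.

```python
from enum import Enum
from typing import Dict, List, Set


class Stage(Enum):
    CREATE = "create"
    EDIT = "edit"
    USE = "use/reuse"
    PUBLISH = "publish"
    PRESERVE = "preserve"
    DESTROY = "destroy"


# Assumption: these permitted moves are illustrative, not taken from the slide.
ALLOWED: Dict[Stage, Set[Stage]] = {
    Stage.CREATE: {Stage.EDIT},
    Stage.EDIT: {Stage.EDIT, Stage.USE},
    Stage.USE: {Stage.EDIT, Stage.USE, Stage.PUBLISH,
                Stage.PRESERVE, Stage.DESTROY},
    Stage.PUBLISH: {Stage.USE, Stage.PRESERVE},
    Stage.PRESERVE: {Stage.USE},   # preserved data can be reused later
    Stage.DESTROY: set(),          # terminal stage
}


class DatasetRecord:
    """Keeps the ordered history of lifecycle stages for one dataset."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.history: List[Stage] = [Stage.CREATE]

    @property
    def stage(self) -> Stage:
        return self.history[-1]

    def advance(self, new_stage: Stage) -> None:
        """Move to new_stage, or raise ValueError if the move is not allowed."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {new_stage.value}"
            )
        self.history.append(new_stage)


record = DatasetRecord("Field survey data")
for step in (Stage.EDIT, Stage.USE, Stage.PUBLISH, Stage.PRESERVE):
    record.advance(step)
print([s.value for s in record.history])
```

Recording the full history rather than only the current stage keeps a simple provenance trail, which is useful when a dataset is later reused or cited.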
  • 31. Supporting data management
    • The data deluge
      – Numerical, image, video
      – Models, simulations, bit streams
      – XML, CSV, DB, HTML
    • Researchers need:
      – Specialized search engines to discover the data they need
      – Powerful data mining tools to use and analyze the data
  • 32. Research data management
    • Community; Institution; eScience librarian
    • Financial and policy support
    • Science domain; Data content idiosyncrasies; User requirements
    • Evolving and interconnecting: institutional repository, community repository, national repository, international repository
  • 33. Implications for the scholarly communication process
    • Publishing: Data publishing; new scholarly publishing models: open access, institutional and community repositories, self-publishing, library publishing, ...
    • Curation: Maintaining, preserving and adding value to digital research data throughout its lifecycle.
    • Archiving: The long-term storage, retrieval, and use of scientific data and methods.
  • 34. Evolution of terminology
    12/18/11 15:50 Promoting scholarly communication: How to kick things off? 34
  • 35. Case study 1: Developing an institutional policy for data preservation and sharing
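  • 36. Current situation
    • Is there a disciplinary repository?
    • Are data submitted to it?
    • Is the campus repository linked to the disciplinary repository?
    [Diagram: researchers; data and files; school and department servers; campus servers; disciplinary repository; campus institutional repository; journal and conference paper publication]
    • What file formats?
    • How are they organized?
    • How are they used?
    • Can they be shared with people outside the project team?
    • If so, under what conditions and rules?
    • How are files and data preserved?
    • Which laws and regulations must be followed?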
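  • 37. From current situation to goals
    • Current situation
      – No unified rules and regulations
      – No awareness of file and data management
      – No policy provisions for data use and sharing
    • Steps
      – Survey existing institutional data policies
      – Obtain support from university leadership and relevant departments
      – Proof of Concept Project
    • Goals
      – Establish a unified policy for data acquisition, use, management, and sharing
      – Build an institutional data repository (campus cyberinfrastructure-enabled support)
      – Publicize widely and use facts to persuade researchers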
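  • 38. Actions!
    • University president
      – VP for Academic Affairs
      – VP for Research
    • Research office; Library; IT services; iSchool; College…
    • Survey existing institutional data policies, write a report, and give recommendations to the VP for Research
    • Collaborate with relevant university units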
  • 41. http://
  • 42. https://
  • 43. http://-management/
  • 44. Summary
    • Managing research data is motivated by:
      – Government funding agencies' policies
      – Needs for data sharing, cross validation of data and research, credit, and large-scale interdisciplinary discovery
    • Organizational changes:
      – New organizational units within the university library or at the university level
      – Virtual group
      – Collaboration among key units: Libraries, IT services, research administration office
  • 45. Summary
    • Types of services
      – Training faculty and students for data literacy
      – Data curation services (data repositories, digital libraries, archiving data)
      – Consulting services
      – Data management plan
      – Developing data policies