Scientific data management (v2)
Covers basic concepts of data and data management and a few examples of data services provided by universities.

  1. 1. Scientific Data Management
     A tutorial at ICADL 2011, October 24, 2011
     Jian Qin, School of Information Studies, Syracuse University
     http://
  2. 2. The morning ahead
     An environmental scan
     • E-Science, cyberinfrastructure, and data
     • What do all these have to do with me?
     Case study: Gravitational wave research data management
     Group work: Role play in developing data management initiatives
     12/18/11 15:51   Overview of E-Science
  3. 3. An environmental scan
     • E-Science, cyberinfrastructure, and data
     • What do all these have to do with me?
     Overview of E-Science
     Characteristics of e-science
     Data sets, data collections, and data repositories
     Why does it matter to libraries?
  4. 4. E-Science
     "In the future, e-Science will refer to the large scale science that will increasingly be carried out through distributed global collaborations enabled by the Internet."
     National e-Science Center. (2008). Defining e-Science. http://
  5. 5. E-Infrastructure for the research lifecycle
     http:// 3857/science_lifecycle_STFC_poster1.PDF
  6. 6. Shift in Science Paradigms
     • Thousand years ago: Science was empirical, describing natural phenomena
     • A few hundred years ago: Theoretical branch using models, generalizations
     • A few decades ago: A computational approach simulating complex phenomena
     • Today: Data exploration (eScience), unifying theory, experiment, and simulation
       – Data captured by instruments or generated by simulator
       – Processed by software
       – Information/knowledge stored in computer
       – Scientist analyzes database/files using data management and statistics
     Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  8. 8. X-Info
     • The evolution of X-Info and Comp-X for each discipline X
     • How to codify and represent our knowledge
     Diagram: facts from Experiments & Instruments, Other Archives, Literature, and Simulations feed a central store; questions go in and answers come out.
     The Generic Problems
     • Data ingest
     • Managing a petabyte
     • Common schema
     • How to organize it
     • How to reorganize it
     • How to share with others
     • Query and Vis tools
     • Building and executing models
     • Integrating data and Literature
     • Documenting experiments
     • Curation and long-term preservation
     Gray, J. & Szalay, A. (2007). eScience – A transformed scientific method. http://-us/um/people/gray/talks/NRC-CSTB_eScience.ppt
  9. 9. Useful resources
     • What is eScience?
     • eScience Initiatives
     • Science Research and Data
     • Science Data Management
     • Literature Reviews
     • Data Policy Issues
     • eScience Research Centers
     http:// option=com_content&view=section&id=9&Itemid=83
     http://-us/collaboration/fourthparadigm/
  10. 10. A FEW IMPORTANT CONCEPTS
  11. 11. Data
     Any and all complex data entities from observations, experiments, simulations, models, and higher order assemblies, along with the associated documentation needed to describe and interpret the data.
     Figure caption: An artist's conception (above) depicts fundamental NEON observatory instrumentation and systems as well as potential spatial organization of the environmental measurements made by these instruments and systems.
     http:// nsf0728_4.pdf
  12. 12. Scientific data formats
     • Common data format
     • Image formats
     • Matrix formats
     • Microarray file formats
     • Communication protocols
  13. 13. Scientific datasets
     • The scientific data set, or SDS, is a group of data structures used to store and describe multidimensional arrays of scientific data.
     • The boundaries of datasets vary from discipline to discipline.
     NCSA HDF Development Group. (1998). HDF 4.1r2 Users Guide. http:// SDS_SD.fm1.html#48894
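The SDS idea on slide 13 (a multidimensional array stored together with the structures that describe it) can be sketched in plain Python. This is an illustration only and not the HDF API: the class `ScientificDataSet`, its fields, and the row-major flat storage are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ScientificDataSet:
    """Hypothetical stand-in for an SDS: array values plus describing metadata."""
    name: str
    dims: Tuple[str, ...]   # dimension names, e.g. ("time", "station")
    shape: Tuple[int, ...]  # size of each dimension
    values: List[float]     # flat storage in row-major order (assumed layout)
    attributes: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The number of stored values must match the product of the shape.
        expected = 1
        for n in self.shape:
            expected *= n
        if len(self.dims) != len(self.shape) or len(self.values) != expected:
            raise ValueError("dims, shape, and values are inconsistent")

    def get(self, *index: int) -> float:
        """Return the value at a multidimensional index."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        offset = 0
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("index out of range")
            offset = offset * n + i  # row-major offset computation
        return self.values[offset]


def build_example_sds() -> ScientificDataSet:
    # Placeholder data: 2 time steps by 3 stations.
    return ScientificDataSet(
        name="temperature",
        dims=("time", "station"),
        shape=(2, 3),
        values=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        attributes={"units": "degC"},
    )
```

Keeping dimension names and attributes next to the values is what lets a dataset describe itself; real formats such as HDF provide this through their own library calls rather than a class like this one.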
  14. 14. Scientific workflows
     • Steps in the data collection and analysis process
     • Different types of scientific workflows:
       – Data-intensive
       – Compute-intensive
       – Analysis-intensive
       – Visualization-intensive
     Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, E., Lee, E.A., Tao, J., & Zhao, Y. (2006). Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10): 1039-1065.
  15. 15. Example: Ecological dataset
     • Floristic diversity data
       – Related links
       – Data attributes
       – Download link
  16. 16. Example: Biodiversity dataset
     • Actions for Porcupine Marine Natural History Society - Marine flora and fauna records from the North-east Atlantic
       – Metadata record output in different standard formats
       – URL for dataset download
  17. 17. Example: The Significant Earthquake Database
     • The Significant Earthquake Database
       – A database containing data about significant earthquake events and the damage caused
       – An interface for extracting a subset of data
       – A link to download the whole dataset
       – Documentation
  18. 18. Social Science Data
  19. 19. Research data collections
     Data output, compared by size, metadata standards, and management:

     Size                        Metadata standards         Management
     Larger, discipline-based    Multiple, comprehensive    Institutionalized, organized
     Smaller, team-based         None or random             Heroic individual inside the team
  20. 20. Research collections
     • Limited processing or long-term management
     • Do not conform to any data standards
     • Varying sizes and formats of data files
     • Low level of processing, lack of a plan for data products
     • Low awareness of metadata standards and data management issues
  21. 21. Resource collections
     • Authored by a community of investigators, within a domain of science or engineering
     • Developed with community-level standards
     • Lifetime is between mid- and long-term
     • Example: Hubbard Brook Ecosystem Study (http://)
       – One of the regional sites in the Long Term Ecological Research Network (LTER)
       – Community of the ecological domain
       – Community of investigators from around the country on ecosystem study
       – Ecological Metadata Language (EML), a community-level standard
       – Cataloged, searchable dataset collections
  22. 22. Reference collection
     • Example: Global Biodiversity Information Facility
       – Created by large segments of the science community
       – Conforms to robust, well-established, and comprehensive standards, e.g.
         • ABCD (Access to Biological Collection Data)
         • Darwin Core
         • DiGIR (Distributed Generic Information Retrieval)
         • Dublin Core Metadata standard
         • GGF (Global Grid Forum)
         • Invasive Alien Species Profile
         • LSID (Life Sciences Identifier)
         • OGC (Open Geospatial Consortium)
  23. 23. Biodiversity Information Facility
     http:// standards/
     http://-metadata-infrastructure/
  24. 24. Datasets, data collections, and data repositories
     • Data repository: a system for storing, managing, preserving, and providing access to datasets
     • Data collections are built for larger segments of science and engineering
     • Datasets
       – typically centered around an event or a study
       – contain a single file or multiple files in various formats
       – coupled with documentation about the background of data collection and processing
     Containment:
     • A repository may contain one or more data collections
     • A data collection may contain one or more datasets
     • A dataset may contain one or more data files
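The containment relationships on slide 24 can be sketched as nested Python dataclasses. This is a minimal illustration under stated assumptions: the class names, field names, and the example records in `build_example` are hypothetical and do not describe any particular repository system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    name: str
    fmt: str  # file format label, e.g. "csv"


@dataclass
class Dataset:
    title: str
    documentation: str  # background on data collection and processing
    files: List[DataFile] = field(default_factory=list)


@dataclass
class DataCollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class DataRepository:
    name: str
    collections: List[DataCollection] = field(default_factory=list)

    def count_files(self) -> int:
        """Total number of data files across all collections and datasets."""
        return sum(
            len(ds.files) for col in self.collections for ds in col.datasets
        )


def build_example() -> DataRepository:
    # All names below are placeholders invented for the example.
    survey = Dataset(
        title="Hypothetical plot survey",
        documentation="Field protocol and processing notes (placeholder).",
        files=[DataFile("plots.csv", "csv"), DataFile("readme.txt", "txt")],
    )
    collection = DataCollection("Hypothetical ecology collection", [survey])
    return DataRepository("Hypothetical campus repository", [collection])
```

Modeling each level as its own type keeps the one-to-many relationships explicit: documentation attaches to the dataset, while format is a property of each individual file.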
  25. 25. An emerging trend in academic libraries
  26. 26. Initiatives in research libraries
     • Libraries involved in supporting eScience: 73%
     • Data support and services in institutions: 45%
     • Pressure points:
       – Lack of resources
       – Difficulty acquiring the appropriate staff and expertise to provide eScience and data management or curation services
       – Lack of a unifying direction on campus
     Source: Soehner, C., Steeves, C. & Ward, J. (2010). E-Science and data support services: A study of ARL member institutions. http://
  27. 27. Data management challenges
     • No one-size-fits-all solution
     • Requires an in-depth understanding of scientific workflows and the research lifecycle
     • Involves not only technical design and planning but also organizational collaboration and institutionalization of data policy
  28. 28. Data preservation challenges
     • Data formats
       – Vary in data types, e.g. vector and raster data types
       – Format conversions, e.g. from an old version to a newer one
     • Data relations
       – e.g. there are data models, annotations, classification schemes, and symbolization files for a digital map
     • Semantic issues
       – Naming datasets and attributes
  29. 29. Data access challenges
     • Reliability
     • Authenticity
     • Leverage technology to make data access easier and more effective
       – Cross-database search
       – Integration applications
  30. 30. Supporting digital research data
     • Lifecycle of research data
       – Create: data creation/capture/gathering from laboratory experiments, field work, surveys, devices, media, simulation output…
       – Edit: organize, annotate, clean, filter…
       – Use/reuse: analyze, mine, model, derive additional data, visualize, input to instruments/computers
       – Publish: disseminate data via portals and associate datasets with research publications
       – Preserve/destroy: store/preserve, store/replicate/preserve, store/ignore, destroy…
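The lifecycle stages on slide 30 can be sketched as a small state machine in Python. The stage names come from the slide; the `ALLOWED` transition table is an assumption made for illustration, since the slide lists the stages but does not say which moves between them are permitted.

```python
from enum import Enum
from typing import Dict, List, Set


class Stage(Enum):
    CREATE = "create"
    EDIT = "edit"
    USE = "use/reuse"
    PUBLISH = "publish"
    PRESERVE = "preserve"
    DESTROY = "destroy"


# Assumed transitions, not taken from the slide.
ALLOWED: Dict[Stage, Set[Stage]] = {
    Stage.CREATE: {Stage.EDIT},
    Stage.EDIT: {Stage.USE},
    Stage.USE: {Stage.EDIT, Stage.PUBLISH, Stage.PRESERVE, Stage.DESTROY},
    Stage.PUBLISH: {Stage.USE, Stage.PRESERVE},
    Stage.PRESERVE: {Stage.USE, Stage.DESTROY},
    Stage.DESTROY: set(),  # terminal: nothing follows destruction
}


def advance(current: Stage, target: Stage) -> Stage:
    """Move to the target stage, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def is_valid_path(path: List[Stage]) -> bool:
    """Check that every consecutive pair of stages is an allowed move."""
    return all(b in ALLOWED[a] for a, b in zip(path, path[1:]))
```

Allowing a move from preserve back to use/reuse reflects the reuse loop implied by the slide; a local policy could tighten or loosen this table without touching the functions.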
  31. 31. Supporting data management
     The data deluge:
     • Numerical, image, video
     • Models, simulations, bit streams
     • XML, CSV, DB, HTML
     Researchers need:
     • Specialized search engines to discover the data they need
     • Powerful data mining tools to use and analyze the data
  32. 32. Research data management
     • Actors: Community, Institution, eScience librarian
     • Financial and policy support
     • Science domain; Data content idiosyncrasies; User requirements
     • Evolving and interconnecting: Institutional repository, Community repository, National repository, International repository
  33. 33. Implications for the scholarly communication process
     • Publishing: Data publishing; new scholarly publishing models: open access, institutional and community repositories, self-publishing, library publishing, ....
     • Curation: Maintaining, preserving, and adding value to digital research data throughout its lifecycle.
     • Archiving: The long-term storage, retrieval, and use of scientific data and methods.
  34. 34. The evolution of terminology
     12/18/11 15:50   Promoting scholarly communication: How to kick things off?
  35. 35. Case study 1: Developing an institutional policy for data preservation and sharing
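  36. 36. Current state
     Diagram: researchers produce data and files, which sit on college and department servers, campus servers, disciplinary repositories, and the campus institutional repository, and lead to journal and conference paper publications.
     • Is there a disciplinary repository?
     • Are data being deposited?
     • Is the campus repository linked to disciplinary repositories?
     • What file formats are used?
     • How are the files organized?
     • How are they used?
     • Can they be shared with people outside the project team?
     • If so, under what conditions and rules?
     • How are files and data preserved?
     • Which laws and regulations must be followed?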
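  37. 37. Current state and goals
     Current state:
     • No unified rules and regulations
     • No awareness of file and data management
     • No policy provisions on data use and sharing
     Steps:
     • Survey existing institutional data policies
     • Obtain support from university leadership and relevant departments
     • Proof of Concept Project
     Goals:
     • Establish a unified policy for data access, use, management, and sharing
     • Establish an institutional data repository (campus cyberinfrastructure-enabled support)
     • Publicize widely and use evidence to persuade researchers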
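  38. 38. Actions!
     Organization chart: President; VP for Academic Affairs; VP for Research; Research office; Library; IT services; iSchool; College…
     • Survey existing institutional data policies, write a report, and give recommendations to the VP for Research
     • Collaborate with the relevant university units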
  41. 41. http://
  42. 42. https://
  43. 43. http://-management/
  44. 44. Summary
     • Managing research data is motivated by:
       – Government funding agencies' policies
       – Needs for data sharing, cross-validation of data and research, credit, and large-scale interdisciplinary discovery
     • Organizational changes:
       – New organizational units within the university library or at the university level
       – Virtual group
       – Collaboration among key units: libraries, IT services, research administration office
  45. 45. Summary
     • Types of services
       – Training faculty and students in data literacy
       – Data curation services (data repositories, digital libraries, archiving data)
       – Consulting services
       – Data management plan
       – Developing data policies