Talk at OHSU, September 25, 2013


Presentation on research data management at Oregon Health & Science University (OHSU)

Published in: Education, Technology

  1. Making Research Data Discoverable and Usable (It’s the metadata, stupid!) Anita de Waard, VP Research Data Collaborations. http://
  2. Research data is the ‘new hotness’… Roles and needs wrt research data. Roles: Gov; Funding bodies; University management; Researchers; Librarians; Data repositories. Needs: Share research outputs; Demonstrate impact to public; Data availability drives growth; Demonstrate impact; Guarantee permanence, discoverability; Avoid fraud; Generate, track outputs; Comply with mandates; Ensure availability; Archive, track, curate; Support researcher/institution; Archive; Add curation; Allow reuse; Derive credit; Comply with mandates; Discover and use; Cite/acknowledge. Quotes: Todd Vision, DataDryad, OAI8, 6/23/13: “We need to find a way to keep Dryad funded, and would love to hear your ideas about doing that.” Phil Bourne, Associate Vice Chancellor, UCSD, 4/13: “We are thinking about the university as a digital enterprise.” Mike Huerta, Assoc. Director, NLM Office of Health Info at NIH, 6/13: “Today, the major public product of science are concepts, written down in papers. But tomorrow, data will be the main product of science…. We will require scientists to track and share their data at least as well, if not better, than they are sharing their ideas today.” Mara Saule, Dean University Libraries/CIO, UVM, 5/13: “We need to do something about data.” Nathan Urban, PI Urban Lab, CMU, 3/13: “If we can share our data, we can write a paper that will knock everybody’s socks off!” Barbara Ransom, NSF Program Director Earth Sciences, 2/13: “We’re not going to spend any more money for you to go out and get more data! We want you first to show us how you’re going to use all the data we paid y’all to collect in the past!”
  3. Research data management today: Using antibodies and squishy bits, grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.
  4. Research today (in biology) is often quite insular: [Diagram, shown twice: Prepare, Observe, Analyze, Ponder, Communicate]
  5. But life is VERY complicated: http:// • Interspecies variability: A specimen is not a species • Gene expression variability: Knowing genes is not knowing how they are expressed • Microbiome: An animal is an ecosystem • Systems biology: A whole is more than the sum of its parts. Reductionist science does not work for living systems!
  6. What if the data were connected? Across labs, experiments: track reagents and how they are used. [Diagram, shown twice: Prepare, Analyze, Communicate; linked by shared Observations]
  7. What if the data were connected? Compare outcome of interactions with these entities. [Diagram, shown twice: Prepare, Analyze, Communicate; linked by shared Observations]
  8. What if the data were connected? Build a ‘virtual reagent spectrogram’ by comparing how different entities interacted in different experiments. [Diagram, shown twice: Prepare, Analyze, Communicate; linked by shared Observations; plus a ‘Think’ step]
  9. Where research data goes now: > 50 M papers; 2 M scientists; 2 M papers/year. Majority of data (90%?) is stored on local hard drives. Some data (8%?) is stored in large, generic data repositories: Dryad: 7,631 files; Dataverse: 0.6 M; institutional repositories. A small portion of data (1-2%?) is stored in small, topic-focused data repositories: MiRB: 25 k; PetDB: 1.5 k; TAIR: 72.1 k; PDB: 88.3 k; SedDB: 0.6 k. Questions: 1. How do we get researchers to curate, store and share their data? 2. How do we ensure long-term sustainability for high-end repositories? 3. What role do libraries/institutions play?
  10. 1.1. An attempt to get researchers to curate (but only partially share!) their data: de Waard, A., Burton, S. et al., 2013
  11. 1.2. What to do in the meantime? • In 220 publications only 40% of antibodies, 40% of cell lines and 25% of constructs can be manually identified (Vasilevsky et al., submitted) • Proposal (with NIH/NIF and Force11 Group): – Adding minimal data standards – Tool extracts likely reagents / resources – User interface asks author to confirm or select. [Chart labels: 49 publications; 193 publications; 76 publications; 214 publications; 210 publica…]
  12. 2.1. How can research databases become long-term sustainable? Pilot project with IEDA: – Build a database for lunar geochemistry – Write joint report on building repository, curation, costs and challenges
  13. 2.2. Cost recovery questionnaire: With WDS/RDA WG: • Planning survey of cost recovery models for research databases • Input/inspiration: ICPSR Sloan-funded project ‘Sustaining Domain Repositories for Digital Data’ • Developing overarching funding model:
  14. 2.3. A first stab at a model: [Diagram labels: Data producer or sponsor; Flow of funds; Private store; Commercial and institutional storage; Data publication; Data user; Flow of funds. Access: Closed; Collaboration; Conclave; Limited; Subscription content; Commercial overlay; Limited Academic Use/Limited; Public Service. Examples: ICPSR, CERN-LHC; KEGG; GeoFacets; Reaxys; many small operations, e.g., Dryad, arXiv, PDB.] DRAFT - CC-BY-NC 2013, Todd Vision & Anita de Waard
  15. 3.1. Where do institutional repositories fit in? [Table: Repository; Advantages; Disadvantages] Local data repository: Advantages: Easy! No one steals your data. Disadvantages: No one sees it. Not compliant with requirements. Generic data repository: Advantages: Not very hard to do. Have complied! Disadvantages: Data can’t be easily reused. Credit? Institutional repository: Advantages: Can use existing IR? Tracking and compliance checks. Disadvantages: Data can’t easily be reused. Credit? Domain-specific data repository: Advantages: Data can be reused. Credit! Disadvantages: Lot of work for curators. Long-term sustainable? [Axis labels: Effort, Reuse, Credit, Compliance; Habit, Ease, Privacy, Control; Higher quality metadata]
  16. 3.2. The poor researcher: [Diagram labels: Funding Agency; University; Collaborators; Domain of study; Domain-Specific Data Repository; Local Data Repository; Institutional Data Repository; Generic Data Repository] AND THEY ALL WANT DIFFERENT METADATA!!!!
  17. 3.3. Possible pilot project: [Diagram labels: Domain repository; Domain repository; IR; Data; Metadata; Metadata: What data was stored/viewed] • Interview institutions • Normalize reporting data • Talking to: IQSS, Harvard; ICPSR, U Mich; DataDryad, UNC; Pangaea, Germany
  18. 3.4. Institutional pilot study: • Planning series of interviews at key institutions: – What role do libraries/institutions play wrt research data management? – What tools/metadata standards are used? – What aspects of data deposition is the Research Office/IR/Institution interested in? – How does this compare with what scientists want and do in their labs? • Outcomes: – Share knowledge (within institution); – Write joint report (anonymised); – Establish joint plan of action
  19. Elsevier Research Data Services: • 2013/2013: Series of pilots, reviews, and reports: – With CMU: Data/metadata entry and sharing – With IEDA: Repository creation: feasibility study & report – With RDA: Cost of Data Repositories questionnaire – With series of institutes: Interviews re. role of institution • Main questions: – What are key needs? – Can we play a role: skillsets, partnerships? – Is there a (transparent) business model for this? • Principles: – Collaboration is tailored to partner’s needs, using local resources; – Collaboration plan is MoU/Service-Level Agreement; – At all times, all data, reports and software are open and shared.
  20. In summary: 1. If researchers start to curate and share their data… 2. And research databases become long-term sustainable… 3. And libraries, data repositories and grid infrastructures start to work together… We might enable a knowledge infrastructure that allows us to jointly tackle the questions of life!
  21. Many questions remain: ? What carrots and sticks will make researchers share their data? ? How do we create interoperable metadata layers? ? What role would the institution/library play? ? What are sustainable models, moving forward? ? Is there a place for publishers, in all this?
  22. Thank you! Collaborations and discussions gratefully acknowledged: • CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy • UCSD: Phil Bourne, Brian Shoettlander, David Minor, Declan Fleming, Ilya Zaslavsky • NIF: Maryann Martone, Anita Bandrowski • MSU: Brian Bothner • OHSU: Melissa Haendel, Nicole Vasilevsky • California Digital Library: Carly Strasser, John Kunze, Stephen Abrams • Columbia/IEDA: Kerstin Lehnert, Leslie Hsu • ICPSR: George Altman, Mary Vardigan • CNI: Clifford Lynch • Harvard: Michael Kurtz, Chris Erdmann • MIT: Micah Altman • UVM: Mara Saule • RDA: Simon Hodson, Michael Diepenbroek
  23. Your questions? Anita de Waard, VP Research Data Collaborations, Elsevier Research Data Services (VT). http://