• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content




An interoperable data repositories case study: DataONE presented at HUBbub 2012 confence <http:

An interoperable data repositories case study: DataONE presented at HUBbub 2012 confence <http:



Total Views
Views on SlideShare
Embed Views



0 Embeds 0

No embeds



Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    DataONE_cobb_hubbub2012_20120924_v05 DataONE_cobb_hubbub2012_20120924_v05 Presentation Transcript

    • DataONE:  An  interoperable  data  repositories  case  study    John  W.  Cobb    R&D  Staff  and  DataONE  Leadership  Team  Member  Oak  Ridge  Na;onal  Laboratory    HUBbub  2012  ,  the  HUBzero  conference    Indianapolis,  IN  24  September  2012    
    • Acknowledgment:  •  Authorship:  This  talk   represents  work  of  the   en;re  DataONE   extended  team.  •  It  especially  draws  upon   slide  material  from   •  Bill  Michener,  UNM   (esp.  recent  DataONE   AHM  Sept.  18,  2012)   •  Amber  Budden  –   DataONE  Ass.  Dir.  For   CE  •  DataONE  is  an  NSF   supported  project   (OCI-­‐0830944)   2  
    • Hubs  and  data  repositories  •  A  personal  view     (apologies  for  a  possibly  mis-­‐informed  speaker)  •  HUB-­‐roots  (history  and  pre-­‐history)   •  PUNCH:    web  portal  for  running  tools     (DOI:  10.1109/40.846308)   •  -­‐>  NanoHUB:  Applica;on  orchestra;on  environment   •  +  RAPPTURE:  Rapid  Applica;on  por;ng  and  development   •  +  Framelesss  VNC  windows  –     seamless  hosted  environment  on  clients!   •  +  Rich  collabora;ve  environment  and  rich  user  experience  !!  (“wishlist”)   •  Repurpose:  Hubzero  -­‐>  hubs  explode     (ex.  NEESHub  a  cri;cal  advantage  for  largest  research  award  in  Purdue  history)  •  Now  (and  recent  past)  turn  to  Hub+Data  Integra;on.  Some  successes   already  •  Opportunity:  Richer  interac;ons  between  HUB’s  and  mul;ple  data   repositories  •  Perhaps  for  example:  Enable  mul;-­‐project  collabora;on  within  PURR?  •  Or:  Integrate  NEES  DB’s  with  SCEC  simula;ons  and  IRIS  waveforms?   3  
    • Mul;ple  data  repository  access?  •  HUB  +  Database  exists  •  HUB  +  external  data  repository  access  use  case.  •  But  …..  What  if?      •  Access  mul;ple  (possibly  external)  repositories  from  within  a  HUB   environment?   •  Access  mul;ple  external  repositories  with  similar  data?  Say  aggregate  all  data   from  state  hydrologists?  C.f.  driNET  hip://drinet.hubzero.org   •  Integrate  disparate  data  sets  for  new  and  novel  analysis.   Recall    Noshir  Contractor’s  comments  this  morning:  teaming  and     interdisciplinary  work  has  increased  impact  (Wuchty,  Jones,  Uzzi)   •  Enable  reproducible  analysis  and  synthesis  via  a  automated  workflow  to  create   synthe;c  data  products   •  Programma;c  access   •  More  integra;on  (more  than  just  raw  search  terms  a  la  Google)   •  …   •  What  do  you  want  to  discover  today?  (to  paraphrase  Microsol)   4  
    • DataONE  mo;va;on  •  DataONE  is  a  project  to  address   Plan   these  issues  •  Build  (assemble/aggregate)     Analyze   Collect   data  repository  interoperability  •  Advance  state  of  the  prac;ce   data  lifecycle  management   •  Planning   Integrate   Assure   •  Deposi;on   •  Metadata  genera;on   •  Seman;c  integra;on   Discover   Describe   •  Workflow  and  provenance   •  Analysis   Preserve   •  Synthesis  •  Focus  on  a  broad  science  area  •  Deploy  a  working  CI  and  grow  it  •  DataONE  –  Data  Observa;on   Network  Earth   5  
    • 6  
    • Pressing  issues  for  the  digital  data  lifecycle   Plan   Analyze   Collect   Integrate   Assure   Discover   Describe   Preserve   7  
    • Mul;ple  data  sources  –  mutually  reinforcing   Increasing  Process  Knowledge  Decreasing  Spa;al  Coverage   Intensive  science  sites     and  experiments   Extensive  science  sites   Volunteer  &     educa;on  networks   Remote   sensing   Adapted  from  CENR-­‐OSTP   8  
    • Scaiered  data  sources   “finding  the  needle  in  the  haystack”  Data  are  massively  dispersed   •  Ecological  field  sta;ons  and  research  centers  (100s)   •  Natural  history  museums  and  biocollec;on  facili;es  (100s)   •  Agency  data  collec;ons  (100s  to  1000s)   •  Individual  scien;sts  (1000s  to  10,000s  to  100,000s)   9  
    • Data  Preserva;on  and  Planning   ✔ ?   10  
    • Preserva;on:  Poor  data  prac;ce   “data  entropy”   Time  of  publica?on   Specific  details   General  details   Re?rement  or    Informa?on  Content   career  change   Accident   Death   Time   (Michener  et  al.  1997)   11  
    •  Preserva;on:  Data  longevity   Resource Study Resource Type Half-life Rumsey (2002) Legal Citations 1.4 years Harter and Kim (1996) Scholarly Article Citations 1.5 yearsKoehler (1999 and 2002) Random Web Pages 2.0 years Spinellis (2003) Computer Science 4.0 years Citations Markwell and Brooks Biological Science 4.6 years (2002) Education ResourcesNelson and Allen (2002) Digital Library Object 24.5 years Koehler,  W.  (2004)  Informa(on  Research  9(2):  174.   12  
    • The Long Tail of Orphan Data The  Ultra-­‐violet   Most of the bytes divergence   are at the high end, Specialized repositories but most of the (e.g. GenBank, PDB) datasets are at theVolume low end – Jim Gray The  Infrared   Orphan data Catastrophe   (B. Heidorn) Rank frequency of datatype 13  
    • Data  deluge  and  interoperability   “the  flood  of  increasingly  heterogeneous  data”  Data  are  heterogeneous   •  Syntax   •  (format)   •  Schema   •  (model)   •  Seman;cs   •  (meaning)   Jones  et  al.  2007   14  
    • Metadata  universe  (mul;-­‐verse)  •  There  are  a  mul;tude  of  metadata  standards  •  Discipline  and  sub-­‐discipline  specific  •  Each  with  different  terms  and  context  Source:  Jenn  Riley,  Indiana  U.  Digital  Librarian  hip://www.dlib.indiana.edu/~jenlrile/metadatamap/    Via  John  Kunze,  Cal.  Dig.  Lib     15  
    • Each  dot  is  its  own  standard  !  “…billions  and  billions  of  worlds  …”  –  Carl  Sagan   16  
    • DataONE    CI  architectural  Elements  •  Hard-­‐core  cyberinfrastructure  (CI)   •  CI  Member  Node  (MN)  data   repositories   •  Coordina;ng  Node  (CN)  global   metadata  repo’s   Investigator Toolkit Simple,  but  powerful  REST  API/SPI  for   Web Interface Analysis, Visualization Data Management •  universal  access   Client Libraries •  Inves;gator  toolkit  (ITK)  solware  tools   Java Python Command Line to  allow  access  to  the  data  repository   collec;ve  via  familiar  access  idioms   Member Nodes Coordinating Nodes•  Cultural  and  wetware  issues   Service Interfaces Resolution Discovery Educa;onal  Materials   Service Interfaces •  Replication Registration •  Best  prac;ces   Bridge to non-DataONE Coordination Layer •  Workshops  and  tutorials   Member Node services Identifiers Catalog •  Surveys  and  assessments   Preservation Monitor Data Repository •  Scien;st,  policymaker,  ci;zen   Object Store Index engagement   •  Collabora;on,  governance,  and   sustainability   hip://mule1.dataone.org/ArchitectureDocs-­‐current/       17  
    • A  User’s  View   18  
    • Key  Cyberinfrastructure  Elements   •  Unique  iden;fiers   •  Search  and  deliver   •  Replica;on   •  Federated  iden;ty   Usable  by   People  and  their  Agents   19  
    • Suppor;ng  the  data  lifecycle   ORC   Node   UCSB   Node   UNM   Node   1.  Deposi;on/acquisi;on/ingest   } 2.  Cura;on  and  metadata  management  The  data   3.  Protec;on,  including  privacy  lifecycle   4.  Discovery,  access,  use,  and  dissemina;on     5.  Interoperability,  standards,  and  integra;on   6.  Evalua;on,  analysis,  and  visualiza;on" 20  
    • DataONE  Supports  Data  Preserva;on  Three  major  components  for  a   Member  Nodes  flexible,  scalable,  sustainable   •  diverse  ins;tu;ons   Coordina?ng  Nodes  network   •  serve  local  community   •  retain  complete  metadata   Inves?gator  Toolkit   •  provide  resources  for   catalog     managing  their  data   •  indexing  for  search   •  retain  copies  of  data   •  network-­‐wide  services   •  ensure  content   availability  (preserva;on)       •  replica;on  services   21  
    • DataONE  sa;sfies  arch  requirements  •  Enables  integra;on  of  mul;ple  geographically   diverse  and  metadata  diverse  repositories  •  Presents  collec;ve  search  results  across  mul;ple   repository  •  Provides  a  unified  API/SPI  for  search  and   programma;c  interface   hip://mule1.dataone.org/ArchitectureDocs-­‐current/      •  DataONE  content  has  unique  iden;fiers  (DOI’s)  for   referencable/citable  data  objects  •  Supports  both  large  datasets  and  the  long-­‐tails   22  
    • DataONE  spurs  innova;on  •  Enables  new  analysis  and  synthesis  efforts  by   integra;ng  tasks  across  repositories    •  Provides  means  for  data  replica;on  and  basis  for   repositories  to  build  “data  wills”  or  “data  trust”   plans    •  Provides  a  plavorm  to  develop  advanced   interoperable  workflow  tools  and  seman;c   integra;on  tools     23  
    • DataONE:  current  state/recent  progress   24  
    • DataONE:  Suppor;ng  Scien;fic  Data   Preserva;on,  Discovery,  and  Innova;on     Current  Member  Nodes:                           Coming  Soon:      Current  Tools:            Tools  Coming  Soon:     Queensland  University  of  Technology   25  
    • Data  Management  Planning  Tool   hips://dmp.cdlib.org/     26  
    • 27  
    • 28  
    • Plans  per  template  (as  of  June  2012)   400  Approximate  number  of  plans  per  template   350   339   300   287   Templates  of  greatest  interest  to   the  DataONE  community  in  red;   250        2,302  unique  users  to  date   197   200   159   150   133   133   124   101   100   71   65   60   46   37   36   50   34   17   15   6   0   hip://dmptool.org                     29  
    • ✔ Check  for  best  prac;ces   ✔ Create  metadata   ✔ Connect  to  ONEShare   Data  &    Metadata  (EML)   30  
    • 2.  Data  Discovery   31  
    • The  DataONE  Federa;on   32  
    • ORNL  DAAC      as  a  DataONE  Member  Node     NASA  collectors   DAAC  Users      (UWG)   Inves?gator  Toolkit   DataONE  Users   33  
    • hips://cn.dataone.org/onemercury/     34  
    • 35  
    • 36  
    • Inves;gator  Toolkit  Support     Plan   DMP-Tool Analyze   Collect  Kepler Integrate   Assure   Discover   Describe   Preserve   37  
    • Explora;on,  Visualiza;on,  and  Analysis   Diverse  bird  observa;ons  and   Model  results   environmental  data  from   300,00  loca;ons  in  the  US   Occurrence  of  Indigo  Bun?ng  (2008)   integrated  and  analyzed  using   High  Performance  Compu;ng   Resources  Land  Cover   Jan   Apr   Jun   Sep   Dec  Meteorology   •  Examine  paierns  of   migra;on    MODIS  –   Spa;o-­‐Temporal  Exploratory   •  Infer  how  climate  Remote   Model  iden;fies  factors   change  may  affect   affec;ng  paierns  of  migra;on  sensing  data   bird  migra;on     38  
    • Public  Par?cipa?on  in  Scien?fic  Research  Conference:  4-­‐5  August  2012  in  Portland,  Oregon  USA  prior  to  Ecological  Society  of  America  mee;ng  (6-­‐10  Aug.):  hip://www.birds.cornell.edu/citscitoolkit/conference/2012   39  
    • User  Assessments      Scien;sts:  BL   Scien;sts:  FU   Library  Policies:  BL   Library  Policies:  FU   Librarians:  BL   Librarians:  FU   Policy  Makers:  BL   Policy  Makers:  FU   Educators:  BL   Educators:  FU   Year  1   Year  2   Year  3   Year  4   Year  5   40  
    • What  standard  do  you  currently  use?   676 266 95 95 96 97 12 21 26 DIF DwC DC EML FGDC Open ISO My Lab none GIS Metadata  language   41  
    • Many  are  interested  in  sharing  data  Willing  to  share  data  across  a  broad   81%   group  of  researchers  Willing  to  place  at  least  some  of  my   78%   data  into  a  central  data  repository   with  no  restric;ons  Appropriate  to  create  new  datasets   76%   from  shared  data  Willing  to  place  all  of  my  data  into  a   41%   central  data  repository  with  no   restric;ons   0%   20%   40%   60%   80%   100%   Percent  agree   42  
    • Modeler Scientist Manager Resource Ecological Data Librarians Data Service User  Matrix   Investigator ToolKit Data Management Planning Best Practices Tools Database Training Curricula43  
    • Community  Engagement   44  
    • Best  Prac;ces  and  Solware  Tools   45  
    • 46  
    • DataONE:  Next  steps  •  Member  node  growth   •  Number  of  member  nodes   •  Increase  the  number  and  size  of  data  sets   •  Sustainably   •  In  terms  of  resource  needs  form  MN’s   •  In  terms  of  resource  demands  on  DataONE  •  New  Inves;gator  toolkit  tools  (strategically)  •  An  increasing  number  of  science  use  cases  with   more  breakthrough  science  •  Also,  re-­‐purposing  DataONE  CI  outside  of  Bio/Eco/ Env  areas  in  strategic  collabora;ve  partnerships   47  
    • Ack:  DataONE  Team  and  Sponsors   • Amber  Budden,  Roger  Dahl,  Rebecca  Koskela,    Bill   • Ewa  Deelman   Michener,  Robert  Nahf,  Skye  Roseboom,  Mark   Servilla   • Deborah  McGuinness   •   Dave  Vieglais     • Suzie  Allard,  Nick  Dexter,  Kimberly  Douglass,   • Jeff  Horsburgh   Carol  Tenopir,  Robert  Waltz,  Bruce  Wilson   • John  Cobb,  Bob  Cook,  Ranjeet  Devarakonda,   • Robert  Sandusky   Giri  Palanismy,  Line  Pouchard     • Patricia  Cruse,  John  Kunze   • Bertram  Ludaescher   •  Sky  Bristol,  Mike  Frame,  Richard  Huffine,  Viv   • Peter  Honeyman   Hutchison,  Jeff  Moriseie,  Jake  Weltzin,  Lisa  Zolly   • Stephanie  Hampton,  Chris  Jones,  Mai   • Cliff  Duke   Jones,  Ben  Leinfelder,  Andrew  Pippin   • Paul  Allen,  Rick  Bonney,  Steve  Kelling   • Carole  Goble   • Ryan  Scherle,  Todd  Vision   • Donald  Hobern   • Randy  Butler   • David  DeRoure   LEON LEVY FOUNDATION 48  
    • Ques;ons?     Contact  Points     John  W.  Cobb,  Ph.D.   Oak  Ridge     John  W.  Cobb,  Ph.D.   Oak  Ridge  Na;onal  Lab   cobbjw@ornl.gov   865.576.5439     hip://www.dataone.org/      hip://docs.dataone.org         49