• Save
Data Management: Scientist Perspective - UC3 Data Curation Workshop
Upcoming SlideShare
Loading in...5
×
 

Data Management: Scientist Perspective - UC3 Data Curation Workshop

on

  • 363 views

Presentation on data management: the current landscape, barriers to management, and data types. For UC3-CDL data curation for practitioners workshop, 8 Nov 2012 in Oakland CA.

Presentation on data management: the current landscape, barriers to management, and data types. For UC3-CDL data curation for practitioners workshop, 8 Nov 2012 in Oakland CA.

Statistics

Views

Total Views
363
Views on SlideShare
363
Embed Views
0

Actions

Likes
1
Downloads
0
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Data Management: Scientist Perspective - UC3 Data Curation Workshop Data Management: Scientist Perspective - UC3 Data Curation Workshop Presentation Transcript

  • From  Calisphere  via  California  State  University  Libraries,    ark:/13030/c818356g Data A  Scientist’s    @carlystrasser Carly  Strasser Perspective Management
  • Roadmap 5.  Landscape 4.  Solutions 3.  Barriers 2.  How  Bad  Is  It? 1.  A  Brief  History C.  Strasser
  • A BriefFrom  Calisphere  via  Santa  Clara  University,     History of Dataark:/13030/kt696nc7j2 Collection Or…  how  scientists  came  to  be  so   bad  at  data  management
  • The  lab/field  notebook Curie Newton Darwin Da  Vinci classicalschool.blogspot.com
  • From  Flickr  by    DW0825 From  Flickr  by  Flickmor From  Flickr  by    deltaMike The  lab/field  notebook…? www.woodrow.org C.  Strasser Courtesey  of  WHOI From  Flickr  by  US  Army  Environmental  Command
  • Digital  data +   Complex  workflows
  • Data Models Maximum   Likelihood   estimation Matrix   Models Images Tables Paper
  • From  Flickr  by  stevecadman The Wide World of Data
  • Data  Types
  • Dimensions of Data Datum Data  file Metadata Dataset Data  collection Data  repository
  • Data DiversityTemporal   File  size Units File  structure organization File  type Documentatio Spatial n  extent structure Metadata Collection   Codes Project  intent practices Analysis
  • Big DataOSTP  March  2012 Big  Data  Effort  Launched
  • The   LiDle   Guys? From  Flickr  by  jason  tinder
  • The Long Tail Size  of  dataset or grant  ($) #  datasets or  #  researchers or  #  grants
  • From  Flickr  by  Old  Shoe  Woman The Fallout
  • UGLY TRUTH Many  (most?)   researchers…   5shortessays.blogspot.com are  not  taught  data  management don’t  know  what  metadata  are can’t  name  data  centers  or  repositories don’t  share  data  publicly  or  store  it  in  an  archive aren’t  convinced  they  should  share  data
  • Information  Entropy   Fig.  1  of  Michener  et  al.  1997
  • A Real Life Example
  • 2  tables Random  notes C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c B2 ALG02 3 4.51 -22.68 -22.22 0.34 4.31 3.66 25376 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c B5 ALG07 2.9 33.58 -29.44 -28.98 1.74 0.62 -0.03 25382 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 B8 Lk Outlet Alg 3.04 31.43 -29.69 -29.23 1.07 0.95 0.30 25388 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 -21.85 0.45 4.72 4.07 25392 C1 ALG04 2.98 37.90 -27.42 -26.96 1.36 1.21 0.56 25394 c C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 23.78 1.17 From  Stephanie  Hampton
  • Wash  Cres  Lake  Dec  15  Dont_Use.xls C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c B2 ALG02 3 4.51 -22.68 -22.22 0.34 4.31 3.66 25376 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c B5 ALG07 2.9 33.58 -29.44 -28.98 1.74 0.62 -0.03 25382 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 B8 Lk Outlet Alg 3.04 31.43 -29.69 -29.23 1.07 0.95 0.30 25388 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 -21.85 0.45 4.72 4.07 25392 C1 ALG04 2.98 37.90 -27.42 -26.96 1.36 1.21 0.56 25394 c C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 23.78 1.17 From  Stephanie  Hampton
  • Random  stats  output C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c SUMMARY OUTPUT B2 ALG02 3 4.51 -22.68 -22.22 0.34 4.31 3.66 25376 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c Regression Statistics B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c Multiple R 0.283158 B5 ALG07 2.9 33.58 -29.44 -28.98 1.74 0.62 -0.03 25382 R Square 0.080178 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 Adjusted R Square -0.022024 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 Standard Error 1.906378 B8 Lk Outlet Alg 3.04 31.43 -29.69 -29.23 1.07 0.95 0.30 25388 Observations 11 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 -21.85 0.45 4.72 4.07 25392 ANOVA C1 ALG04 2.98 37.90 -27.42 -26.96 1.36 1.21 0.56 25394 c df SS MS F Significance F C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 Regression 1 2.851116 2.851116 0.784507 0.398813 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 Residual 9 32.7085 3.634278 23.78 1.17 Total 10 35.55962 Coefficients Standard Error t Stat P-value Lower 95%Upper 95%Lower 95.0% Upper 95.0% Intercept -4.297428 4.671099 -0.920003 0.381568 -14.8642 6.269341 -14.8642 6.269341 X Variable 1-0.158022 0.17841 -0.885724 0.398813 -0.561612 0.245569 -0.561612 0.245569 From  Stephanie  Hampton
  • C:Documents and SettingshamptonMy DocumentsNCEAS Distributed Graduate Seminars[Wash Cres Lake Dec 15 Dont_Use.xls]Sheet1 Stable Isotope Data Sheet Sampling Site / Identifier: Wash Cresc Lake Peters lab Dont use - old data Sample Type: Algal Washed Rocks Date: Dec. 16 Tray ID and Sequence: Tray 004 13 15 Reference statistics: SD for delta C = 0.07 SD for delta N = 0.15 Position SampleID Weight (mg) %C delta 13C delta 13C_ca %N delta 15N delta 15N_ca Spec. No. A1 ref 0.98 38.27 -25.05 -24.59 1.96 4.12 3.47 25354 A2 ref 0.98 39.78 -25.00 -24.54 2.03 4.01 3.36 25356 A3 ref 0.98 40.37 -24.99 -24.53 2.04 4.09 3.44 25358 A4 ref 1.01 42.23 -25.06 -24.60 2.17 4.20 3.55 25360 Shore Avg Con A5 ALG01 3.05 1.88 -24.34 -23.88 0.17 -1.65 -2.30 25362 c -1.26 -27.22 A6 Lk Outlet Alg 3.06 31.55 -30.17 -29.71 0.92 0.87 0.22 25364 1.26 0.32 A7 ALG03 2.91 6.85 -21.11 -20.65 0.48 -0.97 -1.62 25366 c A8 ALG05 2.91 35.56 -28.05 -27.59 2.30 0.59 -0.06 25368 A9 ALG07 3.04 33.49 -29.56 -29.10 1.68 0.79 0.14 25370 A10 ALG06 2.95 41.17 -27.32 -26.86 1.97 2.71 2.06 25372 B1 ALG04 3.01 43.74 -27.50 -27.04 1.36 0.99 0.34 25374 c SUMMARY OUTPUT B2 ALG02 3 4.51 SampleID -22.68 -22.22 ALG03 0.34 ALG05 4.31 3.66 ALG07 25376 ALG06 ALG04 ALG02 ALG01 ALG03 ALG07 B3 ALG01 2.99 1.59 -24.58 -24.12 0.15 -1.69 -2.34 25378 c Regression Statistics B4 ALG03 2.92 4.37 -21.06 -20.60 0.34 -1.52 -2.17 25380 c Multiple R 0.283158 B5 ALG07 2.9 33.58 Weight (mg) -29.44 -28.98 2.91 1.74 0.62 2.91 -0.03 25382 3.04 2.95 Square 0.080178 R 3.01 3 2.99 2.92 2.9 B6 ref 1.01 44.94 -25.00 -24.54 2.59 3.96 3.31 25384 Adjusted R Square -0.022024 B7 ref 0.99 42.28 -24.87 -24.41 2.37 4.33 3.68 25386 Standard Error 1.906378 B8 Lk Outlet Alg 3.04 31.43 -29.69 %C-29.23 6.85 1.07 0.95 35.560.30 25388 33.49 41.17 Observations43.74 11 4.51 1.59 4.37 33.58 B9 ALG06 3.09 35.57 -27.26 -26.80 1.96 2.79 2.14 25390 B10 ALG02 3.05 5.52 -22.31 delta 13C -21.85 -21.11 0.45 4.72 -28.054.07 25392 -29.56 -27.32 ANOVA -27.50 -22.68 -24.58 -21.06 -29.44 C1 ALG04 2.98 37.90 delta 13C_ca -27.42 -26.96 -20.65 1.36 1.21 -27.590.56 25394 -29.10 c -26.86 -27.04 df SS -22.22 MS F -24.12 Significance F -20.60 -28.98 C2 ALG05 3.04 31.74 -27.93 -27.47 2.40 0.73 0.08 25396 Regression 1 2.851116 2.851116 0.784507 0.398813 C3 ref 0.99 38.46 -25.09 -24.63 2.40 4.37 3.72 25398 Residual 9 32.7085 3.634278 23.78 %N 0.48 1.17 2.30 1.68 1.97 Total 1.3610 35.55962 0.34 0.15 0.34 1.74 delta 15N -0.97 0.59 0.79 2.71 0.99 4.31 -1.69 -1.52 0.62 Coefficients Standard Error t Stat P-value Lower 95%Upper 95%Lower 95.0% Upper 95.0% delta 15N_ca -1.62 -0.06 0.14 2.06 Intercept -4.297428 4.671099 3.66 0.34 -2.34 -2.17 -0.920003 0.381568 -14.8642 6.269341 -14.8642 6.269341 -0.03 X Variable 1-0.158022 0.17841 -0.885724 0.398813 -0.561612 0.245569 -0.561612 0.245569 4.00 3.00 2.00 1.00 Series1 0.00 -35.00 -30.00 -25.00 -20.00 -15.00 -10.00 -5.00 0.00 -1.00 -2.00 -3.00
  • The  Fallout Data   Reuse Data   Sharing Data   Management
  • From  Flickr  by  iowa_spirit_walker Stewardship Why? Hurdles to Data
  • From  Flickr  by  indigoprime Barriers: Cost From  Flickr  by  kobiz7 C.  Strasser
  • Barriers: Sociocultural From  Flickr  by  freefotouk Not  the  norm
  • Barriers: Sociocultural Not  the  norm Lack  of  /  too  many  standards
  • Barriers: Sociocultural Not  the  norm Lack  of  /  too   From  Flickr  by  toucanradio many  standards Disparate  data From  Flickr  by  Chris  Campbell
  • Barriers: Sociocultural From  Flickr  by  uniinnsbruck Not  the  norm Lack  of  /  too  many  standards Disparate  data Lack  of  training
  • From  Flickr  by  Christina  Ann   VanMeter Missed   opportunities Loss  of  rights  or  benefits From  Flickr  by  pnh Barriers: Sociocultural Conflict From  Flickr  by  tymesynk Misuse
  • Barriers: Sociocultural Lack  of  incentives Time   consuming  &   expensive No  From  Flickr  by  bthomso requirements Reward   structure
  • From  Flickr  by    MarqueYe  University generation?But what about the next
  • Are Undergrads Learning AboutData Management? Survey  of  Undergraduate  Ecology  courses 38  Research  focused 48  schools 10  Education  focused Schools  selected… •  One  of  top  grad  schools  for  Ecology •  Most  recipients  of  NSF  GRFP
  • Are Undergrads Learning AboutData Management?Strasser  and  Hampton,  In  press  at  Ecosphere Metadata  generation Software  choice File  naming QAQC Backing  up   Workflows Data  sharing Data  re-­‐‑use Meta-­‐‑analysis Reproducibility Notebook  protocols Databases  
  • Are Undergrads Learning AboutData Management? Quality control & Yes assurance No Naming computer files Types of files & software to use Metadata generation Workflows Protecting data Databases & data archiving Data re-use Meta-analysis Data sharing Reproducibility Notebook protocols 0 20 40 60 80 100 Percent
  • Are Undergrads Learning AboutData Management? 4.5 4.0 Importance for Undergraduates 3.5 3.0 2.5 2.0 r = 0.306 1.5 p < 0.05 1.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 Value as a Researcher
  • Are Undergrads Learning About Data Management? 70 Time 60 Not  appropriate   at  this  level 50 Covered  in  lab Percent Citing Barrier 40 Students  not   Class  too   prepared large Instructor   30 Funding/ not  prepared resources 20 Covered  in   10 other  courses 0 e el ab ed es e ed es Tim lev in l par urc larg par urs his ed re eso too re r co at t ver ot p g/ r ss ot p the te co sn in Cla or n o pria be ent und uct d in pro oul d Stu d F Ins tr re t ap Sh C ove No
  • Solutions
  • Preach the BenefitsShort  term •  Spend  less  time  on  DM  &  more  on  research •  Easier  to  use  data •  Collaborators  can  understand  &  use  data Long  term •  Scientists  outside  project  can  find,  understand  &   use  data •  Credit  for  data  products  &  their  use •  Funders  protect  their  investment  
  • EducateDataONE  Best  Practices  Lessons What  is  the  Data  Life  Cycle?   Data  management  planning   Data  collection,  entry  &  manipulation   Data  quality  control  &  quality  assurance   Data  protection:  Data  backups   Data  preservation,  curation  &  archiving   Data  documentation:   What  is  metadata? Value  of  metadata   How  to  write  good  metadata   Data  management:  Data  sharing   Data  citation  practices   Collaboration/communication  technologies   Data  analysis:  Workflows  and  other  tools  
  • Primer
  • Use New TechnologyE-­‐‑notebooks •  NoteBook •  ORNL  eNote   •  Evernote •  Google  Docs •  Blogs •  wikis •  TheLabNotebook.com •  NoteBookMaker Workflows TheLabNotebook.com!•  Kepler •  Taverna
  • Use New Technology
  • C.  Strasser The Current Landscape
  • Data are being recognized asfirst class products of research From  Flickr  by  Richard  Moross
  • Publishing DataFrom  Flickr  by  Richard  Moross
  • From  Flickr  by  Richard  Moross Citing DataExample: Sidlauskas,  B.  2007.  Data  from:  Testing  for  unequal  rates  of  morphological  diversification  in  the  absence  of  a  detailed  phylogeny:  a  case  study  from  characiform  fishes.  Dryad  Digital  Repository.  doi:10.5061/dryad.20
  • From  Flickr  by  Richard  Moross Data Management RequirementsJournal  publishers
  • From  Flickr  by  Richard  Moross Data Management Requirements Journal  publishers Funders
  • Plan Analyze Collect Integrate Assure From  Flickr  by  darkuncle   Discover Describe Preserve
  • What  is  a   Will  I  get   data   credit  for  my   work? Plan management   plan? Analyze Collect What  tools  do   Are  there   I  use? standards? Integrate Assure What  is   How  much   metadata? Who  can  help   will  it  cost? From  Flickr  by  darkuncle   me? Discover Describe Where  do  I   Preserve How  do  I   preserve  my   preserve  my   data? data?
  • B A C
  • NSF  funded  DataNet  Project Office  of  Cyberinfrastructure Community   Cyberinfrastructure Engagement  &   Outreach Courtesy  of  DataONE
  • My  website carlystrasser.net Email  me carlystrasser@gmail.com Tweet  me @carlystrasser   My  slides slideshare.net/carlystrasser CDL  Blog datapub.cdlib.org