ELAG workshop session 1 and 2 v10

ELAG2012 Workshop 3TU.Datacentrum Session 1 and 2

Speaker notes
  • Metadata - not ‘just’ bibliographic: very domain-specific, and the distinction from the data is more fuzzy.
  • Group interaction - form 3 groups. Draw 3 graphs (multiple lines per graph are allowed, but explain what they are) in the next … min. Select 3 more or define your own graphs. Present & discuss the most interesting one from every group.
  • Explain: form 3 groups. … min. to select or define new propositions, at least 3 (not from the same proposition group); … min. to walk around and ‘vote’ on every proposition, please write comments and ‘sign off’; … min. to discuss the opposing opinions.

    1. ELAG Workshop “Data repository challenges”
       Wednesday, May 16th 2012, Session 1 & 2
       Jeroen Rombouts & Egbert Gramsbergen
    2. Programme
       Session 1 (14:30 – 15:30): “meta - data - value - …”
       • Round of introductions: who is who and why this workshop?
       • Short intro 3TU.DC
       • Background information
       • Case: Traffic flow observations
       • Warming-up: graphs
       Break
       Session 2 (16:00 – 17:00): “producers - consumers - attitudes - …”
       • ‘Discipline’ differences (researchers & repositories)
       • Dotmocracy ‘Lite’
       • Conclusions
    3. 1. Who is who?
       • Who are you?
       • Why are you interested in this topic?
    4. 2. 3TU.Datacentrum = …
       • 3 Dutch TUs: Delft, Eindhoven, Twente
       • Project 2008-2011, going concern 2012-
       • Data archive (2008-)
         – “finished” data
         – preserve, but do not forget usability
         – metadata harvestable (OAI-PMH; see the harvest sketch below)
         – crawlable (OAI-ORE linked data)
         – data citation information (incl. DataCite DOIs)
       • Data labs
         – just starting (hosting)
         – unfinished data + software/scripts
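Because the archive exposes its metadata over standard OAI-PMH, any harvester can pull records with plain HTTP. A minimal harvest sketch in Python; the endpoint URL is an assumption for illustration, since the slide only states that the metadata is harvestable:

```python
# Minimal OAI-PMH harvest sketch. The endpoint URL is hypothetical;
# the slide only says metadata is harvestable via OAI-PMH.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://data.3tu.nl/oai"   # assumption: real base URL not on the slide
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"
with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

# Print the title and identifier (often a DOI) of each harvested record.
for record in tree.iter(OAI + "record"):
    title = record.findtext(".//" + DC + "title")
    ident = record.findtext(".//" + DC + "identifier")
    print(title, "->", ident)
```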
    5. Website & Data-archive
       • http://datacentrum.3tu.nl
         – Information: news, announcements, publications, links and tutorials
       • http://data.3tu.nl
         – Data sets: download and ‘management’
         – ‘Use’ data with Google Maps/Earth, OPeNDAP, … (OPeNDAP access sketched below)
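OPeNDAP is what makes ‘using’ data without downloading it possible: a consumer can subset a remote dataset over HTTP. A sketch with the netCDF4 library, assuming a hypothetical dataset URL and variable name (neither is given on the slide):

```python
# Sketch: subsetting a remote dataset over OPeNDAP. The URL and the
# variable name are invented; requires netCDF4 built with DAP support.
from netCDF4 import Dataset

url = "http://opendap.example.org/thredds/dodsC/some/dataset.nc"  # assumption
ds = Dataset(url)                           # opens lazily, no full download

print(list(ds.variables))                   # discover what the set contains
subset = ds.variables["rainfall"][0:10, :]  # transfer only this small slice
print(subset.shape)
ds.close()
```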
    6. Data archiving options
       • ‘Simple’ sets (Do It Yourself)
         Standard (self-)upload form and descriptive information, single file per object (can be a ‘zipped’ collection), single DOI, …
         E.g.: Zandvliet, H.J.W. et al. (2010): Diffusion driven concerted motion of surface atoms: Ge on Ge(001). MESA+ Institute For Nanotechnology, University of Twente. doi:10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d
       • Special collections (Do It Together)
         Negotiate: deposit procedure, description (xml, picture, preview), data model, level of DOI assignment, query online, …
         E.g.: Otto, T., Russchenberg, H.W.J. (2010): IDRA weather radar measurements - all data. TU Delft - Delft University of Technology. doi:10.4121/uuid:5f3bcaa2-a456-4a66-a67b-1eec928cae6d
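Since every set gets a DataCite DOI, citation information can be retrieved straight from the DOI resolver via content negotiation, a standard DataCite service independent of this repository. A sketch using the first DOI above:

```python
# Fetch a formatted citation for a dataset DOI via content negotiation
# at doi.org; this is supported for DataCite DOIs like the ones above.
import urllib.request

doi = "10.4121/uuid:3f71549c-6097-4bb8-bc00-6db77deb161d"  # from this slide
req = urllib.request.Request(
    "https://doi.org/" + doi,
    headers={"Accept": "text/x-bibliography; style=apa"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))   # prints an APA-style citation
```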
    7. Training & Data-labs
       • http://dataintelligence.3tu.nl
         – Reference, news & events for training library staff
       • OpenEarth, SHARE, …?
    8. Questions
    9. 3. Background information
       • Workshop scope
         – Need for change?/!
         – Questions (for now)
       • Report inputs
         – NSF/NSB: definitions
         – RIN: discipline/data differences
         – DANS/3TU.DC: value/selection/DSA/…???
    10. Data Deluge
       • Data in 2015: approx. 18 million times the Library of Congress (in size).
       • Video data in 2005: half of all digital data.
       • According to Eric Sieverts: at the current growth rate, in 2210 the number of bytes will equal the number of atoms on planet Earth (he predicts that before that happens something will change ;-)).
       • CERN-LHC: 10-15 PB/yr.
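The 2210 claim can be sanity-checked with a back-of-envelope calculation. The constants in the sketch below are my assumptions, not figures from the slide: roughly 1.3×10^50 atoms on Earth and a ~200 TB “Library of Congress”, which together with the slide’s 18-million multiplier give a ~3.6 ZB starting point for 2015. Treat the result as an order-of-magnitude illustration only:

```python
# Back-of-envelope check of the 2210 claim. Every constant here is an
# assumption chosen for illustration, not a figure from the slides.
ATOMS_ON_EARTH = 1.3e50            # rough literature estimate
LOC_BYTES = 200e12                 # ~200 TB per "Library of Congress"
BYTES_2015 = 18e6 * LOC_BYTES      # slide: 18 million times LoC in 2015

years = 2210 - 2015
growth = (ATOMS_ON_EARTH / BYTES_2015) ** (1 / years)
print(f"implied annual growth: {growth:.2f}x")   # ~1.40, i.e. ~40% per year
```

A ~40% annual growth rate is in line with data-growth estimates circulating at the time, so the prediction is at least internally consistent.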
    11. Workshop scope
       Preconditions
       • Challenge: too much data (to keep). Technology (storage capacity, cooling, energy), organisations (strategies, budgets) and people (awareness, training) can’t keep (this) up!
       • Upside: not all data is valuable in the future; some relevant (de)selection experience in archiving, some efficiency improvements, ‘some’ increase in storage capacity, …
       Questions
       • Which research output to share and preserve?
       • Who are the players involved?
       • How to collect and preserve the research output? Roles of university libraries…
       Conclusions on differences between documents and research data?
    12. NSF/NSB - 1/3
       • Data. For the purposes of this document, data are any and all complex data entities from observations, experiments, simulations, models, and higher-order assemblies, along with the associated documentation needed to describe and interpret the data.
       • Metadata. Metadata are a subset of data, and are data about data. Metadata summarize data content, context, structure, interrelationships, and provenance (information on history and origins). They add relevance and purpose to data, and enable the identification of similar data in different data collections.
    13. NSF/NSB - 2/3
       3 functional types of data collections:
       • Research Collections
         Authors are individual investigators and investigator teams. Research collections are usually maintained to serve immediate group participants only for the life of a project, and are typically subjected to limited processing or curation. Data may not conform to any data standards.
       • Resource Collections
         Resource collections are authored by a community of investigators, often within a domain of science or engineering, and are often developed with community-level standards. Budgets are often intermediate in size. Lifetime is between the mid- and long-term.
    14. NSF/NSB - 3/3
       • Reference Collections
         Reference collections are authored by and serve large segments of the science and engineering community and conform to robust, well-established and comprehensive standards, which often lead to a universal standard. Budgets are large and are often derived from diverse sources with a view to indefinite support.
       [NSF, originally: National Science Board report on Long-Lived Digital Data Collections, …]
       Differences:
       • Community size
       • Collection lifetime
       • Level of standardization
       • Amount of processing
       • Budget size & sources
       • …
    15. RIN
       • Many different kinds and categories of data:
         – scientific experiments;
         – models or simulations; and
         – observations of specific phenomena at a specific time or location.
       • Datasets are generated for different purposes and through different processes.
       • Data may undergo various stages of transformation.
       • The quality of metadata provided for research datasets is very variable.
       • Varying degrees of data management, efforts, resources and expertise.
       • There are significant variations, as well as commonalities, in researchers’ attitudes, behaviours and needs, in the available infrastructure, and in the nature and effect of policy initiatives, across different disciplines and subject areas.
       • …
    16. DANS/3TU.DC
       Key findings
       • No solid definition of “research data” found.
       • Lots of literature on the selection process, but…
       • Not a single case of a selection policy for digital data found. Apparently a lot of implicit selection is going on, considering the available digital research data.
       Reasons for preserving research data:
       • Obligation to enable re-use (by funder, publisher)
       • Other arguments: inter- or intra-disciplinary value, hard to repeat, value for historic research
       • Obligation for verification (by code of conduct, employer, publisher)
       • Non-scientific arguments (heritage, responsibility to society)
    17. Docs vs. Data (Differences)
       • Object sizes (capacity)
       • Collection sizes/granularity (number of objects)
       • Metadata (type, standards and distinction from object)
       • Heterogeneity of collections (not discipline differences)
         – Data category (experiment, model/simulation, observation)
         – Data generation process (man-made vs. machine-made or …)
         – File formats
       • Attitudes to ‘publishing’
       • Resources, expertise, efforts on data management
       • Selection inevitable
       • Value?
       • …
       • … Anything to add? (list to be expanded in the workshop)
    18. Questions, suggestions, …
    19. 4. Case: Traffic flow observations
       • Case
         Researchers needed to clear disk space and offered data which were “expensive to gather and had required quite a lot of computation to process.” The project was already finished.
       • Content
         Pictures of highway stretches shot from a helicopter. Shoulder open/closed, several flights, raw/stabilized, several dates, calibration image, calibration software and settings.
    20. Questions for the case
       • Which data to ingest?
         – raw pictures, stabilized pictures, movies or … vectors and type of cars?
         – GPS logs
         – calibration image
         – stabilisation software/data
       • Who are involved?
         – data producer (researcher)
         – research funder (owner)
         – data repository
       • How to preserve?
         – GPS logs: as data or metadata, all flight data or only when recording?
         – the software (code or executable?)
         – picture formats (tiff, png, jpeg2000, …)? (format migration is sketched below)
         – granularity (per flight, per location, per recording, …)?
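On the picture-format question: one common preservation move is migrating images to an open, well-documented format at ingest. A minimal sketch with the Pillow library; the file names are invented, and nothing on the slide says the repository actually did this:

```python
# Sketch: migrating a TIFF master to PNG at ingest. File names are
# hypothetical; the slides do not say the repository actually did this.
from PIL import Image

src = "flight01_frame0001.tif"     # hypothetical raw camera frame
dst = "flight01_frame0001.png"     # open, losslessly compressed target

with Image.open(src) as im:
    im.save(dst, format="PNG")     # PNG preserves the pixel data losslessly
```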
    21. The data
    22. Collection
    23. Top level dataset
    24. Low level dataset (stabilized data)
    25. …
    26. Citation information
    27. Docs vs. Data (Differences)
       • Object sizes (capacity)
       • Collection sizes/granularity (number of files)
       • Metadata (type, standards and distinction from object)
       • Heterogeneity of collections
         – Data category (experiment, model/simulation, observation)
         – Data generation process (man-made vs. machine-made or …)
         – File formats
       • Attitudes to ‘publishing’
       • Resources, expertise, efforts on data management
       • Selection inevitable
       • Value?
       • Citation practice
       • …
       • … Anything to add?
    28. Questions, suggestions, …
    29. 5. To the graphs…
    30. Break
    31. Session 2
       Session 2 (16:00 – 17:00): “producers - consumers - attitudes - …”
       • ‘Discipline’ differences (researchers & repositories)
       • Dotmocracy ‘Lite’
       • Preliminary conclusions?
       Back to plenary presentations
    32. What our account managers ‘sell’…
       The benefits for data producers and data consumers:
       • Increased visibility of research output (metadata in repository networks, assigning DOIs, facilitating increased citation rates for ‘enhanced publications’, …);
       • Improved quality of datasets (quality assurance for multi-user setup, checks on ingest, …);
       • (Long-term) preservation of, and accessibility to, valuable research data;
       • Distribution of research data for reuse, including administration and usage statistics;
       • Advice on data management, rights, formats, metadata, etc.
    33. Value
       • Secure research data
       • Cite/claim (DOIs)
       • Quality assurance (support)
       • Data exchange
       • Data visibility
       • Support EU projects, communities
       • Extra show window
       • Relation with non-academic research, society
       • Prepare for paradigm shift
       • Enable verification
    34. What do data producers say? 1/2
       • “Only for long-term continuous data”
       • “Datasets are stored by publisher”
       • “No time!”
       • “Our research is once only”
       • “Interesting but not for me”
       • “Nobody needs my data”
       • “Our datasets are confidential”
       • “Data transfer not needed, every PhD does own project”
    35. What do data producers say? 2/2
       • “Very useful; essential metadata often missing”
       • “When can I store my datasets?”
       • “Much to improve in reuse of data”
       • “Good opportunity to share datasets we bought”
       • “Would like to publish data”
       • “Surprising our university had no facility for data preservation”
       • “Transfer of data between PhDs can be improved”
    36. Workshop with researchers
       “Data should only become available after publication”
    37. Workshop results
       • Confirmed:
         – Different domains have commonalities
         – A need for support on research data management exists
       • There are strong differences depending on:
         – Research type
         – Data types
         – Individual attitudes
    38. ‘Conclusions’ on valuable data
       Which data to preserve? And why?
       • Data of ‘enhanced publications’ (underlying data and visualisations linked to publications). Increases publication value (stronger basis, more citations, …);
       • Data generated by ‘hard to repeat’ processes, e.g. high cost, (environmental) observations, complex or continuous experiments, …;
       • Data collected with public funding. Conditions set by funding organisations or publishers like Nature Publishing Group, NWO, governmental organisations, universities, …;
       • Preferably open-access data with potential for reuse (verification, new research, …). Increases visibility, efficiency and quality of research efforts.
       • … Anything to add?
    39. Docs vs. Data (Differences)
       • Object sizes (capacity)
       • Collection sizes/granularity (number of files)
       • Metadata (type, standards and distinction from object)
       • Heterogeneity of collections
         – Data category (experiment, model/simulation, observation)
         – Data generation process (man-made vs. machine-made or …)
         – File formats
       • Attitudes to ‘publishing’
       • Resources, expertise, efforts on data management
       • Selection inevitable (due to size)
       • Value of research data higher
       • Readability of research data is lower (zero without metadata)
       • Citation practice
       • …
       • … Anything to add?
    40. The End
       In one line: “The challenge is to find the ready, able and willing (researchers)”
    41. To Dotmocracy…
       • 15 min. to select or define new propositions (approx. 3) and write them on a sheet.
       • 15 min. to ‘vote’ on every sheet.
       • 15 min. for plenary discussion on opposing opinions.
    42. Responsibility Propositions 1/4
       • All research data should be stored in disciplinary archives.
       • Research institutes must register data produced by their researchers.
       • Libraries are the best departments at universities to take on research data archiving.
    43. Obligation Propositions 2/4
       • Data producers should be obliged to publish their (anonymised) research data as open data.
       • High-cost research facilities should be obliged to share (and preserve) their data.
       • Users should log in to download data.
       • Data repositories should never accept data in proprietary file formats.
    44. Value Propositions 3/4
       • Only datasets which are linked to publications need to be preserved for the long term.
       • Not simulation results but algorithms and boundary conditions should be stored.
       • Each dataset should also include the data in its rawest form.
    45. Misc. Propositions 4/4
       • University libraries have a harder job attracting datasets from the exact sciences than from the humanities.
       • Researchers are sloppy (they regard documentation as irrelevant and annoying).
       • Session #4 should be on the beach with lots of beer.
    46. Docs vs. Data (Differences)
       • Object sizes (capacity)
       • Collection sizes/granularity (number of files)
       • Metadata (type, standards and distinction from object)
       • Heterogeneity of collections
         – Data category (experiment, model/simulation, observation)
         – Data generation process (man-made vs. machine-made or …)
         – File formats
       • Attitudes to ‘publishing’
       • Resources, expertise, efforts on data management
       • Selection inevitable (due to size)
       • Value of research data higher
       • Readability of research data is lower (zero without metadata)
       • Citation practice
       • (A document is data)
       • Boundaries of data (sets) are less clear than for documents
       • Assigned responsibilities and tasks
       • Legal status
       • …
    47. All Propositions 1/1
       • All research data should be stored in disciplinary archives.
       • Research institutes must register data produced by their researchers.
       • Libraries are the best departments at universities to take on research data archiving.
       • Data producers should be obliged to publish their (anonymised) research data as open data.
       • High-cost research facilities should be obliged to share (and preserve) their data.
       • Users should log in to download data.
       • Data repositories should never accept data in proprietary file formats.
       • Only datasets which are linked to publications need to be preserved for the long term.
       • Not simulation results but algorithms and boundary conditions should be stored.
       • Each dataset should also include the data in its rawest form.
       • University libraries have a harder job attracting datasets from the exact sciences than from the humanities.
       • Researchers are sloppy (they regard documentation as irrelevant and annoying).
       • Session #4 should be on the beach with lots of beer.
    48. Dotmocracy results 1/3
       “Users should log in to download data”
       Scale: Str. Agree – Agree – Neutral – Disagree – Str. Disagree
       Votes: xx xx x
       + Should be for some data types (sensitive)
       + It helps to get an idea of usage
       + Anonymity(?) on the net is a ‘2000’ thought anyway
       + Accept license
       + Trace of use for data producers
       - Raises the threshold for re-use
    49. Dotmocracy results 2/3
       “Data repositories should never accept files in proprietary formats”
       Scale: Str. Agree – Agree – Neutral – Disagree – Str. Disagree
       Votes: xxxxxx xxxxxx xxxxxx xx
       + Easy to reuse data in open formats
       - Better to have proprietary data than none at all
       - May preclude data if we insist on open formats
       - Can be migrated to open formats (sometimes)
    50. Dotmocracy results 3/3
       “Libraries are the best departments at universities to take on research data archiving”
       Scale: Str. Agree – Agree – Neutral – Disagree – Str. Disagree
       Votes: xx xxxxxxxxxxxx xx xxxxxx
       + Co-operation with researchers already exists
       + Librarians have good metadata skills
       o The library’s vendor should deliver the service(?)
       + Full control and close to the researcher(?)
       - Challenge too big: long-term sustainability
       + Builds on the metadata knowledge of libraries
       - Must have IT in co-operation
       - Archiving skills
    51. Responsibility
       • All research data should be / is best stored in disciplinary archives.
         – Bigger bodies of (mono)disciplinary data for consumers
         – Discipline-specific metadata, guidelines and support
         – Sustainability of data-archive organisations
         – Research data ownership at research institutes
       • Research institutes should register data
         – …
       • Libraries are the best departments at universities to take on research data archiving.
         – Accessibility
         – Archiving
         – IT knowledge
         – Infrastructure
    52. Obligations
       • Data producers / high-cost facilities should be obliged to publish their (anonymised) research data.
         – Risk: “Garbage in …”
         – Funding consequences WOULD make a difference
       • Login/registration of data consumers
         – Accept license
         – User statistics for archive funding
         – Trace of use for data producers
         – Raises the threshold for re-use
       • Data repositories should refuse proprietary formats
         – …
    53. Value
       • Only data linked to publications
         – Data can be measured faster than it can be analyzed
         – An accepted article is proof of value AND documentation
         – Possible future value without present publication?
         – …
       • Not simulation results but algorithms
         – Software is more difficult to authentically reproduce
         – Data calculation can be very time/resource-consuming
         – Simulation datasets can be very large
         – The ability to calculate higher resolutions, faster, is increasing
         – …
       • At least data in its rawest form
         – Interpretation (processing) might be done wrong
         – Interpretation (processing) is only for super-experts and must be generally accepted
         – Raw data can be very large (PIV, IDRA, …)
         – …
