RDM slides March 2014


Slides for Research Design Graduate Course - http://courses.cornell.edu/preview_course_nopop.php?catoid=14&coid=190464#

  • Data, documentation and associated files (e.g. SAS, SPSS, Stata) are housed on the CISER file server. Files are downloaded from the catalog in ZIP compressed format.
    Cross-National Time Series data
  • As CISER is an ICPSR member, researchers can gain access to data held in those CESSDA Archives that are themselves ICPSR members
    CESSDA member organisations adhere to a Trans-border Data Access Agreement
  • European community household panel survey, European Union labour force survey, Community Innovation survey, European health Interview Survey, Structure of Earnings Survey, European Union Statistics on Income and Living Conditions
  • What about preserving?
  • Observational – sensor data, survey or sample data, neuroimages – e.g. ocean temperature, voters' attitudes before an election, photographs of a supernova
    Experimental – e.g. gene sequences, chromatograms, toroid magnetic field data, HPLC, gel electrophoresis, chemical reaction rates
    Simulation – e.g. climate models, economic models, algorithms
    Derived – e.g. text and data mining, compiled database, 3D models, maps
    Reference - e.g. gene sequence databanks, chemical structures, spatial data portals
  • Funded by JISC as part of its UK programme Managing Research Data, to develop online learning materials that help researchers manage their digital assets.
    IAD – set up to deliver training and development for postgraduate students and staff – via online course, Virtual Learning Environments, transferable skills training
  • Shareable Content Object Reference Model – XML-based

    1. 1. CISER Data Archive & Introduction to RDM Stuart Macdonald CISER Data Services Librarian srm262@cornell.edu Research Design CRP-7201, Stone Laboratory, Cornell Univ. 19 March 2014
    2. 2. • CISER Data Archive • What is Research Data Management (RDM) • Research Data Defined • Data Management Planning • Organising Data • File Formats & Transformations • Documentation & Metadata • Storage & Security • Data protection & Rights • Preservation & Sharing • Research Data MANTRA
    3. 3. CISER Data Archive: Collection and Services Established over 30 years ago Collection of numeric datasets to support quantitative research c. 27,000 online files in addition to thousands of studies on CD/DVD Emphasis on demography (state/federal censuses), economics, health, labor, election studies, attitudinal and behavioral studies, family life etc.
    4. 4. • Consulting services to match user needs with appropriate data and statistical analysis software • Finding, accessing and using data • Current Cornell researchers can download archive files from the online catalog (search & browse) in formats compatible with statistical software • Data files are identified by a ‘traffic light’ icon that indicates usage level: • Green – downloadable by anyone • Yellow – downloadable from links in the catalog with CUWebAuth authentication (for use within the CISER research computing environment, CISERRSCH); Cornell researchers can apply for a computing account • Red – data to be used under restriction (via CRADC or conditions imposed by the data provider)
    5. 5. CISER Data Catalog:
    6. 6. CISER Data Archive maintains links to a range of social science data resources including: • Data Distributors and Producers: U.S. Government e.g. Dept. Agriculture, Dept. Commerce, Dept. Energy, Dept. Justice, Dept. Labor, Federal Agencies • Data Distributors and Producers: Other U.S. Sources • Data Distributors and Producers: International e.g. Eurostat, FAOSTAT, ILO, OECD, UN Statistics Division, World Bank • Data Libraries and Archives e.g. Harvard-MIT Data Center, UKDA, DANS, CESSDA • Social Science Research Institutes e.g. Odum Institute, Survey Research Institute • Online Reference Tools e.g. Boundary files, geocoding tools, SIC codes, data citation tools • State and Local Government data and statistical sources e.g. NY State Depts. Education, Health, Labor, State Data Center See URL: http://ciser.cornell.edu/ASPs/datasource.asp
    7. 7. • Provides Cornell social science researchers with a repository for sharing and providing long-term preservation of their numeric/statistical research data • Participates in Cornell’s Research Data Management Service Group • Assists Cornell social science researchers with Research Data Management (RDM) plans • Provides Cornell social science researchers with support and expertise in obtaining and using restricted data
    8. 8. Other social science research data resources: • Inter-University Consortium for Political and Social Research (ICPSR) • National Archive of Criminal Justice Data • Minority Data Resource Center • National Archive of Computerized Data on Aging • Roper Center for Public Opinion Archives • International Data Archives • CESSDA, UKDA, Eurostat • CESSDA catalog (DDI) provides a multi-lingual interface to datasets from member social science data archives across Europe • Non-Governmental Organizations • National / Governmental Statistical Agencies
    9. 9. URLs: • CISER Data Archive Catalog: http://ciser.cornell.edu/ASPs/search.asp • ICPSR: www.icpsr.umich.edu/ • Roper Center for Public Opinion Research: http://www.ropercenter.uconn.edu/ • CESSDA: http://www.cessda.org/ • Eurostat: http://www.epp.eurostat.ec.europa.eu/
    10. 10. Location & hours: CISER Data Archive is located at 391 Pine Tree Road, Ithaca. CISER is open 8.30am – 4.30pm (Mon-Fri); walk-in assistance is not always available, so appointments are recommended. Contacts: Tel.: (607) 255 4801 Email: ciser@cornell.edu
    11. 11. Introduction to Research Data Management (RDM)
    12. 12. Why Manage Research Data? Current research data management initiatives are based on three trends: The data deluge – exponential growth in volume of digital research artifacts created within academia (often created by publicly funded research) Data management is required by multiple disciplines Increasing perception of the value of data (data as commodity)
    13. 13. What is Research Data Management? • RDM is an umbrella term to describe all aspects of planning, organising, documenting, storing and sharing research data. • It also takes into account issues such as documentation, data protection and confidentiality. • It provides a framework that supports researchers and their data throughout the course of their research and beyond. • It is one of the essential areas of responsible conduct of research.
    14. 14. Research Data Lifecycle Pink Colored Umbrellas Are Pretty Darned Rainproof
    15. 15. Research Data Defined The US Office of Management and Budget, in its grants management circular A-110, defines research data as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.” The KRDS2 study (Beagrie et al., 2009) defines research data as ‘collections of structured digital data from any disciplines or sources which can be used by academic researchers to undertake their research or provides an evidential record of their research.’ RIN Classification* • Observational – real-time, unique, usually irreplaceable • Experimental – from lab equipment, expensive, often reproducible • Simulation – generated from models; model & metadata are as important as output data • Derived – resulting from processing or combining “raw” data; reproducible but expensive • Reference – a (static or organic) collection of smaller (peer-reviewed) datasets, probably published and curated * Stewardship of digital research data: a framework of principles and guidelines, Research Information Network, 2008. URL: http://tinyurl.com/l56gftx
    16. 16. Research Data Defined • Research data, unlike other information types, is collected, observed, or created, for purposes of analysis to produce original research results. • Research data can be generated for different purposes and through different processes in a multitude of digital formats.
    17. 17. Research data comes in many varied formats: Text - flat text files, Word, Portable Document Format (PDF), Rich Text Format (RTF), Extensible Markup Language (XML). Numerical - SPSS, Stata, Excel. Multimedia - jpeg, tiff, dicom, mpeg, quicktime. Models - 3D, statistical. Software - Java, C. Discipline specific - Flexible Image Transport System (FITS) in astronomy, Crystallographic Information File (CIF) in chemistry. Instrument specific - Olympus Confocal Microscope Data Format, Carl Zeiss Digital Microscopic Image Format (ZVI).
    18. 18. Research data may include the following: • Documents (text, MS Word), spreadsheets • Lab books, field notes, diaries • Questionnaires, transcripts, codebooks • Audiotapes, videotapes, photographs, images • Slides, artefacts, specimens, samples • Collection of digital objects acquired & generated during the research process • Database contents (video, audio, text, images) • Models, algorithms, scripts • Contents of an application (input, output, logfiles for analysis software, schemas) • Methodologies, workflows • SOPs, protocols
    19. 19. By managing your data you will: • ensure scientific integrity of research and aid replication • ensure research data and records are accurate, complete, authentic and reliable • increase your research efficiency • save time, effort and resources in the long run • enhance data security and minimise the risk of data loss • prevent duplication of effort by enabling others to use your data • meet funding grant requirements Note: It may also be important to manage research records (both digital & hardcopy) during and beyond the life of the project such as: correspondence (emails) grant applications technical reports research reports consent forms ethics applications
    20. 20. What Do Funders Want? • timely release of data - once patents are filed or on (acceptance for) publication. • data shared openly - minimal or no restrictions if possible. • preservation of data - typically 5-10+ years if of long-term value. • data management plans. See: NIH Data Sharing Policy: https://grants.nih.gov/grants/policy/data_sharing/ NSF Data Sharing Policy: http://www.nsf.gov/bfa/dias/policy/dmp.jsp
    21. 21. Data Management Plan. What is it? Funding bodies require researchers to supply detailed, cost-effective plans for managing research data. These are called Data Management Plans. A DMP is a document which describes: • What research data will be created. • What policies (funding, institutional, legal) apply to the data. • What data management practices (backups, storage, access control, archiving) will be used. • What facilities and equipment are required (hard-disk space, backup server, repository). • Who will own the copyright and have access to the data. • How long-term preservation will be ensured after the original research is completed. The data management plan must be continuously maintained and kept up-to-date throughout the course of research.
    22. 22. Why do we need one? It improves your research both now and later... • Data is often valuable for a long time! • Results of your research may outlast your project. • Will you use your data throughout your career? • Prevents loss of digital data and records. • Prevents loss of usefulness through media and software obsolescence. • Prevents forgetting stuff! Good practice → Better research
    23. 23. Why do we need one? •Ensure research integrity (and repeatability) through keeping better records. •People can trace your outcomes from data collection, through research methodology, through to results. •Maximises usefulness of data to fellow researchers. •Highlights how data was collected, quality controls, how people can and should use it (access and licensing). •Facilitates data use within collaboration. •Can help lead to subsequent research papers.
    24. 24. Getting started with a DMP • Gain an understanding of terminology & issues. • Gain an understanding of your project/community – Supervisor and colleagues – People in your School, e.g. IT Officers, Research Coordinator/Administrator • Talk to your supervisor about data authorship, IP, licensing, policies. • Keep it practical and simple; don't spend too much time. Leave gaps for what you don't know, investigate, and fill them in later. • Remember it is never finished! Review it regularly through the course of your research. CDL’s DMP Tool: https://dmp.cdlib.org/ Cornell University RDM Services Group - Writing a DMP: https://confluence.cornell.edu/display/rdmsgweb/data-management-planning-overview
    25. 25. Questions?
    26. 26. Benefits of organising your data Research data files and folders need to be labelled and organised in a systematic way so that: •Data files are not accidentally overwritten or deleted •Data files are distinguishable from each other within their containing folder •Data file naming prevents confusion when multiple people are working on shared files •Data files are easier to locate and browse •Data files can be retrieved by both creator and by other users •Data files can be sorted in logical sequence •Different versions of data files can be identified •If data files are moved to other storage platforms their names will retain useful context
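One way to get file names that sort logically, show their version and keep useful context when moved is to build them from a fixed pattern. The sketch below is only an illustration: the pattern `project_description_date_vNN.ext` and the function name `build_filename` are assumptions made for this example, not a convention prescribed by the slides.

```python
import re
from datetime import date


def build_filename(project: str, description: str, created: date,
                   version: int, ext: str) -> str:
    """Compose a sortable, self-describing file name (hypothetical convention)."""

    def slug(text: str) -> str:
        # Lower-case the text and collapse anything that is not a letter
        # or digit into a single hyphen, so names are safe on most systems.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO dates (YYYY-MM-DD) and zero-padded versions sort correctly as text.
    return (f"{slug(project)}_{slug(description)}_"
            f"{created.isoformat()}_v{version:02d}.{ext.lstrip('.')}")


print(build_filename("Survey", "Wave 1 responses", date(2014, 3, 19), 2, ".csv"))
```

Zero-padding the version number means that `v02` sorts before `v10` in an ordinary alphabetical file listing.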
    27. 27. File Formats & Transformation • Files are based on either text or binary encoding. The former is both machine- and human-readable and the latter only readable by means of appropriate software. • Thus text files are less likely to become obsolete. Examples of file name extensions for these files are .txt, .csv and .por. • Be aware of the file formats your data exists in – Does this format require a specific type of software? – Can others access the data in this format? – Can alternative formats be used? • Using widely available or open formats maximises the chances of your data being stable and usable
    28. 28. File Formats & Transformation • When compressing your data files for storage or transportation you encode the information using fewer bits than the original representation. Commonly used tools are Zip and Tar (Tar itself only bundles files and is usually paired with a compressor such as gzip). • You may use the process of data normalisation. This means converting data from one format (e.g. proprietary) into another for use or preservation (e.g. ASCII). • If you convert or migrate your data files from one format to another, be aware of the potential risk of data loss or corruption and take appropriate steps to avoid/minimise it. • Watch out for backwards compatibility if software is upgraded
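The advice to check for data loss after compressing or converting a file can be illustrated with a round trip: compress, read the file back, and compare it with the original. This is a minimal sketch using Python's standard `zipfile` module; the function name `zip_and_verify` is an assumption for this example, and a real workflow would also need to handle whole folders.

```python
import zipfile
from pathlib import Path


def zip_and_verify(source: Path, archive: Path) -> bool:
    """Compress one file into a ZIP archive and confirm it restores byte-for-byte."""
    original = source.read_bytes()

    # Write the file into a new archive using DEFLATE compression.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(source, arcname=source.name)

    # Re-open the archive and read the member back.
    with zipfile.ZipFile(archive) as zf:
        # testzip() returns None when every member passes its CRC check.
        crc_ok = zf.testzip() is None
        restored = zf.read(source.name)

    return crc_ok and restored == original
```

Keeping the original until the check returns `True` avoids ending up with a damaged archive as the only copy.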
    29. 29. Exercise 1. Formatting your data
    30. 30. Documenting Data There are many reasons why you need to document your data: •To help you remember the details later •To help others understand your research •Verify your findings •Review your submitted publication •Replicate your results •Archive your data for access and re-use Some examples of data documentation are: •Laboratory notebooks •Field notes •Questionnaires
    31. 31. Documenting Data Research data need to be documented at various levels: •Project level •File or database level •Variable or item level The term metadata (‘data about data’) is often used. The importance of metadata lies in the potential for machine-to-machine interoperability to assist location and access to data through search interfaces.
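The three documentation levels (project, file, variable) can be captured in a small machine-readable record stored next to the data. The sketch below is illustrative only: the field names and sample values are invented for this example and do not follow DDI or any other formal metadata standard.

```python
import json

# Hypothetical field names and values, for illustration only.
metadata = {
    "project": {
        "title": "Example household survey",
        "creator": "A. Researcher",
        "collected": "2014-03",
    },
    "files": [
        {
            "name": "survey_wave-1_2014-03-19_v01.csv",
            "format": "CSV (comma-separated text)",
            "variables": [
                {"name": "hh_id", "label": "Household identifier",
                 "type": "integer"},
                {"name": "income", "label": "Annual household income",
                 "type": "float", "unit": "USD", "missing": -9},
            ],
        }
    ],
}


def undocumented_variables(meta, columns_by_file):
    """Return (file, column) pairs present in the data but absent from the metadata."""
    documented = {
        f["name"]: {v["name"] for v in f["variables"]} for f in meta["files"]
    }
    gaps = []
    for fname, cols in columns_by_file.items():
        known = documented.get(fname, set())
        gaps.extend((fname, col) for col in cols if col not in known)
    return gaps


# Serialise to JSON so both people and search tools can read the record.
print(json.dumps(metadata, indent=2))
```

A check such as `undocumented_variables` is one simple way to notice when a column has been added to a data file without a matching description.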
    32. 32. Secure data storage: For the purposes of integrity and efficiency it is important that research data is stored securely & backed up regularly via: • Networked drives • Fileservers managed by department / school / IT Dept. • Stored in single, secure, accessible place – regular back-ups. • Personal computers / laptops • Convenient, temporary storage - should not be used for storing master copies. • Local drives may fail & laptops may get lost/stolen.
    33. 33. • External storage devices • Hard drives, USB sticks, CDs, DVDs – low cost & portable BUT not recommended for long term storage. • Longevity not guaranteed – degradation over time. • Easily damaged or misplaced. • Not big enough for all research data – may need to use multiple discs/drives. • May pose a security threat. If USB sticks, DVDs, CDs are used for working data or extra back-up then: • Choose high quality products from reputable manufacturers. • Conduct regular checks to ensure media is not failing. • Periodically refresh data (i.e. copy to a new disc or drive). • Ensure confidential data is password protected / encrypted
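A common way to carry out regular checks on storage media is to record a checksum for every file and recompute the checksums later; a mismatch suggests the copy has degraded or been altered. The sketch below assumes SHA-256 and uses hypothetical function names (`build_manifest`, `changed_files`); the slides do not prescribe a particular tool.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose content no longer matches the manifest."""
    problems = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or file_digest(candidate) != expected:
            problems.append(rel)
    return problems
```

The manifest would be built when the data is first written to the disc or drive, stored separately, and compared against the media at each periodic check.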
    34. 34. • Remote or online back-up services – services that provide an online system for storing and backing-up computer files e.g. Dropbox, Mozy, Humyo, A-Drive • Allow users to store and sync data files online and between computers. • Employ cloud computing storage facilities (e.g. Amazon S3). • Business model – first few GBs free, pay for more space.
    35. 35. Backing-up Considerations for back-up policy: • Whether all data (full back-up), or only changed data will be backed-up (incremental back-up)? • How often full and incremental back-ups will be made? • How much hard-drive space or DVDs will be required to maintain this schedule? • If working with sensitive data, how will it be secured (and destroyed)? • What back-up services are available that meet these needs? • Who will be responsible for ensuring back-ups are available? Recommendation: Keep at least 3 copies of your data (e.g. original, external/local, and external/remote) and put in place a regular back-up procedure
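The difference between a full and an incremental back-up can be shown with a short script. This is a simplified sketch, not a replacement for a managed back-up service: the function name `backup` is an assumption, and the incremental mode here simply refreshes a mirror copy by skipping files whose content is unchanged, whereas dedicated tools usually store each increment separately.

```python
import filecmp
import shutil
from pathlib import Path


def backup(source: Path, target: Path, incremental: bool = True) -> list[str]:
    """Copy files from source to target and return the relative paths copied.

    incremental=False copies every file (full back-up);
    incremental=True copies only files that are new or whose content changed.
    """
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = target / rel
        # shallow=False compares file contents, not just size and timestamp.
        unchanged = dst.is_file() and filecmp.cmp(src, dst, shallow=False)
        if incremental and unchanged:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(rel.as_posix())
    return copied
```

Running the same function against a local drive and a remote location is one way to work towards the recommended three copies.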
    36. 36. Data Security The means of ensuring that data is kept safe from corruption and that access to it is suitably controlled. It is important to consider data security to prevent: • Accidental or malicious damage / modification to data. • Theft of valuable or irreplaceable data. • Breach of confidentiality agreements and privacy laws. • Release of data before it has been checked for accuracy and authenticity.
    37. 37. Exercise 2. Data storage and Security
    38. 38. Data Protection (also called data privacy) • In the US, there is no single, comprehensive federal (national) law regulating the collection and use of personal data. Instead, the US has a patchwork system of federal and state laws, and regulations that overlap, dovetail and may contradict one another. • The combination of an increase in cross-border data flow, together with the increased enactment of data protection statutes heightens the risk of privacy violations and creates a significant challenge for a data owner/distributor. Data protection is the relationship between: •collection and dissemination of data •technology •the public expectation of privacy and the legal and political issues surrounding them
    39. 39. Rights and access • Intellectual property rights (IPR) can be defined as rights acquired over any work created or invented with the intellectual effort of an individual. • Facts are not copyrightable but the structure of a database could be. • As a researcher, you should clarify ownership of and rights relating to research data before a project starts. This includes the right of access and the right to make copies. • Data licences determine the terms and conditions of use by another, and may accompany a purchase or subscription. • Open data licences attempt to “set data free” by minimising and standardising the terms and conditions of re-use. Conditions may include attribution, non-commercial use, no derivative works, or ‘share alike’.
    40. 40. Open Data Commons (ODC) have prepared a set of licences each with an accompanying statement which can be placed with your data on a webpage that points to your data. Open Data Commons: http://opendatacommons.org/
    41. 41. Benefits of Sharing Data • Scientific integrity – publishing & citing data in published research papers can allow others to replicate, validate, or correct results, thus improving the scientific record. • Publicly funded research - there is a growing movement for making publicly funded research available to the public. • Funding mandates - US Funding Agencies are increasingly mandating data sharing so as to avoid duplication of effort and save costs. • Preserve research data for researchers’ own future use.
    42. 42. Research Data MANTRA
    43. 43. Research Data MANTRA Partnership between: EDINA & Data Library, University of Edinburgh Institute for Academic Development Funded by JISC Managing Research Data Programme (Sept. 2010 – Aug. 2011) Aim was to develop online interactive open learning resources for PhD students and early career researchers that will: Raise awareness of the key issues related to research data management & contribute to culture change. Provide guidelines for good practice.
    44. 44. Online Learning Module Eight units with activities, scenarios and videos: • Research data explained • Data management plans • Organising data • File formats and transformation • Documentation and metadata • Storage and security • Data protection, rights and access • Preservation, sharing and licensing Four data handling practicals: SPSS, NVivo, R, ArcGIS Video stories from researchers in a variety of settings
    45. 45. Online Learning Module • Delivered online – self-paced, available ‘anytime, anyplace’ • Emphasis on practical experience and active engagement via online activities • One hour per unit • Read and work through scenarios & activities (incl. videos etc) • CC licence to allow manipulation of content for re-use with attribution • Portable content in open standard formats (e.g. SCORM) • Research data MANTRA course: http://datalib.edina.ac.uk/mantra
    46. 46. Questions?