Introduction to Data Management

1,300 views
1,128 views

Published on

Our regular Introduction to Data Management (DM) workshop (90-minutes). Covers very basic DM topics and concepts. Audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.

Published in: Education, Technology
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,300
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
70
Comments
0
Likes
5
Embeds 0
No embeds

No notes for slide
  • The reason we’re here is to explain what data management is, why we think it should be important to you, provide you with some things to think about in managing your data and provide some resources for managing your data.

    This is a first, introductory workshop…. Invite questions and comments throughout.

    [Note: Unless otherwise noted, all images in this presentation are from Stock Exchange (http://www.sxc.hu) and copyright is retained by their owners.]
  • As we talk with you today, be thinking about how what we are telling you works with your own research process.
  • Image credit: Science Magazine
  • Does not include, “any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples).” This narrow definition mostly takes a retrospective view of your dataset, in that it does not account for raw and intermediate that may be critical to the research process but that don’t become part of the ’final’ dataset.

    Data could be:
    Observational
    Experimental
    Simulated
    Derived



  • …”Data may be viewed as the lowest level of abstraction from which information and knowledge are derived.”
  • Data management is a verb – it involves intentional effort and activity.
    The main goals of DM are preservation and reuse, for you and for others.
    Covers all aspects of the data lifecycle from planning digital data capture methods, whittling down, ingestion to databases, providing for access and reuse, to transformation.

  • Requirements: funder, journal, etc.
    Further your field: speed the pace of discovery by enhancing discovery, access and reuse
    research & knowledge base develops more rapidly
    Increase your visibility & impact: share your data, get credit with data citation
    Increase dataset value: More likely to be re-visited and re-used in ways that may be different from their original intention
    Integrity: data (documentation will preserve integrity of data over time) & you (achieve ethics compliance [HIPAA] and enhance your reputation via transparency)
    Allows for validation of your research
    Efficiency: avoid digital archeology, avoid duplication of effort
    Allows easier exchange with other scientists
    Preserve your lifelong body of work!
  • If we agree that data that are better managed are more easily, rapidly, and effectively shared, then we can also agree that better data management can enable and speed up discoveries.
  • 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression.

    Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

    A [very] short bibliography on citation advantage:
    Henneken, E.A. & Accomazzi, A. Linking to Data - Effect on Citation Rates in Astronomy. (2011).
    Dorch, B. On the Citation Advantage of linking to data. (2012).
    Piwowar, H.A. & Vision, T.J. Data reuse and the open data citation advantage. PeerJ 1, e175 (2013)
    Sears, J.R. Data Sharing Effect on Article Citation Rate in Paleoceanography. AGU Fall Meet. Abstr. -1, 1628 (2011).
  • 22 February 2013: The Office of Science and Technology Policy in the White House released a memorandum about expanding pubic access to the results of federally funded research. In addition to scholarly publications, federal agencies are making serious efforts to increase the sharing of research data.

    All federal agencies with more than $100M in R&D expenditures are subject to this memo.

    http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

  • All of these federal agencies sponsor work being conducted at OSU. Red $$ are federal agencies funding the most work at OSU; amounts shown are funding to OSU in FY2011-2012. [source: OSU Research Office]
  • So, we’ve agreed on what “data” means, and what data management is. Let’s talk about the large variety of data that you may obtain or create. The types of data that you use and/or create have a large impact on most of your data management activities.
  • Most research uses multiple data types & formats. My own dissertation included the generation or use of all of the data types that we’ve discussed (which might help to explain why it took me 7 years to earn a Ph.D.). http://hdl.handle.net/1957/9088

    Image credit: Document by Piotrek Chuchla from The Noun Project
  • http://en.wikibooks.org/wiki/Statistics/Different_Types_of_Data/Quantitative_and_Qualitative_Data

    More on qualitative data: ”Although we may have categories, the categories may have a structure to them. When there is not a natural ordering of the categories, we call these nominal categories. Examples might be gender, race, religion, or sport.
    When the categories may be ordered, these are called ordinal variables. Categorical variables that judge size (small, medium, large, etc.) are ordinal variables. Attitudes (strongly disagree, disagree, neutral, agree, strongly agree) are also ordinal variables, however we may not know which value is the best or worst of these issues. Note that the distance between these categories is not something we can measure.”

    More on quantitative data: “However, not all numbers are continuous and measurable. For example, the social security number is a number, but not something that one can add or subtract.”
  • See more on geospatial analysis here: http://en.wikipedia.org/wiki/Geospatial_analysis
  • If you don’t know yet – make an educated guess.

    Why is this important? The data types that you are creating and/or using have implications for data documentation, storage and backup, data citation & ethics or reuse, and preservation and sharing (basically, all aspects of data management.
  • Before I move on – questions about anything so far?
  • Common data management areas are:
    Planning
    Storage & backup
    Organization & naming
    Documentation & metadata
    Legal & ethical considerations
    Sharing & reuse
    Archiving & preservation


    [image source: http://upload.wikimedia.org/wikipedia/commons/2/22/Earth_Western_Hemisphere_transparent_background.png]
  • Things to consider:
    How much data?
    Resources needed
    supplies and services
    costs associated with them
    Roles & responsibilities
    Metadata
    project level (documentation)
    data level (description)
    Data formats
    Data storage & backup
    working data vs. archive vs. shared/discoverable
    Ethics & consent
    Copyright (open data)
    Sharing
    How? With who?
  • Quote from: UE RDM Kit

    So, let’s review a few real-world scenarios.
  • Hard drive failures happen ALL THE TIME, without notice. There’s a one-liner that goes something like, “There are two kinds of people in this world: those who have had a hard drive fail, and those who haven’t had a had hard drive fail…yet.” (ha ha)

    Personal example: tell the story about Maria’s hard drive failure, and expensive (several $K) partial recovery of her dissertation data. She lost a month’s worth of work.

    image credit: http://www.johnytech.com/wp-content/uploads/2012/09/booterrors.jpg
  • Something else that happens ALL THE TIME: your computer get stolen (Wiley, post-research cruise theft example). Or, you crash your bike and your computer goes flying (Angel).

    image credit: http://www.flickr.com/photos/jeffreywarren/299422173/
  • Lesson: offsite, redundant storage is strongly encouraged.
  • From my post-doctoral work: Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion’s research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples and data servers). Luckily, we had the data server replicated at OSU, so we only lost the physical samples that were in Chile. Think it can’t happen to you? Yeah, I thought the same thing.

    [photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]
  • Not from my personal experience: from the DataONE Data Stories project: https://notebooks.dataone.org/data-stories/the-best-laid-schemes-of-backups-and-redundancy/

    Basic synopsis of this story:

    A researcher had made regular backups of her data, 5 GB worth, to an external drive. Her computer hard drive failed, so she sent it to IT, had it repaired, and went to restore her data using what was on her external backup drive. Unfortunately, only 30% of the data were accessible from the drive. “The only thing they could determine was that the software used to back up the data was not compatible with the recovery process.”

    Lesson: “As it turned out, creating backups was not enough to completely guard against tragedy; the only way to make sure those backups could serve their purpose in the event of an emergency was to check them regularly and ensure the data they contained were recoverable.”
  • From NECDMC Module 4: Backup allows you to restore your data in the event that it is lost.

    A backup strategy is a plan for ensuring the accessibility of research data during the life of a project. What your strategy will be depends on the amount of data you are working with, the frequency that your data changes, and the system requirements for storing and rendering it.
  • From NECDMC Module 4: Would you need just the data files themselves, the software that created them, or customized scripts written for data analysis? Depending on your research project, you may want to perform full, differential, or incremental backups.
  • Before we can talk about backing up your data, we need to address where you can store it. We will talk about the advantages, disadvantages and longevity of each option. Then, we’ll move on to backup.
  • Some content from: UE RDM Kit
  • USB flash drives, Compact Discs (CDs) and Digital Video Discs (DVDs)

    If you choose to use CDs, DVDs and USB flash drives (for example, for working data or extra backup copies), you should:
    Choose high quality products from reputable manufacturers.
    Follow the instructions provided by the manufacturer for care and handling, including environmental conditions and labeling.
    Regularly check the media to make sure that they are not failing
    periodically 'refresh' the data (that is, copy to a new disk or new USB flash drive); every 2-5 years
    Ensure that any private or confidential data is password-protected and/or encrypted.

    Some content from: UE RDM Kit
  • For example:
    Fileservers managed by your research group, department or college
    Fileservers managed by Information Services (IS)

    image credit: http://www.flickr.com/photos/ketmonkey/3329372324/
    Some content from: UE RDM Kit
  • For example: DropBox, SkyDrive, Box, Mozy,

    Image credit: Cloud by Benni from The Noun Project
    Some content from: UE RDM Kit
  • See more info here: http://oregonstate.edu/helpdocs/software/google-apps-osu

    Fully encrypted
  • Local storage options:
    Personal computer
    External storage
    USB drive
    Drobo box
    Network server
    PI or department server located near you

    Remote storage options:
    Network server
    Run by Information Services, likely cloud-based
    Cloud storage
    Google Apps Suite (Drive, Docs, Forms, etc.)
    up to 30GB of data
    Variable access levels
    Login portal: http://oregonstate.edu/main/online-services/google-apps-for-osu
  • Using a thoughtful, consistent file-naming strategy is the easiest way to make a big difference in organizing your files, and making them easy to find (not just by you, but by others in your group and your adviser).

    [image credit: PhDComics.com]
  • Tweet me @AWhitTwit with your #overlyhonestmethods!
  • Be consistent
    Have conventions for naming: (1) Directory structure
    (2) Folder names
    (3) File names
    Always include the same information (e.g. date and time)
    Retain the order of information (e.g. YYYYMMDD, not MMDDYYY )

    Be descriptive
    Try to keep file and folder names under 32 characters

    Within reason, Include relevant information such as:

    Unique identifier (ie. Project Name or Grant # in folder name)
    Project or research data name
    Conditions (Lab instrument, Solvent, Temperature, etc.)
    Run of experiment (sequential)
    Date (in file properties too)
    Use application-specific codes in 3-letter file extension and lowercase: mov, tif, wrl
    When using sequential numbering, make sure to use leading zeros to allow for multi-digit versions. For example, a sequence of 1-10 should be numbered 01-10; a sequence of 1-100 should be numbered 001-010-100.
    No special characters: & , * % # ; * ( ) ! @$ ^ ~ ' { } [ ] ? < > -
    Use only one period and before the file extension (e.g. name_paper.doc
    NOT name.paper.doc OR name_paper..doc)
  • Show examples of versions
    Can go back when you make mistakes
    when changes are made
    Share work with other people
    Both work on things at the same time and merge back together
    Akin to game of telephone- version control can let you see exactly when a change was made
  • See Git, TortoiseSVN, Mercurial, …
  • Before I move on – questions about anything so far?
  • http://orcid.org – get an identifier for free!

    Here’s Amanda’s ORCID page: http://orcid.org/0000-0003-2429-8879

    Thanks to Jackie Wirz at OHSU for finding this perfect example!
  • Thanks to Jackie Wirz at OHSU for finding this perfect example!
  • PROJECT LEVEL:
    Context of data collection
    Data collection methods
    Structure, organization of data files
    Data sources used (see citing data)
    Data validation, quality assurance
    Transformations of data from the raw data through analysis
    Information on confidentiality, access & use conditions

    DATA LEVEL:
    Variable names, and descriptions
    Explanation of codes and classification schemes used
    Algorithms used to transform data
    File format and software (including version) used
  • [add content about licensing]

    Intellectual Property
    Office for Commercialization & Corporate Development (OCCD)
    Copyright
    Licensing
    Charging for data?
    Data attribution & citation

    HUMAN SUBJECTS?
    Informed consent & anonymization required prior to publishing
    Resources @ OSU:
    Office of Research Integrity, Institutional Review Board (IRB)
    Responsible Conduct of Research (RCR) Program
  • “Learn it. Know it. Live it.”, Brad Hamilton, from Fast Times at Ridgemont High.
  • How to share data:
    Repository: discipline-specific, journal, etc.
    Disciplinary Repositories
    DataBib: http://databib.lib.purdue.edu/
    Open Access Directory: http://oad.simmons.edu/oadwiki/Data_repositories
    OSU's digital repository, ScholarsArchive@OSU
    Personal web site
    By request

    Best practices:
    Well documented, w/standard metadata schema
    Non-proprietary file formats
  • How to share data:
    Repository: discipline-specific, journal, etc.
    OSU's digital repository, ScholarsArchive@OSU
    http://ir.library.oregonstate.edu
    Personal web site
    By request

    Best practices:
    Well documented, w/standard metadata schema
    Non-proprietary file formats

    There are storage options for various disciplines including discipline related repositories. Some scientific journals also have their own data repositories. OSU has an institutional repository, Scholars Archive that serves as a repository for University Scholarship. Currently we store data sets on a case-by-case basis. If you don’t know where to store your data, you may consult with the library or talk to people in your department and/or your field to see if there is a logical repository for your work.

  • What is a data paper? http://guides.library.oregonstate.edu/data-management-data-papers-journals

    “Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited to a repository (which may have very minimal metadata requirements).”

    “Data papers thoroughly describe datasets, and do not usually include any interpretation or discussion (an exception may be discussion of different methods to collect the data, e.g.).”
  • Your data is being managed whether you realize it or not, but in order to preserve data and make the most of it, researchers should actively plan their data management process. A data management plan is a document that provides a clear roadmap of the process of collecting, storing, sharing and preserving your data. The earlier in the process a researcher starts in the data management process, not only makes it easier for her, but many repositories have requirements about how the data has been formatted etc. . . And it may not be possible to meet those requirements retroactively.


    [sample DMP text from: http://libguides.unm.edu/loader.php?type=d&id=194916]
  • These are the five sections for the generic NSF data management plan, but they cover what is generally discussed in any good data management plan.

    Things to consider:
    How much data?
    Resources needed
    supplies and services
    costs associated with them
    Roles & responsibilities
    Metadata
    project level (documentation)
    data level (description)
    Data formats
    Data storage & backup
    working data vs. archive vs. shared/discoverable
    Ethics & consent
    Sharing
    When?
    How?
    With who?

  • Create ready to use plans
    Meet agency requirements
    Step-by-step guidance
    OSU-specific guidance
  • Please feel free to contact Amanda Whitmire or Maura Valentino with any data management questions you might have or questions about depositing your research to ScholarsArchive@OSU digital repository, especially research data associated with published research articles and/or your thesis or dissertation.
  • As humans, we aren’t designed to interpret columns of numbers, or to be able to extract meaning from a table with a passing glance. As researchers and practitioners, we need to transform our data into something more meaningful.
  • We need to be the ambassadors of our data, to represent it in a way that speaks to why you made the effort to collect it and what you expect to learn from it. Sadly, most researchers, the large majority, never receive any training in how to share and communicate about their data, even to their peers.
  • Get the whole lifecycle graphic here: “Data management throughout the research lifecycle. Amanda Whitmire. figshare. http://dx.doi.org/10.6084/m9.figshare.774628”
  • OSU Research Data Services can help with data management during most of the research lifecycle phases. Examples include:
    Data management plan consultation
    Finding & accessing secondary data
    Data description & metadata
    Ways to share your data within your project team
    How to prepare data for archiving and where to put it
    How to license your datasets
    How to cite your datasets (& help others cite it properly)
  • Could be:
    Sensor output
    Audio visual recording
    Lab/field notes
    GPS coordinates
    Survey results
    Interview notes

    Characteristics of raw data:
    Raw units (volts, e.g.)
    No [extra] metadata
    No context

    How to manage raw data:
    Multiple copies in multiple locations
    DON'T modify them
  • Characteristics of intermediate data
    Data that are in-process
    Anything between ‘raw’ and ‘final’
    Multiple stages, multiple formats, using various software
    May be shareable w/colleagues, but not likely w/wider audience

    Managing intermediate data
    Multiple copies (of most recent versions) in multiple locations
    Use sensible folder hierarchies & file naming conventions
    Add metadata immediately & then additionally as you make changes

  • Characteristics of final data:
    May be a subset of raw data, or a much larger dataset
    Shareable with anyone
    Non-proprietary formats are ideal
    Metadata are thorough & complete

    Managing final data:
    Multiple copies in multiple locations
    Refresh data every 2-5 years
    If you use a proprietary format, migrate data to new version regularly
    Share via an open repository


    [image copyright: http://www.flickr.com/photos/usnavy/]
  • Image credit: Stock X.chnge
  • Image credit: OSU media services
  • For example, in climate science, HUGE amounts of data are compiled: http://www.realclimate.org/index.php/data-sources/

    In the example image, hydrographic data from all over the world are compiled into a database, and new variables can be derived from them. For example, depth profiles of seawater density from measurements of pressure, temperature, and salinity (via conductivity measurements).

    Image credit; http://cdiac.ornl.gov/oceans/RepeatSections/CDIAC_CLIVAR_global_map.jpg
  • Image credit: http://media.oregonlive.com/pacific-northwest-news/photo/gs61tsun103jpg-730687c0e8a35305.jpg

    This model of tsunami driftage is based on ocean topography and surface winds. See: http://iprc.soest.hawaii.edu/research/research.php, research by N. Maximenko.
  • Image credit: http://upload.wikimedia.org/wikipedia/commons/6/66/1930_census_Olsen_Schmidt.gif

  • Anticipate:
    Volume/File type(s)
    Raw data vs. processed/analyzed data
    Versioning
    File Naming Conventions
    Privacy Concerns
    Storage practice
    Backup plans; LOCKSS; checksums

    Storage needs will depend on the volume and type of data in question. A plan should anticipate the particular storage needs of your research. How much data do you expect to collect and how will storage infrastructure be funded?

    How will files be arranged on disk. This may seem like a trivial issue, but it’s important. If the person who has been responsible for maintaining files leaves, you need to be able to find things. You’ll also want to group files meaningfully with their related metadata.

    Data are, of course, subject to loss and corruption. (Bonus joke: there are two types of computer users… those who have suffered hard drive failure and those who haven’t suffered hard drive failure…yet). Will there be multiple copies—on-site and off-site? Are there checks against file corruption?

    Will you use specialized software or services to manage these aspects of the process?

    You should know which data you intend to retain locally and for how long. Who will be responsible for this after the main project period is over?

    Will you deposit in a preservation repository? Different repositories may have different preservation policies which you should be aware of.

    It’s also necessary to consider storage throughout the lifecycle. How will your data make it from the collection point, into your local storage, into backup, then into long term preservation? The systems involved may not be capable of interoperating without some planning at each stage.
  • Lesson: offsite, redundant storage is strongly encouraged.

    1 on your local workstation
    1 local/removable, such as external hard drive
    1 on central server and/or
    1 remote, such as on a cloud server*
    *Depending on the type of data, as cloud servers are not always secure

  • Following the Chile earthquake in 2010 (magnitude 8.8), a tsunami devastated the coastal town of Dichato, and moved the Universidad de Concepcion’s research vessel 1 km inland. The marine research station was left standing, but everything in it was destroyed (including our samples). Think it can’t happen to you? Yeah, I thought the same thing.

    [photo credit: http://www.ocean-partners.org/attachments/473_Being%20towed.JPG]
  • ~back to Amanda~

    NOTE: 15 October 2013: the software upgrade in this classroom to Office 2013 introduced a bug and subsequent lack of functionality for Colectica. Likewise, DataUP is currently throwing a bug for unknown reasons. I am in contact with both entities, but the bugs could not be fixed by today. Our apologies!

    DataUp is “an open source tool helping researchers document, manage, and archive their tabular data, DataUp operates within the scientist's workflow and integrates with Microsoft® Excel. http://dataup.cdlib.org/ (EML metdata)

    Colectica for Excel allows documenting of Variables, Code Lists, and Data Sets directly from within Microsoft Excel. Export your data documentation to an XML file in the DDI metadata format, the standard for data documentation. Open and edit it from Colectica Designer, Colectica Express, or other DDI applications. http://www.colectica.com/software/colecticaforexcel (DDI metadata)
  • Introduction to Data Management

    1. 1. Amanda L. Whitmire Steven Van Tuyl OSU Libraries
    2. 2. GOAL: Achievable habits for implementing data management best practices into your workflow 1
    3. 3. What are data?
    4. 4. “…the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.” Research data is: U.S. Office of Management and Budget, Circular A-110 3
    5. 5. “Unlike other types of information, research data are collected, observed, or created, for the purposes of analysis to produce and validate original research results.” University of Edinburgh MANTRA Research Data Management Training, ‘Research Data Explained’ Research data is:
    6. 6. Actions that contribute to effective storage, preservation and reuse of data and documentation throughout the research lifecycle. Data management: 5
    7. 7. 6 What data management is not: Data/computational science Database administration A research method: • what data to collect • how to collect them • how to design an experiment
    8. 8. Increase visibility & impact Protects investment Increases research efficiency Preservation Saves time Funding agency requirements Why data management? Further your field image credit: http://www.flickr.com/photos/karenchiang/5041048588/
    9. 9. Further your field
    10. 10. 85 cancer microarray clinical trial publications 69% increase in citations for articles w/publicly available data Increase visibility & impact
    11. 11. Funder mandates “…directed Federal agencies with more than $100M in R&D expenditures to develop plans to make the published results of federally funded research freely available to the public within one year of publication and requiring researchers to better account for and manage the digital data resulting from federally funded scientific research.”
    12. 12. Which agencies are affected? $54M $35M $23M
    13. 13. Aspects of data management DMPs/Planning Storage & backup File organization & naming Documentation & metadata Legal/ethical considerations Sharing & reuse Archiving & preservation
    14. 14. Data types & formats
    15. 15. Data types 14 Observational | Can’t be recaptured, recreated or replaced; Examples: sensor readings, sensory (human) observations, survey results Experimental | Should be reproducible, but can be expensive; Examples: gene sequences, chromatograms, spectroscopy, microscopy, cell counts Derived or compiled | Reproducible, but can be very expensive; Examples: text and data mining, compiled database, 3D models Simulation | Models and metadata, where the input can be more important than output data; Examples: climate models, economic models, biogeochemical models Reference/canonical | Static or organic collection [peer-reviewed] datasets, most probably published and/or curated; Examples: gene sequence databanks, chemical structures, census data, spatial data portals
    16. 16. Amanda’s dissertation Thespectralbackscatteringpropertiesofmarineparticles Observations ship-based sampling & moored instruments Simulation results scattering & absorption of light Experimental optical properties of phytoplankton cultures Derived variables endless things Compiled observations global oceanic bio- optical observations [self + from peers] Reference global oceanic bio- optical observations [NASA]
    17. 17. Data types: another classification Qualitative data “is a categorical measurement expressed not in terms of numbers, but rather by means of a natural language description. In statistics, it is often used interchangeably with "categorical" data.” See also: nominal, ordinal Source: Wikibooks: Statistics Quantitative data “is a numerical measurement expressed not by means of a natural language description, but rather in terms of numbers. However, not all numbers are continuous and measurable.” “My favorite color is blue-green.” vs. “My favorite color is 510 nm.”
    18. 18. More common data types Geospatial data has a geographical or geospatial aspect. Spatial location is critically tied to other variables. Digital image, audio & video data Documentation & scripts Sometimes, software code IS data; likewise with documentation (laboratory notebooks, written observations, etc.)
    19. 19. File Formats “A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium.” - Wikipedia Qualitative, tabular experimental data Data type Excel spreadsheet (.xlsx) Comma-delimited text (.csv) Access database (.mdb/,accdb) Google Spreadsheet SPSS portable file (.por) XML file Possible formats
    20. 20. What types & formats of data will you be generating and/or using? Observational | Experimental | Derived | Compiled | Simulation | Reference/Canonical Qualitative | Quantitative | Geospatial Image/audio/video | Scripts/codes Reflective Writing: 60 seconds
    21. 21. 20 ? ? ??
    22. 22. The “big picture” 21
    23. 23. Make a plan Where do you start? 22
    24. 24. Data storage & backup
    25. 25. “Your data are the life blood of your research. If you lose your data recovery could be slow, costly or even worse… it could be impossible.”
    26. 26. Most common loss scenario: drive failure
    27. 27. This happens a lot: physical theft & unintentional damage Cute, but not a valid security plan.
    28. 28. University of Southampton, School of Electronics and Computer Science, Southampton, UK, 2005
    29. 29. 28 It CAN happen to you
    30. 30. Real-world lesson: Audit your backups…
    31. 31. Data backup “Keeping backups is probably your most important data management task.” -Everyone 1. Some data backup is better than none. 2. Automated backups are better than manual. 3. Your data are only as safe as your last backup.
    32. 32. Data backup Original (working) External Local External Remote Best Practice: 3 Copies of datasets
    33. 33. Data storage options 1. Personal computers (PCs) & laptops 2. External storage devices 3. Networked Drives 4. Cloud servers
    34. 34. Storage: PC/laptop Advantages Convenient Disadvantages Drive failure common Laptops: susceptible to theft & unintentional damage Not replicated Bottom Line Do NOT use to store master copies of data Not a long term storage solution Back up important data & files regularly
    35. 35. Storage: external storage devices Advantages Convenient, cheap & portable Disadvantages Longevity not guaranteed (e.g. Zip disks) Errors writing to CD/DVD are common Easily damaged, misplaced or lost (=security risk) May not be big enough to hold all data; multiple drives needed Bottom Line Do NOT use to store master copies of data Not recommended for long-term storage
    36. 36. Storage: networked drives Advantages Data in single place, backed up regularly Replicated storage not vulnerable to loss due to hardware failure Secure storage minimizes risk of loss, theft, unauthorized use Available as needed (assuming network avail.) Disadvantages Cost may be prohibitive; export control Bottom Line Highly recommended for master copies of data Recommended for long-term storage (~5 years)
    37. 37. Storage: cloud storage Advantages Data in single place, backed up regularly Replicated storage not vulnerable to loss due to hardware failure Secure storage minimizes risk of loss, theft, unauthorized use Disadvantages Cost may be prohibitive Upload/download bottleneck & fees Longevity? Export control Bottom Line Possibly recommended for master copies of data Not recommended for in-process data, large files
    38. 38. Storage: Google Drive for OSU Advantages All same advantages of network & cloud storage File sharing & collaboration w/variable access levels Unlimited storage (GD), 30 GB non-GD Automatic version control on GD Disadvantages 30 GB may not be enough Upload/download bottleneck Bottom Line Possibly recommended for master copies of data Possibly not recommended for in-process data, large files
    39. 39. Data organization 38 AGU presentation Class presentation OS presentation Presentations Ocean Sciences Class AGU
    40. 40. Data storage options 39 Local Remote Computer External storage Network server Network server Cloud storage Google Apps Box, Dropbox, etc.
    41. 41. File-naming conventions
    42. 42. File naming strategy? 41
    43. 43. #OverlyHonestMethods 42
    44. 44. File naming conventions 43 Project_instrument_location_YYYYMMDDhhmmss_e xtra.ext Index/grant conditions Leading zero! s/n, variable Retain order
    45. 45. File naming strategies 44 Order by date: 1955-04-12_notes_MassObs.docx 1955-04-12_questionnaire_MassObs.pdf 1963-12-15_notes_Gorer.docx 1963-12-15_questionnaire_Gorer.pdf Order by subject: Gorer_notes_1963-12-15.docx Gorer_questionnaire_1963-12-15.pdf MassObs_notes_1955-04-12.docx MassObs_questionnaire_1955-04-12.pdf Order by type: Notes_Gorer_1963-12-15.docx Notes_MassObs_1955-04-12.docx Questionnaire_Gorer_1963-12-15.pdf Questionnaire_MassObs_1955-04-12.pdf Forced order with numbering: 01_MassObs_questionnaire_1955-04-12.pdf 02_MassObs_notes_1955-04-12.docx 03_Gorer_questionnaire_1963-12-15.pdf 04_Gorer_notes_1963-12-15.docx
    46. 46. Version control 45
    47. 47. Version control 46
    48. 48. 47 ? ? ??
    49. 49. Random suggestion 48
    50. 50. Disambiguate yourself 49 Open Researcher & Contributor ID John L. Campbell Forest Research Ecologist Oregon State University, Corvallis, OR John L. Campbell Forest Research Ecologist Center for Research on Ecosystem Change US Forest Service, Durham, NC
    51. 51. 50
    52. 52. Another Doppelgänger 51 Chris Langdon Studies ocean acidification OSU HMSC Chris Langdon Studies ocean acidification University of Miami
    53. 53. Metadata
    54. 54. What is metadata? • Data about data • Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. NISO, Understanding Metadata 53
    55. 55. Metadata 54 “The metadata accompanying your data should be written for a user 20 years into the future -- what does that person need to know to use your data properly? Prepare the metadata for a user who is unfamiliar with your project, methods, or observations.” Oak Ridge National Laboratory Distributed Active Archive Center for Biogeochemical Dynamics (ORNL DAAC)
    56. 56. Metadata in real life You use it all the time…
    57. 57. Darwin Core | biological diversity, taxonomy Dublin Core | general DDI (Data Documentation Initiative) | social & behavioral sciences DIF (Directory Interchange Format) | environmental sciences EML (Ecological Metadata Language) | ecology, biology ISO 19115 | geographic data Major metadata standards 01101101 01100101 01110100 01100001 01100100 01100001 01110100 01100001
    58. 58. Metadata examples Santa Barbara Coastal Long Term Ecological Research (LTER) web link Bureau of Labor Statistics, Consumer Price Index, 1913-1992 web link 57
    59. 59. Legal & ethical considerations 58
    60. 60. Data sharing & reuse “...digitally formatted scientific data resulting from unclassified research supported wholly or in part by Federal funding should be stored and publicly accessible to search, retrieve, and analyze.” Office of Science and Technology Policy The White House 59
    61. 61. How to preserve & share data 60
    62. 62. How to preserve & share data 61
    63. 63. Data journals 62
    64. 64. Data management plans 63
    65. 65. What is a data management plan? 64 …Physical specimens consist of carbonate biothems and filtered water samples are stored in Spilde or Northup’s labs until completion of analyses. At this time they are returned to the federal agency for museum curation or destroyed during analysis. Field notes will be scanned into pdf files, with a copy sent to Carlsbad Caverns National Park or other federal cave manager. SEM images will be saved in the tif format. …
    66. 66. Sections of a data management plan 65 1. Types of data 2. Data & metadata standards 3. Archiving & preservation 4. Sharing (access provisions) 5. Transition from collection to reuse See resources on your handout + use DMPTool
    67. 67. | Visit DMPTool Your go-to resource for data management plans 66
    68. 68. Contact information 67 Amanda Whitmire | Data Management Specialist amanda.whitmire@oregonstate.edu Steve Van Tuyl| Data and Digital Repository Librarian steve.vantuyl@oregonstate.edu http://bit.ly/OSUData
    69. 69. end 68
    70. 70. 69 Extra / Outdated Slides
    71. 71. Data does not speak for itself…
    72. 72. YOU speak for YOUR data
    73. 73. But first, you need to manage it 72
    74. 74. 73 Data management in the lifecycle
    75. 75. How can OSU Libraries help? 74 ✔ ✔ ✔ ✔✔ ✔
    76. 76. Copyright – who owns the data? Copyright applies to text, images, recordings, videos etc. in a digital form, in the same way that it would to analog versions of those works. The code of computer programs (both the human readable source code and the machine readable object code) is protected by copyright as a literary work. Data compilations such as datasets and databases can be protected by copyright in the literary works category, which includes ‘tables’ or ‘compilations’. A table or compilation, consisting of words, figures or symbols (or a combination of these) is protected if it is a literary work and has the required degree of originality. 75
    77. 77. Types, formats & stages of data 76 Raw data
    78. 78. 77 Types, formats & stages of data Intermediate data
    79. 79. Types, formats & stages of data 78 Final data
    80. 80. Data Types: Observational • Captured in situ • Can’t be recaptured, recreated or replaced Examples: Sensor readings, sensory (human) observations, survey results
    81. 81. Data Types: Experimental • Data collected under controlled conditions, in situ or laboratory-based • Should be reproducible, but can be expensive Examples: gene sequences, chromatograms, spectroscopy, microscopy, cell counts
    82. 82. Data Types: Derived or Compiled • Compiled: integration of data from several sources (could be many types) • Derived: new variables created from existing data • Reproducible, but can be very expensive Examples: text and data mining, compiled database, 3D models
    83. 83. Data Types: Simulation • Results from using a model to study the behavior and performance of an actual or theoretical system • Models and metadata, where the input can be more important than output data Examples: climate models, economic models, biogeochemical models
    84. 84. Data Types: Reference/Canonical Static or organic collection [peer-reviewed] datasets, most probably published and/or curated. Examples: gene sequence databanks, chemical structures, census data, spatial data portals
    85. 85. Data storage & preservation 84
    86. 86. About backups… 85 University of Southampton, School of Electronics and Computer Science Southampton, UK, 2005 Do them.
    87. 87. Plan for unexpected events 86
    88. 88. Metadata demonstration 87 Colectica

    ×