Have you ever collected data and had trouble remembering what you did at the start? Tried to share your data with someone and they (or you) couldn’t understand it? Using “Best Practices” when you collect and record your data will improve future usability and may save time. Following these best practices (guidelines) when preparing your data will improve its usability, by you or by others, and make it easier to use with other data.
Spreadsheets are widely used for simple analyses. They are easy to use, BUT they allow (even encourage) users to structure data in ways that are hard to use with other software. You can use them like Word, with columns. Spreadsheets in this format are good for “human” interpretation, not for computers. Since you will probably need to either write a program or use a software package, the “human” format is not the best choice. These formats are good for presenting your findings, such as in a publication, but they will be harder to use with other software later on (if you need to do any analysis). It is better to store the data so that it can be used in automated ways, with minimal human intervention.
These are well data measurements, where a salinity meter was used to measure the salinity (top and bottom) and the conductivity (top and bottom). Take a look at this spreadsheet… What’s wrong with it? Could this be easily automated? Sorted?
• Dates are not stored consistently
  – Sometimes the date is stored with a label (e.g., “Date:5/23/2005”), sometimes in its own cell (10/2/2005)
• Values are labeled inconsistently
  – Sometimes “Conductivity Top”, other times “conductivity_top”
  – For salinity, sometimes two cells are used for top and bottom; in others they are combined in one cell
• Data coding is inconsistent
  – Sometimes YSI_Model_30, sometimes “YSI Model 30”; you sort of can’t tell whether it’s a “label” or a data value
  – Tide State is sometimes a text description, sometimes a number
• The order of values in the “mini-table” for a given sampling date is different
  – “Meter Type” comes first in the 5/23 table and second in the 10/2 table
• Confusion between numbers and text
  – For most software, 39% or <30 are considered TEXT, not numbers (what is the average of 349 and <30?)
• Different types of data are stored in the same columns
  – Many software products require that a single column contain either TEXT or NUMBERS (but not both!)
• The spreadsheet loses interpretability if it is sorted
  – Dates are related to a set of attributes only by their position in the file. Once sorted, that relationship is lost.
Not sure why you would sort this.
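The question in the notes above (what is the average of 349 and <30?) can be made concrete. This is a minimal Python sketch, assuming a hypothetical list of raw cell strings; the helper name `to_number` is made up for illustration and is not part of any package.

```python
def to_number(cell):
    """Return the cell as a float, or None if it is not a plain number."""
    try:
        return float(cell)
    except ValueError:
        return None

# Hypothetical raw cells from one spreadsheet column
cells = ["349", "<30", "39%", "12.5"]

parsed = [to_number(c) for c in cells]
numeric = [v for v in parsed if v is not None]
rejected = [c for c, v in zip(cells, parsed) if v is None]

# The mean can only be taken over the cells that really are numbers
mean = sum(numeric) / len(numeric)
print("numeric:", numeric)
print("treated as text:", rejected)
```

Cells such as “<30” and “39%” end up in `rejected`, which is the signal that the qualifier or unit probably belongs in its own column.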
The spreadsheet loses interpretability if it is sorted. Dates are related to a set of attributes only by their position in the file. Once sorted, that relationship is lost. Look what happens when we sort this… Look at the difference in this one… sort it.
https://docs.google.com/spreadsheet/ccc?key=0Att-cHR6O7gCdEZ2NzRhUWFLYy1nM2FMcDhaNGRVeWc
https://docs.google.com/spreadsheet/ccc?key=0Att-cHR6O7gCdHpTMC1kdWREbTNlanBwM3J5WVE3ZFE
The standard convention for many software programs (usually a yes/no “check” option) is for the 1st line (record) to be a header line that lists the names of the variables in the file. The rest of the records (lines) are data. Keep the names from getting too long: some software programs may not work with long variable names.
We’ve seen that a spreadsheet or word processor can create datasets that can only be interpreted with human intervention. The “ugly spreadsheet” example would be hard to analyze even in a spreadsheet, except with lots of case-by-case human decisions. But what are some principles that characterize good archival data? Keep in mind that good data formats for archiving and sharing may not be the ones you prefer for viewing or analysis! Use the same naming convention for text data: pick a vocabulary and keep to it, e.g. “slack-high”, not “slack high”.
There are already standards for certain types of data (like date/time, spatial coords). Use them, don’t invent your own. Can you think of others?
Date and time: am/pm is NOT allowed, and T appears literally in the string. The minimum for a date is YYYY.
YYYY = four-digit year
MM = two-digit month (01=January, etc.)
DD = two-digit day of month (01 through 31)
hh = two digits of hour (00 through 23)
mm = two digits of minute (00 through 59)
ss = two digits of second (00 through 59)
s = one or more digits representing a decimal fraction of a second
TZD = time zone designator (Z or +hh:mm or -hh:mm)
Spatial coordinates: decimal degrees vs. DMS (degrees, minutes, seconds). A standard is important when a data field could have more than one type of unit.
Guidelines for file names will only help you with your own files/research. Once files are “archived” they will get new names that fit the archive’s systems, usually a permanent name based on how the computer “locates” the file. Look at the file names: Context.txt, DataFile1.txt, DataFile2.txt, word6doc.zip. Long ones… Safari, Ray… good date, place. Note the “_” and “-”. Think about how the name will look in a directory with lots of other files; you want to be able to “pick it out”.
File names are the easiest way to indicate the contents of a file; make them terse but indicative of the content. You want to uniquely identify the data file. Don’t make them too long; some scripting programs have a file name length limit for file importing (reading). Don’t use blanks; some software may not be able to read file names with blanks.
Maybe use version numbers… Don’t forget the extension (3 characters), which tells the file type.
Data quality control takes place at various stages: during data collection, data entry, and data checking. The quality of the collection methods has a significant bearing on the quality of the data.
Quality includes:
• Equipment calibration (use instrument calibration to check precision); this allows other researchers to look at your data and compare it to theirs
• Validating transcription
• Training coders (different people may be doing this); create a handbook
• Creating (programming) data entry interfaces that verify data entry and use lists to choose from; Visual Basic can create forms for Excel, and Access has form creation
• Minimizing manual entry
• Verification: out-of-range values, random samples of the data, double-checking entries, consistency checks
• Double entry: each record is keyed in and then re-keyed against the original. Several standard packages offer this feature. In the re-entry process, the program catches discrepancies immediately.
Start before data collection: define standards and document them in the handbook.
You don’t want to change (or delete) something that could be important later. If you use a scripted language, you can re-run the analyses.
“Scripted” analysis software: R, SAS, SPSS, Matlab. Analysis scripts are written records of the various steps involved in processing and analyzing data (sort of “analytical metadata”). They are easily revised and re-executed at any time if you need to modify the analysis. VS. a GUI, which is easier but does not leave a clear accounting of exactly what you have done. Document scripted code with comments on why data is being changed.
Important to repeat!!!! More documentation. Documentation can also be called metadata:
• Description of the data file names (especially if using acronyms and abbreviations)
• Why you are collecting the data
• Details of methods of analysis
• Names of all data and analysis files
• Definitions for data (include coding keys)
• Missing value codes
• Units of measure
• Structured metadata (XML): format standards for your discipline (Ecological Metadata Language, EML)
Documentation can also be called metadata:
• Description of the data file names (especially if using acronyms and abbreviations)
• Why you are collecting the data
• Details of methods of analysis
• Names of all data and analysis files
• Definitions for data (include coding keys)
• Missing value codes
• Units of measure
• Calibrations, so others can compare their results with yours
• Structured metadata (XML): format standards for your discipline (Ecological Metadata Language, EML)
Spreadsheets are widely used for simple analyses, but they have poor archival qualities: different versions over time are not compatible, and formulas are hard to capture or display. Plan what type of data you will be collecting. You want to choose a file format that can be read well into the future and is independent of software changes. These are formats more likely to be accessible in the future. Otherwise you face replacing old media and maintaining devices that can still read the proprietary formats or media types. The format of the file is a major factor in the ability to use the data in the future. As technology changes, plan for software and hardware obsolescence. System files (SAS, SPSS) are compact and efficient, but not very portable. Use software to “export” data to a portable (or transport) file. Convert proprietary formats to non-proprietary ones. Check for data errors in conversion.
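The advice to export to a portable format and then check for errors in conversion can be sketched in Python with the standard `csv` module. This is only an illustration under stated assumptions: the rows and column names are hypothetical, and an in-memory buffer stands in for a real export file.

```python
import csv
import io

# Hypothetical records, one complete row per sampling date
rows = [
    {"date": "2005-05-23", "salinity_top": "30.1", "tide_state": "slack-high"},
    {"date": "2005-10-02", "salinity_top": "28.7", "tide_state": "ebb"},
]

# Export to plain-text CSV (a buffer is used here instead of a file on disk)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "salinity_top", "tide_state"])
writer.writeheader()
writer.writerows(rows)

# Read the export back and compare it with the source records
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
conversion_ok = round_trip == rows
print("conversion preserved all values:", conversion_ok)
```

Comparing the re-read rows against the source is one simple way to catch values that were changed or dropped during conversion.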
Examples of preferred format choices: formats for long-term digital preservation (open). Don’t expect that you (you won’t have time) or the archive will be able to convert older formats to new ones.
1. Remember: create the spreadsheet so it can be automated
2. Date/time standards, geospatial coords, species, other standards from your discipline
3. Descriptive file names: file names can help identify what’s inside
4. Quality assurance: when planning data entry you can “program” data checks in forms (Access and Excel), create pick lists (codes), define missing data values
5. Scripted analysis makes it easier to replicate data transformations, and the steps can be documented
6. Document EVERYTHING: dataset details, database details, collection notes (conditions). You will not remember everything 20 years from now! Record what someone would need to know about your data to use it.
7. Stable file formats: it would be easier if all files were the same format; also know which formats are better in the long term
Best Practices
Creating Research Data
Sherry Lake
July 31, 2012
University of Florida Data Management Workshop
WHY?
Following these Best Practices…
• Will improve the usability of the data by you or by others
• Your data will be “computer ready”
• Your data will be ready to share with others
Problems
• Dates are not stored consistently
• Values are labeled inconsistently
• Data coding is inconsistent
• Order of values is different
Problems
• Confusion between numbers and text
• Different types of data are stored in the same columns
• The spreadsheet loses interpretability if it is sorted
Best Practices Data Organization
• Lines or rows of data should be complete
  – Designed to be machine readable, not human readable (sort)
Best Practices Data Organization
• Include a Header Line 1st line (or record)
• Label each Column with a short but descriptive name
  – Names should be unique
  – Use letters, numbers, or “_” (underscore)
  – Do not include blank spaces or symbols (+ - & ^ *)
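The column-naming rules on this slide are mechanical enough to check with a short script. A minimal Python sketch follows; the function name `check_header` is hypothetical, and the 32-character limit is an assumed stand-in for “short”, since the slide gives no number.

```python
import re

# Letters, numbers, or underscore only
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")

def check_header(names, max_length=32):
    """Return a list of problems found in a header line (empty if none)."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems.append(f"{name!r}: use only letters, numbers, or '_'")
        if len(name) > max_length:
            problems.append(f"{name!r}: longer than {max_length} characters")
        if name in seen:
            problems.append(f"{name!r}: duplicate column name")
        seen.add(name)
    return problems

# Hypothetical header with one illegal name and one repeated name
header = ["date", "salinity_top", "Conductivity Top", "salinity_top"]
issues = check_header(header)
for issue in issues:
    print(issue)
```

Running a check like this before sharing a file catches blanks, symbols, and duplicate names that other software may reject.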
Best Practices Data Organization
• Columns of data should be consistent
  – Use the same naming convention for text data
• Columns should include only a single kind of data
  – Text or “string” data
  – Integer numbers
  – Floating point or real numbers
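Keeping one naming convention for text data (always “slack-high”, never “slack high”) can be enforced with a small normalization step. This is a sketch only: apart from “slack-high”, the vocabulary below is hypothetical, and `normalize_code` is an illustrative helper rather than an existing function.

```python
# Assumed controlled vocabulary; only "slack-high" comes from the example spreadsheet
TIDE_STATES = {"slack-high", "slack-low", "ebb", "flood"}

def normalize_code(text):
    """Lower-case the value and join its words with hyphens."""
    return "-".join(text.strip().lower().split())

# Hypothetical entries as typed by different people
entries = ["slack high", "Slack-High", " ebb ", "3"]

normalized = [normalize_code(e) for e in entries]
unknown = [n for n in normalized if n not in TIDE_STATES]
print("normalized:", normalized)
print("not in vocabulary:", unknown)
```

Values that remain in `unknown` (here a bare number) need a human decision; the script only handles differences in spacing and capitalization.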
Use Standardized Formats
• ISO 8601 Standard for Date and Time
  – YYYY-MM-DDThh:mm:ss.sTZD
    2009-10-13T09:12:34.9Z
    2009-10-13T09:12:34.9+05:00
• Spatial Coordinates for Latitude/Longitude
  – +/- DD.DDDDD
    -78.476 (longitude)
    +38.029 (latitude)
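Most scripting languages can write and read ISO 8601 date-times directly, so there is rarely a need to build the string by hand. A minimal Python sketch using the standard `datetime` module, with the slide’s example instant (13 October 2009, 09:12:34.9) as input:

```python
from datetime import datetime, timedelta, timezone

# The example instant in UTC
utc_time = datetime(2009, 10, 13, 9, 12, 34, 900000, tzinfo=timezone.utc)
utc_text = utc_time.isoformat()

# The same clock reading with a +05:00 time zone designator
plus_five = timezone(timedelta(hours=5))
offset_time = datetime(2009, 10, 13, 9, 12, 34, 900000, tzinfo=plus_five)
offset_text = offset_time.isoformat()

# Reading the string back recovers the same instant, offset included
parsed = datetime.fromisoformat(offset_text)

print(utc_text)
print(offset_text)
```

Note that Python writes UTC as `+00:00` rather than `Z` and pads the fraction to six digits; both forms are valid time zone designators and fractions under the pattern above.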
File Names
• Use descriptive names
• Not too long
• Don’t use spaces
• Try to include time, place & theme
• May use “-” or “_”
File Names
• String words together with Caps (VegBiodiv_2007)
• Think about using version numbers
• Don’t change default extensions (txt, jpg, csv,…)
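One way to apply these file-name rules consistently is to build names from their parts with a small helper. The sketch below is illustrative only: `make_file_name`, the place “North Plot”, and the `v01` version style are assumptions, not a convention the slides prescribe.

```python
import re

def make_file_name(theme, place, year, version, extension="csv"):
    """Build a name such as VegBiodiv_NorthPlot_2007_v01.csv from its parts."""
    parts = [theme, place, str(year), f"v{version:02d}"]
    # Drop anything that is not a letter or digit, so no spaces or symbols remain
    cleaned = [re.sub(r"[^A-Za-z0-9]", "", part) for part in parts]
    return "_".join(cleaned) + "." + extension

# Theme from the slide example; place and version are hypothetical
name = make_file_name("VegBiodiv", "North Plot", 2007, 1)
print(name)
```

The result keeps theme, place, and time in a fixed order, which makes files sort predictably in a crowded directory.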
Quality Assurance/Control
Dataset Creation & Integrity Errors
• Use a data entry program
  – Program to catch typing errors
  – Program pull-down menu options
• Perform double entry of the data
• Manually check 5 – 10% of data records
• Check for out-of-range values (plotting)
• Check for missing or impossible values
• Perform statistical summaries (random samples)
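Two of the checks listed here, out-of-range or missing values and double entry, are easy to script. The following Python sketch assumes hypothetical records and an assumed valid salinity range of 0 to 45; both function names are made up for illustration.

```python
def out_of_range(records, field, low, high):
    """Return row numbers whose value is missing or outside [low, high]."""
    flagged = []
    for row_number, record in enumerate(records, start=1):
        value = record.get(field)
        if value is None or not (low <= value <= high):
            flagged.append(row_number)
    return flagged

def double_entry_mismatches(first_pass, second_pass):
    """Return row numbers where two independent keyings disagree."""
    return [
        row_number
        for row_number, (a, b) in enumerate(zip(first_pass, second_pass), start=1)
        if a != b
    ]

# Hypothetical data keyed in twice; row 2 has a misplaced decimal in the first pass
first = [{"salinity_top": 30.1}, {"salinity_top": 301.0}, {"salinity_top": None}]
second = [{"salinity_top": 30.1}, {"salinity_top": 30.1}, {"salinity_top": None}]

bad_rows = out_of_range(first, "salinity_top", 0.0, 45.0)
mismatches = double_entry_mismatches(first, second)
print("out of range or missing:", bad_rows)
print("double-entry mismatches:", mismatches)
```

The comparison assumes both passes contain the same number of rows in the same order; a length check would be a sensible addition for real data.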
Analyzing Data - Notes
• Keep Original File
  – Uncorrected copy
  – Make “read-only”
• Make notes on transformations
• Any changes, save as a new file
• Use scripted code to transform and correct data
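Keeping the original read-only and saving changes to a new file can itself be scripted. A minimal Python sketch, assuming a hypothetical file `wells_raw.csv` created in a temporary directory and a `_v02` suffix for the working copy:

```python
import os
import shutil
import stat
import tempfile

def protect_original(path):
    """Remove write permission so the raw file cannot be edited by accident."""
    os.chmod(path, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)

def working_copy(original, suffix="_v02"):
    """Copy the original to a new, writable file and return the new path."""
    base, extension = os.path.splitext(original)
    new_path = base + suffix + extension
    # copyfile copies contents only, so the copy does not inherit read-only mode
    shutil.copyfile(original, new_path)
    return new_path

# Hypothetical raw file, written to a temporary directory for the demonstration
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "wells_raw.csv")
with open(raw, "w") as handle:
    handle.write("date,salinity_top\n2005-05-23,30.1\n")

protect_original(raw)
copy_path = working_copy(raw)
print("edit this file instead:", copy_path)
```

All corrections then go into the copy, ideally through a script, while the uncorrected original stays untouched.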
Analyzing Data
• Use a scripted program (R, SAS, SPSS, Matlab)
  – Steps are recorded in textual format
  – Can be easily revised and re-executed
  – Helps sharing and repetition
  – Easy to document
• GUI-based analysis may be easier, but harder to reproduce
Document EVERYTHING!
• Create a Project Document File
  – More than a Lab Notebook
  – Data Management Plan
• Start at the beginning of the project and continue throughout data collection & analysis
  – Why you are collecting data
  – Exact details of methods of collecting & analyzing
Document EVERYTHING!
• Details such as:
  – Names of data & analysis files associated with study
  – Definitions for data and codes (include missing value codes, names)
  – Units of measure (accuracy and precision)
  – Standards or instrument calibrations
Choosing File Formats
• Accessible Data (in the future)
  – Non-proprietary (software formats)
  – Open, documented standard
  – Common, used by the research community
  – Standard representation (ASCII, Unicode)
  – Unencrypted & Uncompressed
  – Media formats (hardware formats)
Preferred Format Choices
• PDF, not Word
• ASCII, not Excel
• MPEG-4, not Quicktime
• TIFF or JPEG2000, not GIF or JPG
• XML or RDF, not RDBMS
Good if not software specific
Best Practices
1. Use Consistent Data Organization
2. Use Standardized Formats
3. Assign Descriptive File Names
4. Perform Basic Quality Assurance/Quality Control
5. Use Scripted Program for Analysis and Keep Notes
6. Document EVERYTHING! (Define Contents of Data Files)
7. Use Consistent, Stable and Open File Formats
Questions? Discussion?
• Sherry Lake, Senior Scientific Data Consultant, UVA Library
• firstname.lastname@example.org
• Twitter: shlakeuva
• Slideshare: http://www.slideshare.net/shlake
• Web: http://www.lib.virginia.edu/brown/data