Elements of Data Documentation
Adam Mack
Education and Human Development Incubator (EHDi)
Social Science Research Institute
October 1, 2015
Why Is Documentation Important?
• Describe the contents of the data
• Explain context in which data was collected
• Explain any manipulations performed on the
data
• Allow research data to be understood by
people outside of the original project
Do I Need to Document?
Back in the day… … and now.
Research:
Consequences of Insufficient
Documentation
• Data may be unusable
• Users may make inaccurate assumptions about the data
– Manipulations performed on data may affect
results of analyses
– May be unclear how to interpret contents of a
variable
Consequences of Insufficient Documentation:
Example
• Assume each of the following prompts is answered
on a 1–5 agreement scale.
– Data management is great. (dmgreat) 5
– Data management is the greatest! (dmgrtst) 5
– I don’t like data management. (dmnolike) 1
• dmnolike needs to be reverse scored (1 becomes 5) before a
scale score can be calculated from the variables.
• You can recode this value within the same variable,
but should you?
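One option, sketched here rather than prescribed, is to leave the original variable untouched and write the reversed value to a new variable. The name dmnolike_r and the helper reverse_score are hypothetical and do not come from the slides.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # the 1-5 agreement scale from the example


def reverse_score(value, low=SCALE_MIN, high=SCALE_MAX):
    """Flip a response on a low..high scale: 1<->5, 2<->4, 3 stays 3."""
    if value is None:  # leave missing responses missing
        return None
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the {low}-{high} scale")
    return low + high - value


response = {"dmgreat": 5, "dmgrtst": 5, "dmnolike": 1}

# Store the reversed value under a new (hypothetical) name, so the raw
# answer is still available and the codebook can describe both variables.
response["dmnolike_r"] = reverse_score(response["dmnolike"])

scale_items = ["dmgreat", "dmgrtst", "dmnolike_r"]
scale_score = sum(response[name] for name in scale_items) / len(scale_items)
print(response["dmnolike"], response["dmnolike_r"], scale_score)
```

Recoding in place would give the same scale score, but a later user could no longer tell from the data alone whether dmnolike had already been reversed.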
Elements of Data Documentation
• What are the most important elements to
document?
Elements of Data Documentation
• What are the most important elements to
document?
– Data elements
– Study elements
– Processes and decisions
Elements of Data Documentation
• Who will be using the documentation?
– Data managers
– Statisticians
– Researchers
– Outside users
Elements of Data Documentation
• When should documentation be created?
– Often, projects wait until data has been collected
before creating documentation such as
codebooks.
– Creating documentation early in the project has
numerous advantages.
Elements of Data Documentation
• How should these elements be documented?
Potential forms that documentation may take
include:
– Codebook
– Annotated version of instrument
– More descriptive, less structured forms of
documentation (data narratives)
Data-Level Documentation
• What are the most important elements to
document?
– Data elements
– Study elements
– Processes and decisions
Data-Level Documentation
Should include basic information needed to use
the data, including:
• Structural information about variable
– Name of variable
– Label (if applicable)
– Type of variable (numeric or character)
– Length of variable
Data-Level Documentation
• Information describing variable contents
– Question text (or text description of variable
contents)
– Valid values
– Coding of values
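The structural and content fields listed above could be held together in one record per variable. The sketch below is only an illustration: the class CodebookEntry, the wording of the value labels, and the length of 8 are assumptions, not part of the original material.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """One codebook row: structural fields plus a description of contents."""
    name: str
    label: str
    var_type: str   # "numeric" or "character"
    length: int
    question_text: str
    value_codes: dict = field(default_factory=dict)  # valid value -> meaning

    def invalid_values(self, observed):
        """Return observed values that are not valid codes (None = missing)."""
        return [v for v in observed
                if v is not None and v not in self.value_codes]


# Hypothetical labels for a 1-5 agreement scale.
agreement = {1: "Strongly disagree", 2: "Disagree", 3: "Neutral",
             4: "Agree", 5: "Strongly agree"}

entry = CodebookEntry(
    name="dmgreat",
    label="Data management is great",
    var_type="numeric",
    length=8,  # assumed storage length
    question_text="Data management is great.",
    value_codes=agreement,
)

bad = entry.invalid_values([5, 3, None, 9, 0])
print(bad)
```

Keeping the valid values next to the variable definition means the same record can drive both the printed codebook and a basic range check.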
Data-Level Documentation
• Scales/derived variables
– Algorithm used to create variable
– Procedures for handling missing data
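As an illustration of writing down both the algorithm and the missing-data procedure, here is a minimal sketch of a mean scale score that is only computed when enough items were answered. The function scale_mean, the item names, and the "at least 2 of 3 items" threshold are hypothetical; the real rule would come from the measure's scoring instructions.

```python
def scale_mean(responses, items, min_answered):
    """Mean of the answered items, or None if too few were answered.

    responses: dict of item name -> value (None means missing)
    items: item names that make up the scale
    min_answered: smallest number of non-missing items required
    """
    answered = [responses[i] for i in items if responses.get(i) is not None]
    if len(answered) < min_answered:
        return None  # scale score is set to missing
    return sum(answered) / len(answered)


items = ["item1", "item2", "item3"]
partial = {"item1": 4, "item2": None, "item3": 2}

lenient = scale_mean(partial, items, min_answered=2)
strict = scale_mean(partial, items, min_answered=3)
print(lenient, strict)
```

The same respondent receives a score under one rule and a missing value under the other, which is why the rule itself belongs in the documentation.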
Data-Level Documentation
• Question routing (if skip patterns used)
– Identify number of participants asked each
question/path through survey
• Error checking/validation
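A minimal sketch of how routing counts and a simple validation check might be produced together. The survey, the variable names has_job and job_hours, and the skip rule are all hypothetical.

```python
# Hypothetical survey: "job_hours" is asked only when has_job == "yes".
records = [
    {"id": 1, "has_job": "yes", "job_hours": 40},
    {"id": 2, "has_job": "no", "job_hours": None},
    {"id": 3, "has_job": "yes", "job_hours": None},
    {"id": 4, "has_job": "no", "job_hours": 10},
]


def asked_job_hours(record):
    """Skip rule: only participants with a job are routed to job_hours."""
    return record["has_job"] == "yes"


# Number of participants routed to the question.
asked = [r for r in records if asked_job_hours(r)]
n_asked = len(asked)

# Check 1: an answer exists although the question should have been skipped.
answered_but_skipped = [r["id"] for r in records
                        if not asked_job_hours(r) and r["job_hours"] is not None]

# Check 2: the question was asked but no answer was recorded.
asked_but_missing = [r["id"] for r in asked if r["job_hours"] is None]

print(n_asked, answered_but_skipped, asked_but_missing)
```

Reporting the number asked alongside the number answered lets a reader tell a skipped question apart from a refused or lost one.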
Data-Level Documentation
• Reliability of scales
– Calculate Cronbach’s alpha for each scale included
in the data
– Compare values for your study to previously
reported values in the literature
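For reference, Cronbach's alpha for a scale of k items is k/(k-1) × (1 − sum of item variances / variance of the total score). The sketch below computes it with the standard library only; the function name and the toy data are illustrative, and it assumes complete cases (no missing responses).

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete cases.

    rows: one list of item scores per respondent, all of the same length.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: three respondents, two items.
consistent = [[1, 1], [2, 2], [3, 3]]
mixed = [[1, 2], [2, 1], [3, 3]]

alpha_consistent = cronbach_alpha(consistent)
alpha_mixed = cronbach_alpha(mixed)
print(round(alpha_consistent, 3), round(alpha_mixed, 3))
```

Statistical packages offer their own procedures for this; the point of the sketch is only to make explicit which quantity is being compared against the published values.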
Types of Data Documentation
• Tabular codebook (Excel)
– Good for organizing a large amount of information
concisely
– Sortable
– Filterable
– Customizable; can hide columns that may be
needed but are not of interest to a general
audience
Tabular Codebook
Types of Data Documentation
• Annotated instrument
– Contains basic variable and value information in
context
– Easy to interpret
– Difficult to integrate much additional detail; not
useful for some forms of data
Annotated Instrument
Study-Level Documentation
• What are the most important elements to
document?
– Data elements
– Study elements
– Processes and decisions
Study-Level Documentation
• Details about the source of the data
– Study design and purpose
– Collection method
– Information about the research sample
– Longitudinal time points (if applicable)
Study-Level Documentation
• Information about data files
– File name/version
– Date created
– Number of records
– Number of variables
– Changes since last version of file
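Several of these file-level facts can be read from the file itself rather than typed by hand. A minimal sketch, assuming a CSV file with a header row; the file name, version string, date, and change note are hypothetical, and the data is held in memory so the example is self-contained.

```python
import csv
import io

# Stand-in for the contents of a data file (hypothetical values).
csv_text = "id,dmgreat,dmgrtst,dmnolike\n1,5,5,1\n2,4,5,2\n3,3,4,2\n"


def summarize_file(file_name, version, created, text, changes):
    """Build a file-level documentation record from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return {
        "file_name": file_name,
        "version": version,
        "date_created": created,
        "n_records": len(rows),                 # data rows, header excluded
        "n_variables": len(reader.fieldnames),  # columns in the header
        "changes": changes,
    }


info = summarize_file(
    file_name="dm_survey_v02.csv",
    version="02",
    created="2015-10-01",
    text=csv_text,
    changes="Added record 3.",
)
for key, value in info.items():
    print(f"{key}: {value}")
```

Counting records and variables programmatically gives a quick check that the file in hand is the version the documentation describes.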
Study-Level Documentation
• Information about measures used
– Description of measure
– Description of scales
– Source of measure, including references as
appropriate
Study-Level Documentation
Programs used to process/manipulate data
– Documentation within program (comments)
– Documentation of what various programs do and
in what order they are used
Program   Description
SSIS_01   Creates data set with 1st batch of data. Includes scoring code for social skills and problem behavior scales and subscales.
SSIS_01a  Corrects scoring issue with problem behavior scale.
SSIS_02   Adds 2nd batch of data; adds assessment date and birth date information to allow calculation of age-dependent scores.
SSIS_03   Adds 3rd batch of data.
Study-Level Documentation
• Data narrative
– Good for measure/study-level information
Decision and Process Documentation
• What are the most important elements to
document?
– Data elements
– Study elements
– Processes and decisions
Decision and Process Documentation
• By far, the least established area of research
documentation.
• Due to individual differences between
research projects, it can be difficult to identify
a standard template.
Decision and Process Documentation
Elements to include in documentation:
• Scope (variables/measures)
• Time (if multiple time points)
• Purpose of the process, or the situation that
requires a decision
Decision and Process Documentation
Elements to include in documentation:
• Information from the data that describes or
affects the decision or process
• A description of the process itself, including:
– Any software or tools needed to complete the
process
– Any resources/references used
Decision and Process Documentation
• What sorts of decisions and processes should
be documented with this level of detail?
– Basic scales and processes that are commonly
utilized may not require this much detail
– Processes and procedures that are not well
established or that deviate significantly from the
standard method should be documented
Decision and Process Documentation
• Examples of processes that might need to be
documented
– Naming conventions for variables
– Naming conventions for data files
– Structure of data directories
– Version information
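A documented naming convention can also be written as a pattern that file names are checked against. The convention below (study, wave, two-digit version) is invented for illustration and is not the one used by the presenters.

```python
import re

# Hypothetical convention: <study>_w<wave>_v<two-digit version>.csv
FILE_NAME_PATTERN = re.compile(
    r"^(?P<study>[a-z0-9]+)_w(?P<wave>\d)_v(?P<version>\d{2})\.csv$"
)


def parse_file_name(name):
    """Return the parts of a conforming file name, or None if it does not fit."""
    match = FILE_NAME_PATTERN.match(name)
    return match.groupdict() if match else None


good = parse_file_name("ssis_w1_v02.csv")
bad = parse_file_name("SSIS final FINAL.csv")
print(good, bad)
```

Writing the rule down once, in a form that can be tested, keeps version information in the file name consistent across people and over time.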
Decision and Process Documentation
• Examples of decisions that might need to be
documented
– Resolving discrepancies in data obtained from
multiple sources or at multiple time points
– Data transformations that require interpretation
Tools for Documentation
• Statistical software packages (e.g., SAS, Stata)
– Variable information (PROC CONTENTS in SAS; describe in Stata)
– Provides a good starting point for a codebook
• Database management systems
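To show the kind of variable listing such tools produce, here is a rough Python analogue. It does not reproduce the output of PROC CONTENTS or describe; the function describe_variables and the sample records are assumptions made for illustration.

```python
def describe_variables(rows):
    """Summarize each variable: name, type, widest value, missing count.

    rows: list of dicts that all share the same keys (None = missing).
    """
    summary = []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        is_numeric = all(isinstance(v, (int, float)) for v in values)
        summary.append({
            "name": name,
            "type": "numeric" if is_numeric else "character",
            "length": max((len(str(v)) for v in values), default=0),
            "n_missing": len(rows) - len(values),
        })
    return summary


# Hypothetical records.
rows = [
    {"id": 1, "site": "Durham", "dmgreat": 5},
    {"id": 2, "site": "Raleigh", "dmgreat": None},
]

summary = describe_variables(rows)
for variable in summary:
    print(variable)
```

A listing like this covers only the structural columns of a codebook; question text, value labels, and notes still have to be added by someone who knows the study.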
Tools for Documentation
• Data collection instruments
– Paper forms
– Electronic/online collection
PROC CODEBOOK (SAS)
• PROC CODEBOOK is a SAS macro that creates
a codebook based on a SAS data set
PROC CODEBOOK (SAS)
• Requirements
– Labels on variables and data set
– Formats assigned to categorical values
– Minimum of 1 categorical/2 numeric variables
• Optional elements
– Ordering of variables (default is by variable name)
– ODS formatting of title text
PROC CODEBOOK (SAS)
• Can be useful when dealing with data sets that
include SAS formats
• If data set does not already have formats applied,
may take as much time to add them as to create
your own codebook (which has more flexibility)
• To download the SAS macro and access
documentation, visit
http://www.cpc.unc.edu/research/tools/data_analysis/proc_codebook
Documentation Standards
• How can we document the data in a way that
helps interested parties find the data?
• Dublin Core
– Includes 15 standard elements.
– Intended for describing a wide range of different web-
based or physical resources
• Data Documentation Initiative
– An international specification for describing data from the
social, behavioral, and economic sciences
– Supports the entire research data lifecycle
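As a small illustration of the Dublin Core idea, the sketch below lists the 15 elements and checks a metadata record against them. The record describes this presentation using the title, author, and date from the opening slide; the language value and the helper unknown_elements are assumptions.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def unknown_elements(record):
    """Return keys in a metadata record that are not Dublin Core elements."""
    return sorted(set(record) - set(DUBLIN_CORE_ELEMENTS))


record = {
    "title": "Elements of Data Documentation",
    "creator": "Adam Mack",
    "date": "2015-10-01",
    "language": "en",  # assumed
}

print(len(DUBLIN_CORE_ELEMENTS), unknown_elements(record))
print(unknown_elements({"title": "Example", "author": "Someone"}))
```

Every element is optional and repeatable in Dublin Core, so a check like this catches misspelled or non-standard field names rather than missing ones.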
The Takeaway
• Good documentation is not just a product, it’s
an approach
Resources
• Inter-university Consortium for Political and
Social Research (ICPSR)
– Guide to Social Science Data Preparation and
Archiving
• Cornell Research Data Management Service
Group
– Guide to writing "readme" style metadata
• Duke University Libraries
Questions?
• Ask away!
• If you would like to talk more about
documentation for your own projects, contact
us at ehdidata@duke.edu.
• Thanks for coming!
Acknowledgements
For their help in putting together this workshop:
• Lorrie Schmid
• Chandler Thomas
And for helping keep you interested in the material:
• Darth Vader
• Success Kid
• Mark Wahlberg (and @ResearchMark)
