SlideShare a Scribd company logo
The Data Analysis Workflow
Professor Vernon Gayle
vernon.gayle@ed.ac.uk
@Profbigvern
https://github.com/vernongayle
The Data Analysis
Workflow
1. Have you ever lost a file?
2. Have you ever wondered if you have
deleted a file?
3. Have you and a colleague ever been
working on different versions of a file?
A Thought Experiment (be honest…)
4. Have you ever struggled to identify data files?
chapter1_2019.dat
chap1_2019.dat
Be honest…A Thought Experiment (be honest…)
Data
Download
Data
Wrangling
SPSS
Stata
R
Python
Data
Download
Data
Wrangling
Exploratory
Data
Analysis
Statistical
Modelling
Write-Up
Results
SPSS
Stata
R
Python
Analysing data without a
planned and organised
workflow can be compared to
drinking and driving. In both
situations it doesn’t matter how
careful you are, it is still highly
likely to end in a wreck!
(Philip Stark UC Berkeley)
Therefore just like drinking and
driving, we strongly warn
against not having a
systematic workflow
The Workflow
Workflow should be planned and carefully
orchestrated
Workflow MUST NOT be adhoc
(e.g. piece-meal, developed as a reaction to mistakes etc.)
The Workflow
The Workflow Cycle
Data
Download
Data
Wrangling
SPSS
Stata
R
Python
GUIs will leave you in a sticky mess!
Drop down menus = no audit trail
Data Wrangling
• Selecting variables (surveys have large k)
• Operationalising measures (e.g. income; social class; education)
• Re-coding variables
• Selection cases (sub-samples in large surveys)
• Missing data
Data Wrangling
• Selecting variables (surveys have large k)
• Operationalising measures (e.g. income; social class;
education)
• Re-coding variables
• Selection cases (sub-samples in large surveys)
• Missing data
Minor actions in the
Data Wrangling Phase
can have major consequences!
Exploratory
Data
Analysis
Statistical
Modelling
Write-Up
Results
SPSS
Stata
R
Python
Data Analysis
• Which cases
• Which variables (format)
• Missing data
• Estimation method (mle; gls)
• Weights and survey structure (svy)
• Setting a seed
• Number of quadrature points
• Which version or library
Data Analysis
• Which cases
• Which variables (format)
• Missing data
• Estimation method (mle; gls)
• Weights and survey structure (svy)
• Setting a seed
• Number of quadrature points
• Which version or library
Better supporting YOU and what YOU DO
Not changing you into something YOU ARE NOT
Improving the Workflow
ALL SERIOUS WORK MUST
BE REPRODUCIBLE!
There MUST be an audit trail
Transparency
Progamming
Efficiency
Accuracy
Reproducibility
A Planned Workflow Has Benefits
• Accuracy
• minimising information loss and errors in analyses and output
• Programming Efficiency
• automation, maximising features in software
• Transparency
• showing what you did, why, when, how
• Reproducibility
• same results every time whoever or wherever
• editing, rewriting reports or re-submission of papers
26
Four Pillars of Wisdom
It is always easier to document today than it is tomorrow!
Corollary 1:
Nobody likes to write documentation
Corollary 2:
Nobody ever regrets having written documentation
Long’s Law
Has anyone in the history of data analysis ever said
“these files are too well documented”
Long’s Law
Good News
•Improving the workflow with a modest amount of
effort
•The less experience you have the better
(start from the very beginning)
Some Practical
Strategies and Tactics
Standard Directory Structure
File Naming Protocols
File Name = name_date_depositor's initials_version_type
Therefore bhpsaindresp_20140506_vg_v1.dta
Would be a
a. The British Household Panel Survey File “aindresp”
b. Deposited on 6th May 2014
c. Deposited by vg (Vernon Gayle)
d. Version v1
e. File type (e.g. a Stata .dta file)
File Naming Protocols
Analysis File Structure
and Information
Variable Naming Protocols
• Good data providers will tend to stick to clearly defined variable
naming conventions
• Wave 1 of UKHLS a_sex is the gender variable, and in wave 2
of the study b_sex is the gender variable (UK Data Archive
Study Number 6614)
• Youth Cohort Time Series for England, Wales and Scotland,
1984-2002 the variable t0schtyp is the type of school that the
pupil attended in Year 11 (at time point t0 in the study) (UK Data
Archive Study Number 5765)
By contrast in the less well curated Youth Cohort Study
of England and Wales Cohort 4 (UK Data Archive
Study Number 3107) this is the opaque variable
dx11_a
Variable Naming Protocols
Estimating Work Time…
Benefits of a Good Workflow
• Limits duplication of effort
• Systematic work minimises errors
• Helps to detect errors
• Editing work, resubmitting papers etc.
• Aids collaboration
• Supports additional (e.g. future) work
What To Do…
• Audit trail
• Don’t use drop menus ( a gui = a sticky mess)
• Systematic directory structure, file names, versions,
variable names etc.
• Plan! don’t do ad hoc work or work on the fly
• Write comments
J. Scott Long has posted a really good pdf version of a talk on the
workflow http://www.ihrp.uic.edu/files/Workflow%20Slides%20JSLong%20110410.pdf
http://eprints.ncrm.ac.uk/4000/
ALL SERIOUS WORK MUST
BE REPRODUCIBLE!
There MUST be an audit trail
How to cite this video
Gayle, V. (2020) The Data Analysis Workflow. Available at: https://www.ncrm.ac.uk
(Accessed: day month year)
The Data Analysis Workflow
Professor Vernon Gayle
vernon.gayle@ed.ac.uk
@Profbigvern
https://github.com/vernongayle

More Related Content

Similar to The Data Analysis Workflow

Studying archives of online behavior
Studying archives of online behaviorStudying archives of online behavior
Studying archives of online behavior
James Howison
 
Data Management - Basic Concepts
Data Management - Basic ConceptsData Management - Basic Concepts
Data Management - Basic Concepts
Sr Edith Bogue
 
Site search analytics workshop presentation
Site search analytics workshop presentationSite search analytics workshop presentation
Site search analytics workshop presentation
Louis Rosenfeld
 
Data and Donuts: Data organization
Data and Donuts: Data organizationData and Donuts: Data organization
Data and Donuts: Data organization
C. Tobin Magle
 
Business analyst
Business analystBusiness analyst
Business analyst
Hemanth Kumar
 
Support Your Data, Kyoto University
Support Your Data, Kyoto UniversitySupport Your Data, Kyoto University
Support Your Data, Kyoto University
Stephanie Simms
 
Data Processing DOH Workshop.pptx
Data Processing DOH Workshop.pptxData Processing DOH Workshop.pptx
Data Processing DOH Workshop.pptx
charlslabarda
 
Best practices data collection
Best practices data collectionBest practices data collection
Best practices data collection
Sherry Lake
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
Paul Groth
 
Metopen 6
Metopen 6Metopen 6
Metopen 6
Ali Murfi
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
Susanna-Assunta Sansone
 
Six Sigma Presentation Storybd 07 Mar24
Six Sigma Presentation Storybd 07 Mar24Six Sigma Presentation Storybd 07 Mar24
Six Sigma Presentation Storybd 07 Mar24
SKelly514
 
A Guide for Reproducible Research
A Guide for Reproducible ResearchA Guide for Reproducible Research
A Guide for Reproducible Research
Yasmin AlNoamany, PhD
 
Ischools workshop - 4 - data discovery
Ischools workshop - 4 - data discoveryIschools workshop - 4 - data discovery
Ischools workshop - 4 - data discovery
ARDC
 
Doing your systematic review: managing data and reporting
Doing your systematic review: managing data and reportingDoing your systematic review: managing data and reporting
Doing your systematic review: managing data and reporting
University of Liverpool Library
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
Paul Groth
 
AgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use CasesAgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use Cases
Rothamsted Research, UK
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Carole Goble
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
c.titus.brown
 
Data validation in the Digital Age
Data validation in the Digital AgeData validation in the Digital Age
Data validation in the Digital Age
J T "Tom" Johnson
 

Similar to The Data Analysis Workflow (20)

Studying archives of online behavior
Studying archives of online behaviorStudying archives of online behavior
Studying archives of online behavior
 
Data Management - Basic Concepts
Data Management - Basic ConceptsData Management - Basic Concepts
Data Management - Basic Concepts
 
Site search analytics workshop presentation
Site search analytics workshop presentationSite search analytics workshop presentation
Site search analytics workshop presentation
 
Data and Donuts: Data organization
Data and Donuts: Data organizationData and Donuts: Data organization
Data and Donuts: Data organization
 
Business analyst
Business analystBusiness analyst
Business analyst
 
Support Your Data, Kyoto University
Support Your Data, Kyoto UniversitySupport Your Data, Kyoto University
Support Your Data, Kyoto University
 
Data Processing DOH Workshop.pptx
Data Processing DOH Workshop.pptxData Processing DOH Workshop.pptx
Data Processing DOH Workshop.pptx
 
Best practices data collection
Best practices data collectionBest practices data collection
Best practices data collection
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
Metopen 6
Metopen 6Metopen 6
Metopen 6
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
 
Six Sigma Presentation Storybd 07 Mar24
Six Sigma Presentation Storybd 07 Mar24Six Sigma Presentation Storybd 07 Mar24
Six Sigma Presentation Storybd 07 Mar24
 
A Guide for Reproducible Research
A Guide for Reproducible ResearchA Guide for Reproducible Research
A Guide for Reproducible Research
 
Ischools workshop - 4 - data discovery
Ischools workshop - 4 - data discoveryIschools workshop - 4 - data discovery
Ischools workshop - 4 - data discovery
 
Doing your systematic review: managing data and reporting
Doing your systematic review: managing data and reportingDoing your systematic review: managing data and reporting
Doing your systematic review: managing data and reporting
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
AgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use CasesAgriFood Data, Models, Standards, Tools, Use Cases
AgriFood Data, Models, Standards, Tools, Use Cases
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Data validation in the Digital Age
Data validation in the Digital AgeData validation in the Digital Age
Data validation in the Digital Age
 

More from JonathanEarley3

Creating video presentations for research methods teaching video slides
Creating video presentations for research methods teaching video slidesCreating video presentations for research methods teaching video slides
Creating video presentations for research methods teaching video slides
JonathanEarley3
 
Count data
Count dataCount data
Count data
JonathanEarley3
 
Reproducible Social Research
Reproducible Social ResearchReproducible Social Research
Reproducible Social Research
JonathanEarley3
 
Limitations of walking methods and integrating digital methods for disseminat...
Limitations of walking methods and integrating digital methods for disseminat...Limitations of walking methods and integrating digital methods for disseminat...
Limitations of walking methods and integrating digital methods for disseminat...
JonathanEarley3
 
Walking methods in practice – two international case studies
Walking methods in practice – two international case studiesWalking methods in practice – two international case studies
Walking methods in practice – two international case studies
JonathanEarley3
 
Theory of walking methods
Theory of walking methodsTheory of walking methods
Theory of walking methods
JonathanEarley3
 
Embodied Methodologies: The Body as Research Instrument
Embodied Methodologies: The Body as Research InstrumentEmbodied Methodologies: The Body as Research Instrument
Embodied Methodologies: The Body as Research Instrument
JonathanEarley3
 
Participatory data gathering with Ketso
Participatory data gathering with KetsoParticipatory data gathering with Ketso
Participatory data gathering with Ketso
JonathanEarley3
 

More from JonathanEarley3 (8)

Creating video presentations for research methods teaching video slides
Creating video presentations for research methods teaching video slidesCreating video presentations for research methods teaching video slides
Creating video presentations for research methods teaching video slides
 
Count data
Count dataCount data
Count data
 
Reproducible Social Research
Reproducible Social ResearchReproducible Social Research
Reproducible Social Research
 
Limitations of walking methods and integrating digital methods for disseminat...
Limitations of walking methods and integrating digital methods for disseminat...Limitations of walking methods and integrating digital methods for disseminat...
Limitations of walking methods and integrating digital methods for disseminat...
 
Walking methods in practice – two international case studies
Walking methods in practice – two international case studiesWalking methods in practice – two international case studies
Walking methods in practice – two international case studies
 
Theory of walking methods
Theory of walking methodsTheory of walking methods
Theory of walking methods
 
Embodied Methodologies: The Body as Research Instrument
Embodied Methodologies: The Body as Research InstrumentEmbodied Methodologies: The Body as Research Instrument
Embodied Methodologies: The Body as Research Instrument
 
Participatory data gathering with Ketso
Participatory data gathering with KetsoParticipatory data gathering with Ketso
Participatory data gathering with Ketso
 

Recently uploaded

BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
Nguyen Thanh Tu Collection
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
Celine George
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Dr. Vinod Kumar Kanvaria
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
RitikBhardwaj56
 
Cognitive Development Adolescence Psychology
Cognitive Development Adolescence PsychologyCognitive Development Adolescence Psychology
Cognitive Development Adolescence Psychology
paigestewart1632
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
IreneSebastianRueco1
 
How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
Celine George
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
National Information Standards Organization (NISO)
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
Jean Carlos Nunes Paixão
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
RAHUL
 
Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
chanes7
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
Israel Genealogy Research Association
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
Celine George
 

Recently uploaded (20)

BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
 
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In...
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
 
Cognitive Development Adolescence Psychology
Cognitive Development Adolescence PsychologyCognitive Development Adolescence Psychology
Cognitive Development Adolescence Psychology
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
 
How to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRMHow to Manage Your Lost Opportunities in Odoo 17 CRM
How to Manage Your Lost Opportunities in Odoo 17 CRM
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
A Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdfA Independência da América Espanhola LAPBOOK.pdf
A Independência da América Espanhola LAPBOOK.pdf
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
 
Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
 
The Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collectionThe Diamonds of 2023-2024 in the IGRA collection
The Diamonds of 2023-2024 in the IGRA collection
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
 

The Data Analysis Workflow

  • 1. The Data Analysis Workflow Professor Vernon Gayle vernon.gayle@ed.ac.uk @Profbigvern https://github.com/vernongayle
  • 2.
  • 4. 1. Have you ever lost a file? 2. Have you ever wondered if you have deleted a file? 3. Have you and a colleague ever been working on different versions of a file? A Thought Experiment (be honest…)
  • 5. 4. Have you ever struggled to identify data files? chapter1_2019.dat chap1_2019.dat Be honest…A Thought Experiment (be honest…)
  • 6.
  • 7.
  • 8.
  • 9.
  • 12. Analysing data without a planned and organised workflow can be compared to drinking and driving. In both situations it doesn’t matter how careful you are, it is still highly likely to end in a wreck! (Philip Stark UC Berkeley) Therefore just like drinking and driving, we strongly warn against not having a systematic workflow The Workflow
  • 13. Workflow should be planned and carefully orchestrated Workflow MUST NOT be adhoc (e.g. piece-meal, developed as a reaction to mistakes etc.) The Workflow
  • 16. GUIs will leave you in a sticky mess! Drop down menus = no audit trail
  • 17. Data Wrangling • Selecting variables (surveys have large k) • Operationalising measures (e.g. income; social class; education) • Re-coding variables • Selection cases (sub-samples in large surveys) • Missing data
  • 18. Data Wrangling • Selecting variables (surveys have large k) • Operationalising measures (e.g. income; social class; education) • Re-coding variables • Selection cases (sub-samples in large surveys) • Missing data
  • 19. Minor actions in the Data Wrangling Phase can have major consequences!
  • 21. Data Analysis • Which cases • Which variables (format) • Missing data • Estimation method (mle; gls) • Weights and survey structure (svy) • Setting a seed • Number of quadrature points • Which version or library
  • 22. Data Analysis • Which cases • Which variables (format) • Missing data • Estimation method (mle; gls) • Weights and survey structure (svy) • Setting a seed • Number of quadrature points • Which version or library
  • 23. Better supporting YOU and what YOU DO Not changing you into something YOU ARE NOT Improving the Workflow
  • 24. ALL SERIOUS WORK MUST BE REPRODUCIBLE! There MUST be an audit trail
  • 26. • Accuracy • minimising information loss and errors in analyses and output • Programming Efficiency • automation, maximising features in software • Transparency • showing what you did, why, when, how • Reproducibility • same results every time whoever or wherever • editing, rewriting reports or re-submission of papers 26 Four Pillars of Wisdom
  • 27. It is always easier to document today than it is tomorrow! Corollary 1: Nobody likes to write documentation Corollary 2: Nobody ever regrets having written documentation Long’s Law
  • 28. Has anyone in the history of data analysis ever said “these files are too well documented” Long’s Law
  • 29. Good News •Improving the workflow with a modest amount of effort •The less experience you have the better (start from the very beginning)
  • 32. File Naming Protocols File Name = name_date_depositor's initials_version_type Therefore bhpsaindresp_20140506_vg_v1.dta Would be a a. The British Household Panel Survey File “aindresp” b. Deposited on 6th May 2014 c. Deposited by vg (Vernon Gayle) d. Version v1 e. File type (e.g. a Stata .dta file) File Naming Protocols
  • 34. Variable Naming Protocols • Good data providers will tend to stick to clearly defined variable naming conventions • Wave 1 of UKHLS a_sex is the gender variable, and in wave 2 of the study b_sex is the gender variable (UK Data Archive Study Number 6614) • Youth Cohort Time Series for England, Wales and Scotland, 1984-2002 the variable t0schtyp is the type of school that the pupil attended in Year 11 (at time point t0 in the study) (UK Data Archive Study Number 5765)
  • 35. By contrast in the less well curated Youth Cohort Study of England and Wales Cohort 4 (UK Data Archive Study Number 3107) this is the opaque variable dx11_a Variable Naming Protocols
  • 37. Benefits of a Good Workflow • Limits duplication of effort • Systematic work minimises errors • Helps to detect errors • Editing work, resubmitting papers etc. • Aids collaboration • Supports additional (e.g. future) work
  • 38.
  • 39. What To Do… • Audit trail • Don’t use drop menus ( a gui = a sticky mess) • Systematic directory structure, file names, versions, variable names etc. • Plan! don’t do ad hoc work or work on the fly • Write comments
  • 40. J. Scott Long has posted a really good pdf version of a talk on the workflow http://www.ihrp.uic.edu/files/Workflow%20Slides%20JSLong%20110410.pdf http://eprints.ncrm.ac.uk/4000/
  • 41. ALL SERIOUS WORK MUST BE REPRODUCIBLE! There MUST be an audit trail
  • 42. How to cite this video Gayle, V. (2020) The Data Analysis Workflow. Available at: https://www.ncrm.ac.uk (Accessed: day month year)
  • 43. The Data Analysis Workflow Professor Vernon Gayle vernon.gayle@ed.ac.uk @Profbigvern https://github.com/vernongayle