Tim Morris
MRC CTU at UCL
25th UK Stata Conference
Michael Crowther
University of Leicester
The Right Way to code
simulation studies in Stata
https://github.com/tpmorris/TheRightWay
tldr:
Michael’s way is unambiguously wrong
My way is not unambiguously right
The Right Way is unambiguously right
What is a simulation study?
Use of (pseudo) random numbers to produce data from
some distribution to help us to study properties of a
statistical method.
An example:
1. Generate data from a distribution with parameter $\theta$
2. Apply analysis method to data, producing an estimate $\hat{\theta}$
3. Repeat (1) and (2) $n_{sim}$ times
4. Compare $\theta$ with $E[\hat{\theta}]$ – if we had not generated the data,
we would not know $\theta$ and so could not do this.
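The four steps above can be sketched in a few lines of Stata. This is a toy illustration, not the example used later in the talk: the normal data, the sample mean as the method, and the values of theta, the sample size and the number of repetitions are all assumptions chosen to keep the sketch short.

```stata
* Toy sketch: is the sample mean unbiased for theta?
* theta = 1, n = 50 and nsim = 1000 are illustrative assumptions.
clear
set seed 2019
local theta 1
local nsim  1000
tempname sim
tempfile est
postfile `sim' int(rep_id) double(theta_hat) using `est', replace
forvalues i = 1/`nsim' {
    quietly {
        drop _all
        set obs 50
        generate double y = rnormal(`theta', 1)   // step 1: generate data
        summarize y, meanonly                     // step 2: estimate theta
    }
    post `sim' (`i') (r(mean))                    // store this repetition's estimate
}                                                 // step 3: repeat nsim times
postclose `sim'
use `est', clear
summarize theta_hat                               // step 4: compare mean estimate with theta
display "Estimated bias: " r(mean) - `theta'
```

Because theta was set by us, the last two lines can compare the average estimate with the truth, which is exactly what is impossible with real data.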
Some background
• Consistent terminology with definitions
• ADEMP (Aims, Data-generating mechanisms,
Estimands, Methods, Performance measures): D, E, M
are important in coding simulation studies
Four datasets (possibly)
• Simulated: e.g. a simulated hypothetical study
• Estimates: some summary of $n_{sim}$ repetitions
• States: record of $n_{sim} + 1$ RNG states – at the beginning
of each repetition and one after the final repetition
• Performance: summarises estimates of performance
(bias, empirical SE, coverage etc.), and (hopefully) their
Monte Carlo SE, for each D, E, M
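One reason to keep a states dataset is that any single repetition can be re-run later without repeating everything before it. A minimal sketch of the idea, assuming Stata 14 or later (where `c(rngstate)` and `set rngstate` are available); the seed and variable names are arbitrary:

```stata
* Sketch: restoring a stored RNG state reproduces the same random draws.
clear
set seed 5137                      // arbitrary seed
local state = c(rngstate)          // state at the start of a "repetition"
set obs 5
generate double u1 = runiform()    // draws made during the repetition
set rngstate `state'               // rewind to the stored state
generate double u2 = runiform()    // the same draws again
assert u1 == u2
```

In a real study the state would be written to the states dataset rather than held in a local macro.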
This talk
This talk focuses on the code that produces a simulated
dataset and returns the estimates and states datasets.
I teach simulation studies a lot. Errors in coding occur
primarily in generating data in the way you want, and in
storing summaries of each repetition (estimates data).
A simple simulation study:
Aims
Suppose we are interested in the analysis of a randomised
trial with a survival outcome and unknown baseline hazard
function.
Aim to evaluate the impacts of:
1. misspecifying the baseline hazard function on the
estimate of the treatment effect
2. fitting a more complex model than necessary
3. avoiding the issue by using a semiparametric model
Data generating mechanisms
Simulate nobs=100 and then nobs=500 from a Weibull
distribution with $X_i \sim \mathrm{Bern}(0.5)$ and
$h(t) = \lambda \gamma t^{\gamma-1} \exp(X_i \theta)$, where $\lambda = 0.1$, $\theta = -0.5$
(admin censoring at 5 years)
Study $\gamma = 1$
then $\gamma = 1.5$
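One way to draw from this hazard is inverse-transform sampling: the survival function is $S(t) = \exp\{-\lambda t^{\gamma} \exp(X_i \theta)\}$, so setting $S(T) = U$ with $U \sim \mathrm{Uniform}(0,1)$ gives $T = \{-\ln U / (\lambda \exp(X_i \theta))\}^{1/\gamma}$. The sketch below generates one dataset this way; the variable names (`trt`, `t`, `died`) and the seed are assumptions, and the talk's own code may generate the data differently.

```stata
* Sketch of one data-generating mechanism (nobs = 100, gamma = 1.5).
clear
set seed 3491                      // arbitrary seed, for reproducibility only
local nobs   100
local lambda 0.1
local gamma  1.5
local theta  -0.5
set obs `nobs'
generate byte   trt  = rbinomial(1, 0.5)    // X_i ~ Bern(0.5)
* Inverse-transform draw of Weibull survival times
generate double t    = (-ln(runiform()) / (`lambda' * exp(`theta' * trt)))^(1 / `gamma')
generate byte   died = t <= 5               // event observed within 5 years
replace t = 5 if !died                      // administrative censoring at 5 years
stset t, failure(died)
```

Changing the `nobs` and `gamma` locals gives the other three data-generating mechanisms.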
Estimands and Methods
Estimand is $\theta$, the log hazard ratio for treatment vs. control
Methods:
1. Exponential model
2. Weibull model
3. Cox model
(Don’t need to consider performance measures for this talk;
see London Stata Conference 2020!)
Well-structured estimates: long–long format
(Inputs: rep_id, n_obs, truegamma, method. Results: theta_hat, se.)

rep_id  n_obs  truegamma  method       theta_hat   se
1       100    γ=1        Exponential  -1.690183   .5477225
1       100    γ=1        Weibull      -1.712495   .54808
1       100    γ=1        Cox          -1.688541   .5481199
1       100    γ=1.5      Exponential  -.5390697   .2495417
1       100    γ=1.5      Weibull      -.6375546   .2504361
1       100    γ=1.5      Cox          -.6162164   .2510851
1       500    γ=1        Exponential  -.5785365   .1548867
1       500    γ=1        Weibull      -.5820988   .1549543
1       500    γ=1        Cox          -.5867053   .1550035
1       500    γ=1.5      Exponential  -.4040936   .1188226
1       500    γ=1.5      Weibull      -.4308287   .1189563
1       500    γ=1.5      Cox          -.4335943   .1190354
Well-structured estimates: wide–long format
(Inputs: rep_id, n_obs, gamma. Results: theta_* and se_* for each method.)

rep_id  n_obs  gamma  theta_exp  se_exp    theta_wei  se_wei    theta_cox  se_cox
1       100    γ=1    -1.690183  .5477225  -1.712495  .54808    -1.688541  .54811
1       100    1.5    -.5164924  .2589072  -.5594682  .2595417  -.5601631  .25988
1       500    γ=1    -.6253604  .1511858  -.6269046  .1512856  -.6343831  .15134
1       500    1.5    -.478514   .1176905  -.5447887  .1179448  -.5460246  .11803
2       100    γ=1    -.377425   .3562627  -.3859514  .3563656  -.3728753  .35644
2       100    1.5    -.4841157  .2456835  -.5684879  .2466851  -.5850977  .24722
2       500    γ=1    -.6477997  .1615617  -.6477113  .161647   -.6452857  .16166
2       500    1.5    -.3358569  .1222584  -.3609435  .1223288  -.3619137  .12240
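The two layouts hold the same information, and moving from wide–long to long–long is a single `reshape`. A small sketch using the first two rows of the wide–long table; the variable names follow the table headers, and the final rename to `theta_hat` and `se` is an assumption made to match the long–long headers.

```stata
* Sketch: reshape wide-long estimates (one row per repetition and DGM)
* into long-long estimates (one row per repetition, DGM and method).
clear
input rep_id n_obs gamma theta_exp se_exp theta_wei se_wei theta_cox se_cox
1 100 1   -1.690183 .5477225 -1.712495 .54808   -1.688541 .54811
1 100 1.5 -.5164924 .2589072 -.5594682 .2595417 -.5601631 .25988
end
* The method name is the suffix after "theta_" and "se_"
reshape long theta_ se_, i(rep_id n_obs gamma) j(method) string
rename (theta_ se_) (theta_hat se)
list, sepby(gamma)
```

Here `method` ends up as a string variable holding "cox", "exp" and "wei"; a labelled numeric variable would need an extra step.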
The simulate approach
From the help file:
‘simulate eases the programming task of
performing Monte Carlo-type simulations’
… ‘questionable’ to ‘no’.
The simulate approach
If you haven’t used it, simulate works as follows:
1. You write a program (rclass or eclass) that follows
standard Stata syntax and returns quantities of interest
as scalars.
2. Your program will generate ≥1 simulated dataset and
return estimates for ≥1 estimands obtained by ≥1
methods.
3. You use simulate to repeatedly call the program.
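As a rough sketch of these three steps for the survival example: the program name `simsurv`, its options, the returned scalar names, the seed and the 20 repetitions are all illustrative assumptions, and the data are generated by inverse-transform sampling rather than by whatever the talk's own code uses.

```stata
* Step 1-2: an rclass program that simulates one dataset and returns estimates
capture program drop simsurv
program define simsurv, rclass
    version 14
    syntax [, nobs(integer 100) gamma(real 1)]
    drop _all
    set obs `nobs'
    generate byte   trt  = rbinomial(1, 0.5)
    generate double t    = (-ln(runiform()) / (0.1 * exp(-0.5 * trt)))^(1 / `gamma')
    generate byte   died = t <= 5
    replace t = 5 if !died                  // administrative censoring
    stset t, failure(died)
    streg trt, distribution(exponential) nohr
    return scalar theta_exp = _b[trt]
    return scalar se_exp    = _se[trt]
    streg trt, distribution(weibull) nohr
    return scalar theta_wei = _b[trt]
    return scalar se_wei    = _se[trt]
    stcox trt
    return scalar theta_cox = _b[trt]
    return scalar se_cox    = _se[trt]
end

* Step 3: simulate calls the program repeatedly and collects the scalars
set seed 72789
simulate theta_exp=r(theta_exp) se_exp=r(se_exp)     ///
         theta_wei=r(theta_wei) se_wei=r(se_wei)     ///
         theta_cox=r(theta_cox) se_cox=r(se_cox),    ///
         reps(20): simsurv, nobs(100) gamma(1.5)
describe
```

The result is one row per repetition with six columns for a single data-generating mechanism, and no column recording the repetition number, nobs or gamma.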
The simulate approach
I’ve wished-&-grumbled here and on Statalist that
simulate:
– Does not allow posting of the repetition number (an
oversight?)
– Precludes putting strings into the estimates dataset,
meaning non-numerical inputs (D) and contents of
c(rngstate) cannot be stored.
– Produces ultra-wide data (if E, M and D vary, the resulting
estimates must be stored across a single row!)
Your code is clean; your estimates dataset is a mess.
The post approach
Structure:
tempname tim
postfile `tim' int(rep) str5(dgm estimand) ///
    double(theta se) using estimates.dta, replace
forval i = 1/`nsim' {
    <1st DGM>
    <apply method>
    post `tim' (`i') ("thing") ("theta") (_b[trt]) ///
        (_se[trt])
    <2nd DGM>
}
postclose `tim'
The post approach
+ No shortcomings of simulate
+ Produces a well-formed estimates dataset
– post commands become entangled in the code for
generating and analysing data
– post lines are more error-prone. Suppose you are using
different n. An efficient way to code this is to generate a
dataset (with n observations) and then analyse increasing
subsets of this data for the ‘smaller n’ data-generating
mechanisms. The code can get inelegant and you mis-post.
Your estimates dataset is clean; your code is a mess.
The right approach
One can mash-up the two!
1. Write a program, as you would with simulate
2. Use postfile
3. Call the program
4. Post inputs and returned results using post
5. Use a second postfile for storing rngstates
Why?
1. Appease Michael: Tidy code that is less error-prone.
2. Appease Tim: Tidy estimates (and states) dataset that
avoids error-prone reshaping & formatting acrobatics.
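A sketch of how the five steps might fit together for the survival example. Everything named here is an assumption for illustration: the program `simsurv2`, the file names `estimates.dta` and `states.dta`, the method codes 1 to 3, nsim = 5, and the choice to split `c(rngstate)` into three 2,000-character chunks (which assumes the state string is at most 6,000 characters, as it is for Stata's default generator as far as I know).

```stata
* 1. A program, as you would write for simulate
capture program drop simsurv2
program define simsurv2, rclass
    version 14
    syntax [, nobs(integer 100) gamma(real 1)]
    drop _all
    set obs `nobs'
    generate byte   trt  = rbinomial(1, 0.5)
    generate double t    = (-ln(runiform()) / (0.1 * exp(-0.5 * trt)))^(1 / `gamma')
    generate byte   died = t <= 5
    replace t = 5 if !died
    stset t, failure(died)
    local m 0
    foreach cmd in "streg trt, distribution(exponential)" ///
                   "streg trt, distribution(weibull)" "stcox trt" {
        local ++m                           // 1 = exponential, 2 = Weibull, 3 = Cox
        `cmd'
        return scalar theta`m' = _b[trt]
        return scalar se`m'    = _se[trt]
    }
end

* 2. and 5. One postfile for estimates, a second for RNG states
local nsim 5
tempname est states
postfile `est' int(rep_id n_obs) double(gamma) byte(method) ///
    double(theta_hat se) using estimates.dta, replace
postfile `states' int(rep_id) str2000(s1 s2 s3) using states.dta, replace
local chunks (substr(c(rngstate), 1, 2000)) (substr(c(rngstate), 2001, 2000)) ///
    (substr(c(rngstate), 4001, 2000))

set seed 72789
forvalues i = 1/`nsim' {
    post `states' (`i') `chunks'            // state at the start of repetition i
    foreach n in 100 500 {
        foreach g in 1 1.5 {
            quietly simsurv2, nobs(`n') gamma(`g')    // 3. call the program
            forvalues m = 1/3 {                       // 4. post inputs and results
                post `est' (`i') (`n') (`g') (`m') (r(theta`m')) (r(se`m'))
            }
        }
    }
}
post `states' (`nsim' + 1) `chunks'         // state after the final repetition
postclose `est'
postclose `states'
```

The posting code sits in one place and only ever reads inputs from the loop and results from `r()`, so the estimates arrive in long–long format without any later reshaping.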
A query (grumble?)
• None of the options allow for a well-formatted dataset. I
want to define a (unique) sort order, label variables &
values, use chars… (for value labels, order matters; see
below)
• I believe this stuff has to be done afterwards (?)
• To use 1 "Exponential" 2 "Weibull" and 3 "Cox" (I do), I
have to open estimates.dta, label define and label
values. Could this be done up-front so you could e.g. fill
in DGM codes with “Cox”:method_label rather than
number 2?
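For reference, the after-the-fact formatting described above looks something like the following. The three rows are typed in from the long–long table rather than read from estimates.dta so that the sketch runs on its own; the label name `method_label` comes from the slide, while the variable labels and the `estimand` characteristic are assumptions.

```stata
* Sketch: format an estimates dataset after it has been created.
clear
input int rep_id byte method double theta_hat double se
1 1 -1.690183 .5477225
1 2 -1.712495 .54808
1 3 -1.688541 .5481199
end
label define method_label 1 "Exponential" 2 "Weibull" 3 "Cox"
label values method method_label
label variable theta_hat "Estimated log hazard ratio"
label variable se "Model-based standard error"
char theta_hat[estimand] "log hazard ratio"     // a characteristic stored with the variable
isid rep_id method                              // confirm the sort key is unique
sort rep_id method
* Once the label exists, values can be referred to by name
list if method == "Cox":method_label
```

The `"Cox":method_label` lookup only works once `label define` has run, which is why it cannot be used in the `post` lines of a fresh session without defining the label up front.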

More Related Content

What's hot

Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetIJCERT
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationChristopher Peter Makris
 
Application of Principal Components Analysis in Quality Control Problem
Application of Principal Components Analysisin Quality Control ProblemApplication of Principal Components Analysisin Quality Control Problem
Application of Principal Components Analysis in Quality Control ProblemMaxwellWiesler
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximizationbutest
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learningAmAn Singh
 
AE 497 Spring 2015 Final Report
AE 497 Spring 2015 Final ReportAE 497 Spring 2015 Final Report
AE 497 Spring 2015 Final ReportCatherine McCarthy
 
Classification of Grasp Patterns using sEMG
Classification of Grasp Patterns using sEMGClassification of Grasp Patterns using sEMG
Classification of Grasp Patterns using sEMGPriyanka Reddy
 
Data structures algorithms_tutorial
Data structures algorithms_tutorialData structures algorithms_tutorial
Data structures algorithms_tutorialHarikaReddy115
 
Kcc201728apr2017 170828235330
Kcc201728apr2017 170828235330Kcc201728apr2017 170828235330
Kcc201728apr2017 170828235330JEE HYUN PARK
 
Parameter Estimation for the Weibul distribution model Using Least-Squares Me...
Parameter Estimation for the Weibul distribution model Using Least-Squares Me...Parameter Estimation for the Weibul distribution model Using Least-Squares Me...
Parameter Estimation for the Weibul distribution model Using Least-Squares Me...IJMERJOURNAL
 
Learning from data for wind–wave forecasting
Learning from data for wind–wave forecastingLearning from data for wind–wave forecasting
Learning from data for wind–wave forecastingJonathan D'Cruz
 
2018 Global Azure Bootcamp Azure Machine Learning for neural networks
2018 Global Azure Bootcamp Azure Machine Learning for neural networks2018 Global Azure Bootcamp Azure Machine Learning for neural networks
2018 Global Azure Bootcamp Azure Machine Learning for neural networksSetu Chokshi
 
Forecasting time series for business and operations data: A tutorial
Forecasting time series for business and operations data: A tutorialForecasting time series for business and operations data: A tutorial
Forecasting time series for business and operations data: A tutorialColleen Farrelly
 
J. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AIJ. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AIMLILAB
 
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAIJ. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAIMLILAB
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Seval Çapraz
 
Summary.ppt
Summary.pptSummary.ppt
Summary.pptbutest
 

What's hot (20)

Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
 
Exploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image SegmentationExploration of Imputation Methods for Missingness in Image Segmentation
Exploration of Imputation Methods for Missingness in Image Segmentation
 
Application of Principal Components Analysis in Quality Control Problem
Application of Principal Components Analysisin Quality Control ProblemApplication of Principal Components Analysisin Quality Control Problem
Application of Principal Components Analysis in Quality Control Problem
 
Lecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation MaximizationLecture 18: Gaussian Mixture Models and Expectation Maximization
Lecture 18: Gaussian Mixture Models and Expectation Maximization
 
Supervised and unsupervised learning
Supervised and unsupervised learningSupervised and unsupervised learning
Supervised and unsupervised learning
 
GMM
GMMGMM
GMM
 
AE 497 Spring 2015 Final Report
AE 497 Spring 2015 Final ReportAE 497 Spring 2015 Final Report
AE 497 Spring 2015 Final Report
 
Classification of Grasp Patterns using sEMG
Classification of Grasp Patterns using sEMGClassification of Grasp Patterns using sEMG
Classification of Grasp Patterns using sEMG
 
Itc stock slide
Itc stock slideItc stock slide
Itc stock slide
 
Data structures algorithms_tutorial
Data structures algorithms_tutorialData structures algorithms_tutorial
Data structures algorithms_tutorial
 
Kcc201728apr2017 170828235330
Kcc201728apr2017 170828235330Kcc201728apr2017 170828235330
Kcc201728apr2017 170828235330
 
Parameter Estimation for the Weibul distribution model Using Least-Squares Me...
Parameter Estimation for the Weibul distribution model Using Least-Squares Me...Parameter Estimation for the Weibul distribution model Using Least-Squares Me...
Parameter Estimation for the Weibul distribution model Using Least-Squares Me...
 
Learning from data for wind–wave forecasting
Learning from data for wind–wave forecastingLearning from data for wind–wave forecasting
Learning from data for wind–wave forecasting
 
2018 Global Azure Bootcamp Azure Machine Learning for neural networks
2018 Global Azure Bootcamp Azure Machine Learning for neural networks2018 Global Azure Bootcamp Azure Machine Learning for neural networks
2018 Global Azure Bootcamp Azure Machine Learning for neural networks
 
Forecasting time series for business and operations data: A tutorial
Forecasting time series for business and operations data: A tutorialForecasting time series for business and operations data: A tutorial
Forecasting time series for business and operations data: A tutorial
 
J. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AIJ. Park, AAAI 2022, MLILAB, KAIST AI
J. Park, AAAI 2022, MLILAB, KAIST AI
 
KCC2017 28APR2017
KCC2017 28APR2017KCC2017 28APR2017
KCC2017 28APR2017
 
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAIJ. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
J. Park, H. Shim, AAAI 2022, MLILAB, KAISTAI
 
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
Statistical Data Analysis on a Data Set (Diabetes 130-US hospitals for years ...
 
Summary.ppt
Summary.pptSummary.ppt
Summary.ppt
 

Similar to The Right Way

Poster_Reseau_Neurones_Journees_2013
Poster_Reseau_Neurones_Journees_2013Poster_Reseau_Neurones_Journees_2013
Poster_Reseau_Neurones_Journees_2013Pedro Lopes
 
Extended Fuzzy C-Means with Random Sampling Techniques for Clustering Large Data
Extended Fuzzy C-Means with Random Sampling Techniques for Clustering Large DataExtended Fuzzy C-Means with Random Sampling Techniques for Clustering Large Data
Extended Fuzzy C-Means with Random Sampling Techniques for Clustering Large DataAM Publications
 
Model Selection Techniques
Model Selection TechniquesModel Selection Techniques
Model Selection TechniquesSwati .
 
Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4YoussefKitane
 
IRJET- An Effective Brain Tumor Segmentation using K-means Clustering
IRJET-  	  An Effective Brain Tumor Segmentation using K-means ClusteringIRJET-  	  An Effective Brain Tumor Segmentation using K-means Clustering
IRJET- An Effective Brain Tumor Segmentation using K-means ClusteringIRJET Journal
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUEScscpconf
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUEScsitconf
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)nlt2390
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciencesfsmart01
 
⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity Intervention
⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity Intervention⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity Intervention
⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity InterventionVictor Asanza
 
Comparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of GlaucomaComparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of GlaucomaIRJET Journal
 
IRJET - Detection and Classification of Brain Tumor
IRJET - Detection and Classification of Brain TumorIRJET - Detection and Classification of Brain Tumor
IRJET - Detection and Classification of Brain TumorIRJET Journal
 
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET Journal
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Predictionsriram30691
 
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifeSimplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifePeea Bal Chakraborty
 

Similar to The Right Way (20)

report
reportreport
report
 
Poster_Reseau_Neurones_Journees_2013
Poster_Reseau_Neurones_Journees_2013Poster_Reseau_Neurones_Journees_2013
Poster_Reseau_Neurones_Journees_2013
 
Extended Fuzzy C-Means with Random Sampling Techniques for Clustering Large Data
Extended Fuzzy C-Means with Random Sampling Techniques for Clustering Large DataExtended Fuzzy C-Means with Random Sampling Techniques for Clustering Large Data
Extended Fuzzy C-Means with Random Sampling Techniques for Clustering Large Data
 
DSE-complete.pptx
DSE-complete.pptxDSE-complete.pptx
DSE-complete.pptx
 
Model Selection Techniques
Model Selection TechniquesModel Selection Techniques
Model Selection Techniques
 
Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4
 
Master's Thesis Presentation
Master's Thesis PresentationMaster's Thesis Presentation
Master's Thesis Presentation
 
IRJET- An Effective Brain Tumor Segmentation using K-means Clustering
IRJET-  	  An Effective Brain Tumor Segmentation using K-means ClusteringIRJET-  	  An Effective Brain Tumor Segmentation using K-means Clustering
IRJET- An Effective Brain Tumor Segmentation using K-means Clustering
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
 
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUESNEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
NEURAL NETWORKS WITH DECISION TREES FOR DIAGNOSIS ISSUES
 
Thesis (presentation)
Thesis (presentation)Thesis (presentation)
Thesis (presentation)
 
An Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social SciencesAn Introduction to Simulation in the Social Sciences
An Introduction to Simulation in the Social Sciences
 
⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity Intervention
⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity Intervention⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity Intervention
⭐⭐⭐⭐⭐ Finding a Dynamical Model of a Social Norm Physical Activity Intervention
 
Comparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of GlaucomaComparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
Comparative Study of Pre-Trained Neural Network Models in Detection of Glaucoma
 
IRJET - Detection and Classification of Brain Tumor
IRJET - Detection and Classification of Brain TumorIRJET - Detection and Classification of Brain Tumor
IRJET - Detection and Classification of Brain Tumor
 
TBerger_FinalReport
TBerger_FinalReportTBerger_FinalReport
TBerger_FinalReport
 
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
IRJET - Breast Cancer Prediction using Supervised Machine Learning Algorithms...
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
 
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real LifeSimplified Knowledge Prediction: Application of Machine Learning in Real Life
Simplified Knowledge Prediction: Application of Machine Learning in Real Life
 
Mech ma6452 snm_notes
Mech ma6452 snm_notesMech ma6452 snm_notes
Mech ma6452 snm_notes
 

Recently uploaded

zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxSwapnil Therkar
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.k64182334
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 

Recently uploaded (20)

zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.Genomic DNA And Complementary DNA Libraries construction.
Genomic DNA And Complementary DNA Libraries construction.
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 

The Right Way

  • 1. Tim Morris MRC CTU at UCL 25th UK Stata Conference Michael Crowther University of Leicester The Right Way to code simulation studies in Stata
  • 2. MRC CTU at UCL https://github.com/tpmorris/TheRightWay tldr: Michael’s way is unambiguously wrong My way is not unambiguously right The Right Way is unambiguously right
  • 3. MRC CTU at UCL What is a simulation study? Use of (pseudo) random numbers to produce data from some distribution to help us to study properties of a statistical method. An example: 1. Generate data from a distribution with parameter θ 2. Apply analysis method to data, producing an estimate 𝜃 3. Repeat (1) and (2) nsim times 4. Compare θ with E[ 𝜃] – if we had not generated the data, we would not know θ and so could not do this.
  • 4. MRC CTU at UCL Some background • Consistent terminology with definitions • ADEMP (Aims, Data-generating mechanisms, Estimands, Methods, Performance measures): D, E, M are important in coding simulation studies
  • 5. MRC CTU at UCL Four datasets (possibly) • Simulated: e.g. a simulated hypothetical study • Estimates: some summary of 𝑛 𝑠𝑖𝑚 repetitions • States: record of 𝑛 𝑠𝑖𝑚 + 1 RNG states – at the beginning of each repetition and one after final repetition • Performance: summarises estimates of performance (bias, empirical SE, coverage etc.), and (hopefully) their Monte Carlo SE, for each D, E, M
  • 6. This talk: This talk focuses on the code that produces a simulated dataset and returns the estimates and states datasets. I teach simulation studies a lot. Errors in coding occur primarily in generating data in the way you want, and in storing summaries of each repetition (estimates data).
  • 7. A simple simulation study: Aims. Suppose we are interested in the analysis of a randomised trial with a survival outcome and unknown baseline hazard function. Aim to evaluate the impacts of:
      1. misspecifying the baseline hazard function on the estimate of the treatment effect
      2. fitting a more complex model than necessary
      3. avoiding the issue by using a semiparametric model
  • 8. Data-generating mechanisms: Simulate nobs=100 and then nobs=500 observations from a Weibull distribution with $X_i \sim \mathrm{Bern}(0.5)$ and hazard $h(t) = \lambda \gamma t^{\gamma-1} \exp(X_i \theta)$, where $\lambda = 0.1$ and $\theta = -0.5$ (administrative censoring at 5 years). Study $\gamma = 1$, then $\gamma = 1.5$.
  • 9. Estimands and Methods: The estimand is $\theta$, the log hazard ratio for treatment vs. control. Methods:
      1. Exponential model
      2. Weibull model
      3. Cox model
      (Don’t need to consider performance measures for this talk; see London Stata Conference 2020!)
  • 10. Well-structured estimates: long–long format. Inputs: rep_id, n_obs, truegamma, method. Results: theta_hat, se.
      rep_id  n_obs  truegamma  method       theta_hat   se
      1       100    γ=1        Exponential  -1.690183   .5477225
      1       100    γ=1        Weibull      -1.712495   .54808
      1       100    γ=1        Cox          -1.688541   .5481199
      1       100    γ=1.5      Exponential  -.5390697   .2495417
      1       100    γ=1.5      Weibull      -.6375546   .2504361
      1       100    γ=1.5      Cox          -.6162164   .2510851
      1       500    γ=1        Exponential  -.5785365   .1548867
      1       500    γ=1        Weibull      -.5820988   .1549543
      1       500    γ=1        Cox          -.5867053   .1550035
      1       500    γ=1.5      Exponential  -.4040936   .1188226
      1       500    γ=1.5      Weibull      -.4308287   .1189563
      1       500    γ=1.5      Cox          -.4335943   .1190354
  • 11. Well-structured estimates: wide–long format. Inputs: rep_id, n_obs, gamma. Results: theta_* and se_* for each method.
      rep_id  n_obs  gamma  theta_exp   se_exp    theta_wei   se_wei    theta_cox   se_cox
      1       100    γ=1    -1.690183   .5477225  -1.712495   .54808    -1.688541   .54811
      1       100    1.5    -.5164924   .2589072  -.5594682   .2595417  -.5601631   .25988
      1       500    γ=1    -.6253604   .1511858  -.6269046   .1512856  -.6343831   .15134
      1       500    1.5    -.478514    .1176905  -.5447887   .1179448  -.5460246   .11803
      2       100    γ=1    -.377425    .3562627  -.3859514   .3563656  -.3728753   .35644
      2       100    1.5    -.4841157   .2456835  -.5684879   .2466851  -.5850977   .24722
      2       500    γ=1    -.6477997   .1615617  -.6477113   .161647   -.6452857   .16166
      2       500    1.5    -.3358569   .1222584  -.3609435   .1223288  -.3619137   .12240
  • 12. The simulate approach: From the help file: ‘simulate eases the programming task of performing Monte Carlo-type simulations’ … ‘questionable’ to ‘no’.
  • 13. The simulate approach: If you haven’t used it, simulate works as follows:
      1. You write a program (rclass or eclass) that follows standard Stata syntax and returns quantities of interest as scalars.
      2. Your program will generate ≥1 simulated dataset and return estimates for ≥1 estimands obtained by ≥1 methods.
      3. You use simulate to repeatedly call the program.
  • 14. The simulate approach: I’ve wished-&-grumbled here and on Statalist that simulate:
      – Does not allow posting of the repetition number (an oversight?)
      – Precludes putting strings into the estimates dataset, meaning non-numerical inputs (D) and the contents of c(rngstate) cannot be stored.
      – Produces ultra-wide data (if E, M and D vary, the resulting estimates must be stored across a single row!)
      Your code is clean; your estimates dataset is a mess.
  • 15. The post approach. Structure:
      tempname tim
      postfile `tim' int(rep) str5(dgm estimand) ///
          double(theta se) using estimates.dta, replace
      forval i = 1/`nsim' {
          <1st DGM>
          <apply method>
          post `tim' (`i') ("thing") ("theta") (_b[trt]) ///
              (_se[trt])
          <2nd DGM>
      }
      postclose `tim'
  • 16. The post approach:
      + No shortcomings of simulate
      + Produces a well-formed estimates dataset
      – post commands become entangled in the code for generating and analysing data
      – post lines are more error-prone. Suppose you are using different n. An efficient way to code this is to generate a dataset (with n observations) and then use increasing subsets of these data in the analysis for the ‘smaller n’ data-generating mechanisms. The code can get inelegant and you mis-post.
      Your estimates dataset is clean; your code is a mess.
  • 17. The right approach: One can mash-up the two!
      1. Write a program, as you would with simulate
      2. Use postfile
      3. Call the program
      4. Post inputs and returned results using post
      5. Use a second postfile for storing rngstates
      Why?
      1. Appease Michael: Tidy code that is less error-prone.
      2. Appease Tim: Tidy estimates (and states) dataset that avoids error-prone reshaping & formatting acrobatics.
  • 18. A query (grumble?):
      • None of the options allow for a well-formatted dataset. I want to define a (unique) sort order, label variables & values, use chars… (for value labels, order matters; see below)
      • I believe this stuff has to be done afterwards (?)
      • To use 1 "Exponential" 2 "Weibull" and 3 "Cox" (I do), I have to open estimates.dta, label define and label values. Could this be done up-front so you could e.g. fill in DGM codes with “Cox”:method_label rather than number 2?
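
The mash-up described on slide 17 might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code (their version lives in the GitHub repository linked on slide 2). The program name `simsurv_one`, its option names, the returned scalar names, the seed, the small `nsim`, the file name `states.dta`, and the choice to split `c(rngstate)` across three fixed-length string variables are all hypothetical. The split assumes Stata 14 or later with the default mt64 generator, whose state string is a little over 5,000 characters. The last few lines show the after-the-fact labelling that slide 18 grumbles about.

```stata
* Illustrative sketch only: names, seed and nsim are assumptions
set seed 72789                 // arbitrary seed
local nsim 10                  // deliberately small, for illustration

* Step 1: an rclass program that simulates one dataset and fits three models
capture program drop simsurv_one
program define simsurv_one, rclass
    syntax [, obs(integer 100) gamma(real 1) lambda(real 0.1) theta(real -0.5)]
    drop _all
    set obs `obs'
    generate byte trt = rbinomial(1, 0.5)
    * Weibull times by inverting S(t) = exp(-lambda * t^gamma * exp(theta*trt))
    generate double t = (-ln(runiform()) / (`lambda' * exp(`theta' * trt)))^(1/`gamma')
    generate byte died = t < 5         // administrative censoring at 5 years
    replace t = 5 if !died
    stset t, failure(died)
    streg trt, distribution(exponential)
    return scalar theta_exp = _b[trt]
    return scalar se_exp    = _se[trt]
    streg trt, distribution(weibull)
    return scalar theta_wei = _b[trt]
    return scalar se_wei    = _se[trt]
    stcox trt
    return scalar theta_cox = _b[trt]
    return scalar se_cox    = _se[trt]
end

* Steps 2 and 5: one postfile for estimates, a second for RNG states
tempname est states
postfile `est' int(rep_id n_obs) double(gamma) byte(method) ///
    double(theta_hat se) using estimates.dta, replace
postfile `states' int(rep_id) str2000(rngstate1 rngstate2) str1100(rngstate3) ///
    using states.dta, replace

forvalues i = 1/`nsim' {
    * RNG state at the beginning of repetition i, stored in three pieces
    post `states' (`i') (substr(c(rngstate), 1, 2000)) ///
        (substr(c(rngstate), 2001, 2000)) (substr(c(rngstate), 4001, .))
    foreach n of numlist 100 500 {
        foreach g of numlist 1 1.5 {
            * Step 3: call the program; step 4: post inputs and returned results
            quietly simsurv_one, obs(`n') gamma(`g')
            post `est' (`i') (`n') (`g') (1) (r(theta_exp)) (r(se_exp))
            post `est' (`i') (`n') (`g') (2) (r(theta_wei)) (r(se_wei))
            post `est' (`i') (`n') (`g') (3) (r(theta_cox)) (r(se_cox))
        }
    }
}
* One extra state after the final repetition
post `states' (`nsim' + 1) (substr(c(rngstate), 1, 2000)) ///
    (substr(c(rngstate), 2001, 2000)) (substr(c(rngstate), 4001, .))
postclose `est'
postclose `states'

* Labelling and sort order still have to be applied afterwards
use estimates, clear
label define method_label 1 "Exponential" 2 "Weibull" 3 "Cox"
label values method method_label
sort rep_id n_obs gamma method
save estimates, replace
```

The result is in the long–long layout of slide 10: one row per repetition, data-generating mechanism and method, with the repetition number stored alongside the inputs. Concatenating the three `rngstate` pieces in order should recover a string that can be passed to `set rngstate` to reproduce a given repetition.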