This document summarizes a 5-year journey of using R as the sole statistical analysis software at a CRO. Some key points:
- Initially there were questions around whether R would be sufficient, what hidden costs there may be to using open source software, how to validate and organize the working environment, and which packages would be needed.
- After 5 years of experience, the CRO found that R mostly sufficed for their work but using open source "is not free" as time must be spent collecting tools, validating them, dealing with package failures, and reporting issues. This amounts to costs equivalent to commercial software licenses.
- The CRO developed a simple, automated, template-based workflow
1. Meet a 100% R-based CRO
The summary of a 5-year journey
Adrian Olszewski
Principal Biostatistician at 2KMM CRO
The R/Pharma 2022 Conference, Nov 10th 2022
10min
www.2kmm.eu
aolszewski@2kmm.pl
2. Disclaimer
This presentation shows the biostatistician’s perspective first.
Lots of exploratory research, involving tens of statistical tests, complex survival models and non-parametric
methods. Here producing TFLs is important but secondary. I need the NUMBERS first to populate them.
It’s not to „blame” or „unfairly criticize”.
My job is to analyse trial data on time and within budget.
If something does not work and I cannot meet the deadline, if fixing things exceeds the budget and the situation
seems hopeless, there’s no time for sentiment, ideology, or hiding problems. I will be held
accountable for the effects of my work, not for my „love for the tool”.
„You get what you pay for. It’s free software. Stop complaining.”
The fact that something is „for free” does not mean it cannot be improved. The first step is to admit that problems
exist, diagnose them and be honest. It takes a sober assessment of the situation to counteract effectively.
„So why do you „waste your time”? Buy XYZ® and be happy.”
I really want to make things better. I ❤️ R. If I did not, I’d have abandoned it in 2000. Things get better but
won’t „fix themselves” magically. Relationships can be tough, but that's no reason to give up! Besides, it’s fun!
3. Introduction ► Who we are
⦿ The 2KMM - a small Polish CRO with a global reach.
⦿ 100% R-based:
trial design • DM • datasets • analysis & research • TFL • documents •
consulting • tutoring • making tools
⦿ 28 projects: RCTs + observational studies (in several therapeutic areas)
lots of ad hoc research
4. Introduction ► Who we are
Our specifics:
• No CDISC yet ; data sources based on SQL views
• Lots of planned exploratory analyses with complex scenarios
• Sometimes asked to use dinosaur tools vs. the freshest, widely used methods
• Being a CRO, we do not have as much say in decisions as a big pharmaceutical company:
o a sponsor may have its own vision and require us to follow it
o our proposals may be questioned (sometimes without a discussion)
• Very diverse requests from different sponsors:
• make tables like X, make tables like Y
• use this format, use that format
• we prefer X, we HATE X. ABC is important vs. ABC is negligible vs. please decide
It’s difficult to work out a common approach, workflow, or template.
5. Introduction ► History
⦿ When we started, a few questions had to be answered
Can we rely on R entirely? Will it suffice? Everyone around uses SAS
What are the hidden costs of using open source (no free lunch)
Can we trust R? How to validate it?
What packages do we need to start? Collection of requirements
How to organize the working environment (SOPs, technical aspects)
In general – we were rather optimistic in 2018
6. Introduction ► Opinions
⦿ After 5 years we have some opinions:
Did R suffice to complete our work? Mostly…
Could we just „launch R and focus on the work”? Partially…
Could we trust R on faith? No.
Did we fail? Painfully.
What are the hidden costs of using open source? Non-negligible.
How many packages did we end up with? 230+
Describe the experience briefly? Annoyance, determination, fixing stuff, reporting issues, researching, satisfaction.
Are we happy with R? It’s a tough ♥
Will we stay with it? Yes.
Why? It’s flexible. It’s getting better. It’s worth it. We learned „HOW”.
7. Introduction ► Costs
It is not true that using free software does not cost a penny. It costs time
that could have been spent on the analysis and is instead spent on:
Collecting the library of necessary tools. That’s not easy; we will show why.
Validating the selected tools (making sure 2+2=4)
Realizing that an important package fails or has gone (hello, CRAN!), and then:
• contacting the authors or the entire group, reporting issues on GitHub
• searching for a replacement (+ validation), which may lack features
• if there is no response, researching the problem on your own
Paying for external consultancy, books, pay-walled articles to move on
8. Introduction ► Costs
How much did it cost?
An equivalent of a few 1yr licenses of a „good commercial software”.
Wait, what!? So where are the savings then?!
1. The cost is distributed over time (a year, say)
2. Such a big cost is rather one-off - at the beginning of the process
Occasional costs will take place, though (new versions, „retired packages”)
3. You get what you need (mostly), not what others decide you need
4. Once done, it can be reused an unlimited number of times (no per-user licenses)
5. You control what you have better, because you are the one who made it.
6. You get the code, so there is at least a small chance of fixing things with your own hands
7. As long as your repository (library) is validated and frozen, you sleep well.
9. Introduction ► Costs
“Oh, c’mon. You have all the code! It’s open source! Why don’t you just fix the
problems and go back to work? What’s the problem? I think you exaggerate!”
Resources (staff + time + money) allocated to “employ the Open Source”:
Big company: 15 specialists, X$
Small company: 2 specialists, Y$, Y << X
10. Introduction ► Step 1: organization of work ; technical infrastructure + SOPs
[Diagram labels: projects; R 3.x; VHDX container; V, P, N; „wild” and „validated”; SOP (four times)]
11. Introduction ► Step 1: organization of work ; portable R
https://sourceforge.net/projects/rportable/
Allows one to test new stuff and mix different
versions of R core in a single analysis.
Easy – no installation (VMs, containers), no extra
packages / dependencies / setups
No elevated rights needed
Regular directories – easy management!
When „matured”, can be packed into a VHDX container.
Easily selectable as the current engine in RStudio:
13. Introduction ► Step 1: organization of work ; portable R
This allows us to mix, in a single analysis, not only packages in different versions (with all necessary dependencies) but also versions of the R core itself, when a certain package needs a higher or lower version of the R core.
Combines RPortable + Rscript.exe + a naming convention for [input data]-[output results] files.
Each piece of version-dependent code knows where to read the data from and where to store the results.
The pieces of code are fully isolated. Data are exchanged via regular R objects (RDS or feather).
Warning in install.packages :
package ‘emmeans’ is not available (for R version 3.6.3)
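The mechanism described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the presenter's actual setup: `other_rscript` is a hypothetical variable that would point at the `Rscript.exe` of a second RPortable directory (here it falls back to the Rscript of the running R, so the sketch can be executed anywhere), and the `step1-input` / `step1-output` file names are made up to mimic the [input data]-[output results] convention.

```r
# Sketch: run a version-dependent step in a separate R process and
# exchange data through RDS files. All paths and names are illustrative.

# Hypothetical: in a real setup this would point at another R version,
# e.g. an Rscript.exe inside an RPortable directory. Here: the running R.
other_rscript <- file.path(R.home("bin"), "Rscript")

work_dir    <- tempfile("mixed_r_")
dir.create(work_dir)
input_file  <- file.path(work_dir, "step1-input.rds")
output_file <- file.path(work_dir, "step1-output.rds")
step_script <- file.path(work_dir, "step1.R")

# 1. The main session stores the input data.
saveRDS(data.frame(x = 1:10, y = (1:10)^2), input_file)

# 2. The isolated step is told where to read from and where to write to.
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)",
  "d <- readRDS(args[1])",
  "res <- list(r_version = R.version.string, mean_y = mean(d$y))",
  "saveRDS(res, args[2])"
), step_script)

# 3. Run the step with the chosen R engine.
status <- system2(other_rscript,
                  args = c("--vanilla", shQuote(step_script),
                           shQuote(input_file), shQuote(output_file)))

# 4. Read the results back into the main session.
result <- readRDS(output_file)
```

Because the two processes share nothing but files, the step script could run under an R version in which a package such as emmeans is available, while the main session stays on the validated version.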
16. Introduction ► Step 2: Simple, automated template-based workflow
DOCx template
- Headers, footers
- Styles
- Content placeholders (definitions)
DOCx report
- Headers, footers preserved
- Styles utilized
- Placeholders hold actual T/F/Ls
HTML log
- All R commands
- All messages
- All (simplified) results
Rmarkdown „manager”
- Reads the DOCx template for TODOs
- Does the „TODOs”
- Replaces „TODOs” with TFLs
- Becomes the HTML LOG
[Figure: a template with header, footer, title and placeholder definitions; a report with header, footer and a filled table; a log for „Trial ABC” (author, date) listing commands such as print("Hi!") and their output]
17. Introduction ► Step 2: Simple, automated template-based workflow
RMarkdown file
- Creates the environment
- Reads the DOCx template
- Loads the Word parsing „engine” (the DOCx reading engine)
- The engine:
  - iterates through definitions of placeholders
  - parses the fields
  - loads the R files per convention
  - executes the code
  - replaces placeholders with actual TFLs
- Auto-updates (appends) the HTML to LOG
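## Preparing the objects storing the content of the report in both MS Word and MS Excel formats
```{r}
library(officer)    # read_docx(), docx_summary(), print() for Word documents
library(openxlsx)   # createWorkbook(), saveWorkbook()

# File names derived from the base name of the report
word_report_document_name <- paste0(target_report_document_name, ".docx")
excel_report_document_name <- paste0(target_report_document_name, ".xlsx")
word_report_template_name <- paste0(target_report_document_name, "_template.docx")

# Read the Word document and index its content; the placeholder
# definitions are later looked up in doc_content
doc_report <- read_docx(word_report_document_name)
doc_content <- docx_summary(doc_report)

# Empty workbook that collects the Excel version of the results
xls_report <- createWorkbook()
```
# Data analysis
```{r child="rendering_engine.rmd", echo=TRUE, results='asis'}
```
```{r}
# Write both reports to disk
print(doc_report, target = word_report_document_name)
saveWorkbook(wb = xls_report, file = excel_report_document_name, overwrite = TRUE)
```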
```{r}
# Collect the placeholder definitions: paragraphs starting with "[Table]"
table_defs <- subset(doc_content, grepl("^\\[Table\\]", doc_content$text), text)
table_defs <- gsub("\\[Table\\] ", "", table_defs$text)
for (def in table_defs) {
  # Fields are separated by "@"; the text before the first "@" is dropped
  split_defs <- strsplit(def, "@")[[1]][-1]
  # Extract the value of each field (everything after "name:")
  table_title <- trimws(gsub("title:(.*)", "\\1", split_defs[grep("^title", split_defs)]))
  table_number <- trimws(gsub("table_num:(.*)", "\\1", split_defs[grep("^table_num", split_defs)]))
  force_table_num <- trimws(gsub("force_table_num:(.*)", "\\1", split_defs[grep("^force_table_num", split_defs)]))
  table_sufix <- trimws(gsub("table_sufix:(.*)", "\\1", split_defs[grep("^table_sufix", split_defs)]))
  r_file <- trimws(gsub("r_code:(.*)", "\\1", split_defs[grep("^r_code", split_defs)]))
  r_prn_file <- trimws(gsub("r_printer_code:(.*)", "\\1", split_defs[grep("^r_printer_code", split_defs)]))
  exclude <- trimws(gsub("exclude:(.*)", "\\1", split_defs[grep("^exclude", split_defs)]))
  table_title <- iconv(table_title, from = "UTF-8", to = "UTF-8")
  exclude <- ifelse(identical(exclude, character(0)), FALSE, as.logical(exclude))
  # ……………………………………………
  # No r_code field given: fall back to the naming convention
  if (identical(r_file, character(0)) || r_file == "") {
    r_file <- paste0("Table", table_number, table_sufix, ".r")
  }
  # ……………………………………………
  r_file <- file.path(r_code_location, r_file)
  # Build a child chunk (heading + code of the R file) and knit it into the log
  chunk <- c(paste("#### Table ", paste0(table_number, table_sufix), "-", table_title, "\n"),
             paste("```{r ", r_file, "}\n"),
             readLines(r_file),
             "```\n")
  cat(knitr::knit_child(text = chunk, quiet = TRUE), sep = '\n')
  # ……………………………………………
}
```
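To make the field parsing easier to follow, here is a small self-contained sketch that parses a single placeholder definition. The example string, the helper name `parse_placeholder()` and the choice to return a named list are assumptions made for illustration; only the "[Table]" marker, the "@" separator and the field names come from the engine code above.

```r
# Sketch: parse one placeholder definition of the form
# "[Table] @key: value @key: value ..." into a named list.
# The example string and the helper name are illustrative assumptions.
parse_placeholder <- function(def) {
  def    <- sub("^\\[Table\\]\\s*", "", def)            # drop the "[Table]" marker
  fields <- strsplit(def, "@", fixed = TRUE)[[1]][-1]   # text before the first "@" is dropped
  keys   <- trimws(sub(":.*$", "", fields))             # part before the first ":"
  values <- trimws(sub("^[^:]*:", "", fields))          # part after the first ":"
  as.list(setNames(values, keys))
}

def <- "[Table] @title: Demographics @table_num: 01 @table_sufix: a @exclude: FALSE"
p   <- parse_placeholder(def)

# Default R file name when no r_code field is given, mirroring the engine above
r_file <- if (is.null(p$r_code) || p$r_code == "") {
  paste0("Table", p$table_num, p$table_sufix, ".r")
} else {
  p$r_code
}
```

Returning a list keyed by field name avoids one `gsub()`/`grep()` pair per field, at the price of accepting any field name without checking it.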
20. Introduction ► Step 2: Simple, automated template-based workflow
Regular R files. Naming convention. Triplet per table.
Suffix:
_data   reads data from RDATA / DBI / XML / CSV / XLSX
_an     performs the analyses; stores results in RDATA
_print  reads the RDATA, generates DOCx tables, XLSx files, EMF graphs and HTML output for the LOG
Example: Table_01_data.r, Table_01_an.r, Table_01_print.r
𝑦 = 𝛽0 + 𝛽1X
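A minimal sketch of how such a triplet could be executed in order is shown below. The helper name `run_table_triplet()` is hypothetical, and, as a simplification, the three steps share one environment instead of exchanging results through RDATA files as described on the slide.

```r
# Sketch: run the _data / _an / _print triplet for one table, in order.
# File names follow the Table_01_data.r pattern from the slide; the helper
# name and the shared environment are illustrative assumptions.
run_table_triplet <- function(table_id, code_dir, env = new.env()) {
  for (step in c("data", "an", "print")) {
    r_file <- file.path(code_dir, sprintf("Table_%s_%s.r", table_id, step))
    if (!file.exists(r_file)) stop("Missing file: ", r_file)
    source(r_file, local = env)   # each step sees objects left by the previous one
  }
  invisible(env)
}

# Usage with three tiny stand-in scripts written to a temporary directory
code_dir <- tempfile("tables_")
dir.create(code_dir)
writeLines("d <- data.frame(arm = c('A', 'B'), n = c(10, 12))",
           file.path(code_dir, "Table_01_data.r"))
writeLines("total_n <- sum(d$n)",
           file.path(code_dir, "Table_01_an.r"))
writeLines("printed <- paste('Total N =', total_n)",
           file.path(code_dir, "Table_01_print.r"))

env <- run_table_triplet("01", code_dir)
```

Splitting each table into three files means that a change in layout only requires re-running the print step, provided the analysis results have been stored.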
23. Introduction ► Step 3: defining tasks → finding tools → making a library
• Modelling, longitudinal analysis
• Inference (testing, CIs, MCP)
• Summaries
• Effect size
• Advanced survival
• Making complex tables
• Dose – Response
• PK, PD, DF
• Questionnaires
• Generating documents (DOCx, RTF, PDF)
• Documenting (log) the analysis
• Data I/O
• Technical / Programming
• Trial design & simulation
• Plotting
• Randomization
• Data manipulation
• Meta-analysis
• CDISC-related
• Missing data – patterns and imputation
• Model diagnostics
27. Introduction ► Step 3: defining tasks finding tools Hall of fame! (incomplete!)
29. Introduction ► So many sources of packages!
Packages come from:
• GitHub
• CRAN
• CRAN archive
• RForge
• External (PKfit)
• Bioconductor
Consequences:
• Versions may differ
• Different ways of reporting issues
32. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
As long as you use only the basic tools, problems may never occur.
And this scope may be sufficient for quite a lot of scenarios!
• “group by” summaries with N, %, mean, median, SD, Q1, Q3, min, max…
• aov()
• kruskal.test(), wilcox.test(), t.test()…
• lmer(post_value ~ treatment * time + baseline + baseline:time + (1|PatID))
• plot(survfit(Surv(time, status) ~ treatment))
BTW, did you see median()?
Is it equal to quantile()[“50%”]? Always?
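One possible reading of this question, offered here as an assumption rather than as the presenter's own example, is that `quantile()` implements nine different definitions (the `type` argument), so the "50% quantile" depends on which one is used, and other software may default to a different definition than R. A minimal sketch:

```r
# Sketch: the sample median versus the 50% quantile under different definitions.
x <- c(1, 2, 3, 4)                   # even number of observations

med <- median(x)                     # mean of the two middle values
q7  <- quantile(x, 0.5, type = 7)    # R's default definition
q1  <- quantile(x, 0.5, type = 1)    # inverse of the empirical CDF
q4  <- quantile(x, 0.5, type = 4)    # linear interpolation of the empirical CDF

print(c(median = med, type7 = unname(q7), type1 = unname(q1), type4 = unname(q4)))

# Compare with a tolerance rather than with ==, since results are floating-point
same_as_default <- isTRUE(all.equal(med, unname(q7)))
```

When comparing results across tools, it is therefore worth recording which quantile definition each tool used before treating a difference as an error.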
34. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
Like it or not, the fact is that SAS® IS the industry standard in clinical trials, and
people will use it to re-create your analyses and will NATURALLY ask questions if the
numbers don’t agree.
Who may re-run your analysis in SAS®:
• Regulatory agency
• Journal
• Sponsor-side biostat team
• Your colleague
• Validator
- it’s not about a “crusade”: “R is better! No! SAS® is better! No! Excel is better!”
- it’s not about favoring anyone (“you think it’s better because it’s expensive!?”)
- it’s about the reality.
35. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
If they ask you about the discrepancies, you can:
1. ignore it (can you?),
2. say „I don’t know, it just happened, but R is right!”
3. investigate it and respond:
1. both are right, just different approach 🤷
2. well, R is wrong, I’m gonna fix it or message the authors
But to respond – you need to know what happened.
A much worse situation: NOBODY found a difference, and you just
published the results with errors.
36. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
We do not care whether a package has „good marketing”. It must work well.
Has vignettes!
Has active community!
5 in rankings. YouTube tutorials.
Top popular download on GitHub
Has unfixed errors that nobody cares…
37. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
• nlme: Priority: recommended; linear mixed models with almost all the stuff SAS has
• MCPMod: design and analysis of dose-finding studies
• PMCMRplus: lots of popular non-parametric stuff, dose-response findings (Williams)
• MASS: Priority: recommended; lots of stuff, including glmmPQL!
• boot: Priority: recommended
• nparcomp: lots of non-parametric methods
• frailtySurv: shared frailty models
• rms: strategies for regression modeling by Prof. Harrell
• geesmv: Morel’s small-sample correction for the GEE sandwich SEs
• ipw: inverse-probability weighting, for GEE under MAR
• multxpert: common multiple testing procedures and gatekeeping procedures by Prof. Dmitrienko
• PropCIs: a must-have, CIs for proportions
• PKfit: one of the most important tools for PK; not even on CRAN
• bear: one of the best available tools for PK; not even on CRAN
• cmprsk: survival with competing risks
These packages have no marketing. Would you exclude them from your toolkit?
38. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7327187/
39. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
And the real problem is that R disagrees not only with SAS, but even… with itself
40. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
After failing several times, we finally decided to validate as much as possible.
This consumes a lot of time and effort, but it lets us sleep better.
Package  Function  Version  Dataset    Test  Completed  Soft1   Soft2   Soft3   Discrep.  Decision  Justif.
pkg1     fn1()     0.6.2    Trial 1    #23   OK         OK      OK      FAILED  ………….     OK        ………
pkg1     fn1()     0.6.2    Journal 2  #23   OK         OK      FAILED  OK      …………      FAILED    ………
pkg1     fn1()     0.6.2    Journal 2  #24   OK         OK      OK      OK      …………      OK        ………
pkg1     fn2()     0.6.2    Journal 2  #25   OK         FAILED  FAILED  FAILED  …………      FAILED    ………
Validation sources:
• Reference software
• Textbook formulas, computed by hand
• Other trusted package
• Published results: journals/books
• Published results: manuals
• Code inspection
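As an illustration of the "textbook formulas, by hand" route, the sketch below checks the statistic returned by `t.test()` against the pooled-variance formula and records the outcome in a one-row log. The helper `check_against_reference()`, the column names and the data are hypothetical; they are not the presenter's validation log format.

```r
# Sketch: validate a function against a textbook formula computed by hand.
# Helper, column names and data are illustrative assumptions.
check_against_reference <- function(label, observed, reference, tol = 1e-8) {
  agrees <- isTRUE(all.equal(observed, reference, tolerance = tol))
  data.frame(
    test      = label,
    observed  = observed,
    reference = reference,
    status    = if (agrees) "OK" else "FAILED"
  )
}

a <- c(5.1, 4.9, 6.2, 5.8, 6.0)
b <- c(4.2, 4.8, 5.0, 4.4, 4.9)

# Textbook pooled-variance two-sample t statistic, computed by hand
sp2    <- ((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
          (length(a) + length(b) - 2)
t_hand <- (mean(a) - mean(b)) / sqrt(sp2 * (1 / length(a) + 1 / length(b)))

# The same statistic from the function under validation
t_r <- unname(t.test(a, b, var.equal = TRUE)$statistic)

validation_log <- check_against_reference("t.test, pooled variance", t_r, t_hand)
```

Rows produced this way can be bound together with `rbind()` into a table similar to the one above, one row per package, function, dataset and test.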
41. Challenges ► Ocean of possibilities. But be careful! It’s deep!
Open Source gives you an ocean of possibilities (for doing THE SAME thing)!
OK! Diversity is good overall, but not when overdone! Let’s imagine I want
procedure ABC. R has 10 functions in 5 packages to do ABC in 8 ways. My
day has only 24 hours, and I have my work and a life after hours.
42. Challenges ► Documentation
Documentation quality varies a lot: from dedicated web-books with numerous
examples ( https://ardata-fr.github.io/flextable-book/ ) to a raw manual with
no formulas and references to a paid article or a rare book.
SAS, NCSS, SPSS, Stata have awesome tutorials and manuals, almost
courses in statistics. NCSS even gives the input data and results!
[Two examples shown: „Just a basic manual” vs. „You can do PhD with it!”]
43. Challenges ► Why cannot things be simple?
SAS®: PROC MIXED EMPIRICAL … REPEATED … CS … KR … LSMEANS
R: Kenward-Roger? … a-ha! Use lme4! But wait, I want a marginal model with CS. Random-intercept ≠ CS for negative within-subject correlations! I could use glmmTMB for this, but pbkrtest doesn’t support it.
But there’s nlme! Take nlme::gls(). But pbkrtest doesn’t support nlme::gls().
OK then, let’s use Satterthwaite!
OK. nlme::gls() + emmeans (for LS-means + Satterthwaite). Now I want the robust HC0 („sandwich”) estimator. Get clubSandwich and use emmeans to provide the adjusted Var-Cov. Follow it with emmeans::joint_tests(). Double check the DF, as car::Anova() may have a problem here.
Done! Sigh! … Did you check on GitHub whether there are any open issues?
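The remark that a random intercept is not equivalent to compound symmetry (CS) when the within-subject correlation is negative can be checked with nlme alone. The sketch below uses simulated data (an assumption made for illustration; the seed, sample size and variable names are arbitrary) and does not reproduce the full chain with emmeans and clubSandwich described above.

```r
# Sketch: compound symmetry (gls) vs. random intercept (lme)
# on simulated data with negative within-subject correlation.
library(nlme)

set.seed(2022)
n_subj <- 60
z <- rnorm(n_subj)
# Two measurements per subject that tend to move in opposite directions
d <- data.frame(
  id = factor(rep(seq_len(n_subj), each = 2)),
  y  = as.vector(rbind(z + rnorm(n_subj), -z + rnorm(n_subj)))
)

# Marginal model with compound symmetry: the correlation may be negative
fit_cs <- gls(y ~ 1, data = d, correlation = corCompSymm(form = ~ 1 | id))
rho_cs <- coef(fit_cs$modelStruct$corStruct, unconstrained = FALSE)

# Random-intercept model: the implied correlation is a ratio of variances,
# so it cannot go below zero. returnObject = TRUE keeps the fit even if the
# optimiser complains about the boundary solution.
fit_ri <- lme(y ~ 1, random = ~ 1 | id, data = d,
              control = lmeControl(returnObject = TRUE))
vc     <- as.numeric(VarCorr(fit_ri)[, "Variance"])
icc_ri <- vc[1] / sum(vc)

print(c(rho_cs = unname(rho_cs), icc_ri = icc_ri))
```

The CS model can report a negative correlation, while the random-intercept model can only push the between-subject variance towards zero, which is why the two are not interchangeable here.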
Statistics UX
47. Challenges ► Let’s combine it together!
[Flowchart. Boxes: START!; Package removed from CRAN; Search for a replacement; Learn the new package; „What does this thing do?!”; Something is wrong!; Email the author…; Create new issue; FIXED?; What now!?; Another package is needed. Responses along the way: „It works!”; „Sorry, I’m busy.”; „No, it doesn’t”; „Partially managed…”]
48. Future plans
We plan:
• To research a couple of new tools:
• For work: MMRM (!)
• For CDISC: admiral, sassy, definer, metacore
• For RTF: rtftables, gt
• Out of curiosity: tplyr
• For technical work: box
• To focus on CDISC and on preparing for the first big submission.
• To extend the numerical validation of packages
49. Overall impression
Employing Open Source means accepting the consequences.
The effort, costs and extra work cannot be taken lightly in a small CRO.
But it is definitely worth the effort.
In moments of doubt, it’s good to remember that nothing big comes easy.
R is and will be our friend. Even if a demanding one.