This document summarizes a 5-year journey of using R as the sole statistical analysis software at a CRO. Some key points:
- Initially there were questions around whether R would be sufficient, what hidden costs there may be to using open source software, how to validate and organize the working environment, and which packages would be needed.
- After 5 years of experience, the CRO found that R mostly sufficed for their work but using open source "is not free" as time must be spent collecting tools, validating them, dealing with package failures, and reporting issues. This amounts to costs equivalent to commercial software licenses.
- The CRO developed a simple, automated, template-based workflow: a DOCx template with content placeholders, an RMarkdown "manager" that executes the per-table R code and fills the placeholders with TFLs, and an HTML log of the whole run.
Meet a 100% R-based CRO - The summary of a 5-year journey
1. Meet a 100% R-based CRO
The summary of a 5-year journey
Adrian Olszewski
Principal Biostatistician at 2KMM CRO
The R/Pharma 2022 Conference, Nov 10th 2022
10min
www.2kmm.eu
aolszewski@2kmm.pl
2. Disclaimer
This presentation shows the biostatistician’s perspective first.
Lots of exploratory research, involving tens of statistical tests, complex survival models and non-parametric methods. Here producing TFLs is important but secondary. I need the NUMBERS first to populate them.
It’s not to „blame” or „unfairly criticize”.
My job is to analyze the trial’s data on time and within the budget.
If something does not work so I cannot meet the deadline, fixing things exceeds the budget, or the situation seems hopeless – there’s no time for sentiments, ideology, and hiding problems. I’m going to be held accountable for the effects of my work, not for my „love for the tool”.
„You get what you pay for. It’s free software. Stop complaining.”
The fact that something is „for free” does not mean it cannot be improved. The first step is to admit problems exist, diagnose them, and be honest. It takes a sober assessment of the situation to counteract effectively.
„So why do you „waste your time”? Buy XYZ® and be happy.”
I really want to make things better. I ❤️ R. If I did not, I’d have abandoned it in 2000. Things get better, but they won’t „fix themselves” magically. Relationships can be tough, but that's no reason to give up! Besides, it’s fun!
3. Introduction ► Who we are
⦿ The 2KMM - a small Polish CRO with a global reach.
⦿ 100% R-based:
trial design • DM • datasets • analysis & research • TFL • documents • consulting • tutoring • making tools
⦿ 28 projects: RCTs + observational studies (in several therapeutic areas) + lots of ad hoc research
4. Introduction ► Who we are
Our specifics:
• No CDISC yet; data sources based on SQL views
• Lots of planned exploratory analyses with complex scenarios
• Sometimes asked to use dinosaur tools instead of the freshest, widely adopted methods
• Being a CRO, we are not as powerful in decisions as a big pharmaceutical company:
o a Sponsor may have its own vision and demand that we follow it
o our proposals may be questioned (sometimes without a discussion)
• Very differentiated requests from different sponsors:
• make tables like X / make tables like Y
• use this format / use that format
• we prefer X / we HATE X; ABC is important vs. ABC is negligible vs. please decide
It’s difficult to work out a common approach, workflow, template.
5. Introduction ► History
⦿ When we started, a few questions had to be answered:
Can we rely on R entirely? Will it suffice? Everyone around uses SAS.
What are the hidden costs of using open source (no free lunch)?
Can we trust R? How to validate it?
What packages do we need to start? Collection of requirements.
How to organize the working environment (SOPs, technical aspects)?
In general – we were rather optimistic in 2018.
6. Introduction ► Opinions
⦿ After 5 years we have some opinions:
Did R suffice to complete our work? Mostly…
Could we just „launch R and focus on the work”? Partially…
Could we trust R on faith? Did we fail? No. / Painfully.
What are the hidden costs of using open source? Non-negligible.
How many packages did we end up with? 230+
Describe the experience briefly? Annoyance, determination, fixing stuff, reporting issues, researching, satisfaction.
Are we happy with R? Will we stay with it? It’s a tough ♥ / Yes.
Why? It’s flexible. It’s getting better. It’s worth it. We learned „HOW”.
7. Introduction ► Costs
It is not true that using free software does not cost a penny. It costs time that one could spend doing the analysis, spent instead on:
Collecting the library of necessary tools. That’s not easy – I will show why.
Validating the selected tools (making sure 2+2=4).
Realizing that an important package fails or has gone (hello, CRAN!):
contacting the authors or the entire group, reporting issues on GitHub,
searching for a replacement (+ validation) – it may lack features,
if no response – researching the problem on your own.
Paying for external consultancy, books, pay-walled articles to move on.
8. Introduction ► Costs
How much did it cost?
An equivalent of a few 1-year licenses of a „good commercial software”.
Wait, what!? So where are the savings then?!
1. The cost is distributed over time (a year, say).
2. Such a big cost is rather one-off – at the beginning of the process. Occasional costs will occur, though (new versions, „retired” packages).
3. You get what you need (mostly), not what others decide you need.
4. Once done – it can be reused an infinite number of times (no per-user licenses).
5. You better control what you have – because you are the one who made it.
6. You get the code – at least a little chance to fix things with your own hands.
7. As long as your repository (library) is validated and frozen – you sleep well.
9. Introduction ► Costs
“Oh, c’mon. You have all the codes! It’s open source! Why don’t you just fix the problems and go back to work? What’s the problem? I think you exaggerate!”
Resources (staff + time + money) allocated to “employ the Open Source”:
Big company: 15 specialists, X$
Small company: 2 specialists, Y$, where Y << X
10. Introduction ► Step 1: organization of work ; technical infrastructure + SOPs
[Diagram: the working environment – project directories and R 3.x engines kept in a VHDX container, accessed over VPN; a „wild” and a „validated” package repository, each governed by SOPs]
11. Introduction ► Step 1: organization of work ; portable R
https://sourceforge.net/projects/rportable/
Allows one to test new stuff and mix different versions of the R core in a single analysis.
Easy – no installation (VMs, containers), no extra packages / dependencies / setups.
No elevated rights needed.
Regular directories – easy management!
When „matured”, it can be packed into a VHDX container.
Easily selectable as the current engine in RStudio.
13. Introduction ► Step 1: organization of work ; portable R
This allows us to mix not only packages in different versions (with all necessary dependencies) in a single analysis, but also to mix versions of the R core itself, when a certain package needs a higher/lower version of the R core.
It combines RPortable + rscript.exe + a convention of naming [input data]-[output results] files.
Each version-dependent code knows where to read the data from and where to store the results.
Fully isolated codes. Data exchanged via regular R objects (RDS or feather).
Warning in install.packages :
package ‘emmeans’ is not available (for R version 3.6.3)
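A minimal sketch of that dispatch mechanism (all paths, file names and data here are illustrative, not the exact 2KMM setup):

```r
# Dispatch one version-dependent step to a different portable R engine and
# exchange the data via RDS files named per the [input data]-[output results]
# convention. Paths and names are hypothetical.
analysis_data <- readRDS("trial_abc.rds")   # hypothetical source data

input_file  <- "Table_01-input.rds"
output_file <- "Table_01-output.rds"

saveRDS(analysis_data, input_file)          # hand the data over to the isolated code

system2("D:/RPortable/R-3.6.3/bin/Rscript.exe",   # run under, e.g., portable R 3.6.3
        args = "Table_01_an_R363.r")              # the script knows both file names

results <- readRDS(output_file)             # pick up what the isolated code produced
```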
16. Introduction ► Step 2: Simple, automated template-based workflow
DOCx template: headers, footers, styles, content placeholders (the definitions).
Rmarkdown „manager”:
- reads the DOCx template for „TODOs” (the placeholder definitions),
- does the „TODOs”,
- replaces the „TODOs” with TFLs,
- becomes the HTML LOG.
DOCx report: headers and footers preserved, styles utilized, placeholders hold the actual T/F/Ls.
HTML log: all R commands, all messages, all (simplified) results.
17. Introduction ► Step 2: Simple, automated template-based workflow
RMarkdown file:
- Creates the environment
- Reads the DOCx template
- Loads the Word parsing „engine”
- The engine:
- iterates through the definitions of placeholders,
- parses the fields,
- loads the R files per convention,
- executes the code,
- replaces placeholders with actual TFLs
- Auto-updates (appends) the HTML LOG
18. Introduction ► Step 2: Simple, automated template-based workflow
The same RMarkdown „manager” as above, now with its actual code (read_docx()/docx_summary() come from officer, createWorkbook()/saveWorkbook() from openxlsx):

## Preparing the objects storing the content of the report in both MS Word and MS Excel formats
```{r}
library(officer)   # read_docx(), docx_summary(), print() for the DOCx output
library(openxlsx)  # createWorkbook(), saveWorkbook()

word_report_document_name  <- paste0(target_report_document_name, ".docx")
excel_report_document_name <- paste0(target_report_document_name, ".xlsx")
word_report_template_name  <- paste0(target_report_document_name, "_template.docx")

doc_report  <- read_docx(word_report_document_name)
doc_content <- docx_summary(doc_report)
xls_report  <- createWorkbook()
```
# Data analysis
```{r child="rendering_engine.rmd", echo=TRUE, results='asis'}
```
```{r}
print(doc_report, target = word_report_document_name)
saveWorkbook(wb = xls_report, file = excel_report_document_name, overwrite = TRUE)
```
19. Introduction ► Step 2: Simple, automated template-based workflow
Inside rendering_engine.rmd – the engine that iterates over the „[Table]” placeholder definitions (excerpt; the „……” elisions are in the original, and the backslashes lost in the slide export are restored):

```{r}
table_defs <- subset(doc_content, grepl("^\\[Table\\]", doc_content$text), text)
table_defs <- gsub("\\[Table\\] ", "", table_defs$text)

for (def in table_defs) {
  # each definition is a set of "@field: value" pairs
  split_defs <- strsplit(def, "@")[[1]][-1]

  table_title     <- trimws(gsub("title:(.*)",           "\\1", split_defs[grep("^title",           split_defs)]))
  table_number    <- trimws(gsub("table_num:(.*)",       "\\1", split_defs[grep("^table_num",       split_defs)]))
  force_table_num <- trimws(gsub("force_table_num:(.*)", "\\1", split_defs[grep("^force_table_num", split_defs)]))
  table_sufix     <- trimws(gsub("table_sufix:(.*)",     "\\1", split_defs[grep("^table_sufix",     split_defs)]))
  r_file          <- trimws(gsub("r_code:(.*)",          "\\1", split_defs[grep("^r_code",          split_defs)]))
  r_prn_file      <- trimws(gsub("r_printer_code:(.*)",  "\\1", split_defs[grep("^r_printer_code",  split_defs)]))
  exclude         <- trimws(gsub("exclude:(.*)",         "\\1", split_defs[grep("^exclude",         split_defs)]))

  table_title <- iconv(table_title, from = "UTF-8", to = "UTF-8")
  exclude     <- ifelse(identical(exclude, character(0)), FALSE, as.logical(exclude))
  ……………………………………………
  if (identical(r_file, character(0)) || r_file == "") {
    r_file <- paste0("Table", table_number, table_sufix, ".r")
  }
  ……………………………………………
  r_file <- file.path(r_code_location, r_file)

  chunk <- c(paste("#### Table ", paste0(table_number, table_sufix), "-", table_title, "\n"),
             paste("```{r ", r_file, "}\n"),
             readLines(r_file),
             "```\n")
  cat(knit_child(text = chunk, quiet = TRUE), sep = '\n')
  ……………………………………………
}
```
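From that parsing logic one can infer what a placeholder paragraph in the DOCx template might look like. A hypothetical example (the field names come from the code above; the values are invented):

```r
# A made-up placeholder as it could appear in the template, fed through the
# same parsing steps as in the engine:
def <- "[Table] @title: Demographics @table_num: 01 @r_code: Table_01_print.r"

split_defs <- strsplit(sub("\\[Table\\] ", "", def), "@")[[1]][-1]
trimws(gsub("title:(.*)", "\\1", split_defs[grep("^title", split_defs)]))
#> [1] "Demographics"
```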
20. Introduction ► Step 2: Simple, automated template-based workflow
DOCx template: headers, footers, styles, content placeholders (the definitions).
Regular R files. Naming convention. A triplet per table:
_data – reads data from RDATA / DBI / XML / CSV / XLSX
_an – performs the analyses; stores the results in RDATA
_print – reads the RDATA; generates the DOCx tables, XLSX files, EMF graphs and the HTML output for the LOG
Table_01_data.r → Table_01_an.r → Table_01_print.r
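A hedged sketch of what such a triplet could contain (the statistical content, object and file names are invented; only the naming convention and the I/O formats come from the slide):

```r
## Table_01_data.r - reads the input data and hands it over as an R object
trial_df <- readRDS("trial_abc.rds")      # could equally come from DBI / XML / CSV / XLSX
saveRDS(trial_df, "Table_01-input.rds")

## Table_01_an.r - performs the analysis, stores the results in RDATA
trial_df <- readRDS("Table_01-input.rds")
fit <- lm(post_value ~ treatment + baseline, data = trial_df)  # illustrative model
save(fit, file = "Table_01.RData")

## Table_01_print.r - reads the RDATA, renders the outputs
load("Table_01.RData")
tbl <- flextable::flextable(as.data.frame(summary(fit)$coefficients))  # DOCx-ready table
```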
21. Introduction ► Step 2: Simple, automated template-based workflow
[Workflow diagram repeated: the DOCx template (definitions, header, footer, title) becomes the DOCx report and the HTML log]
22. Introduction ► Step2: Simple, automated template-based workflow
definitions
definitions
Header
Footer
Title
Header
Footer
Report
ID A B C
1 A B C
2 A B C
library(…)
library(…)
library(…)
…..
…..
…..
…..
Trial ABC
LOG
Author: xxx Date: xxxx
print(„Hi!”)
[1] Hi!
23. Introduction ► Step 3: defining tasks ► finding tools ► making a library
• Modelling, longitudinal analysis
• Inference (testing, CIs, MCP)
• Summaries
• Effect size
• Advanced survival
• Making complex tables
• Dose–response; PK, PD, DF
• Questionnaires
• Generating documents (DOCx, RTF, PDF)
• Documenting (logging) the analysis
• Data I/O
• Technical / programming
• Trial design & simulation
• Plotting
• Randomization
• Data manipulation
• Meta-analysis
• CDISC-related
• Missing data – patterns and imputation
• Model diagnostics
27. Introduction ► Step 3: defining tasks ► finding tools ► Hall of fame! (incomplete!)
28. Introduction ► Step 3: defining tasks ► finding tools ► Hall of fame! (incomplete!)
29. Introduction ► So many sources of packages!
Package sources:
• CRAN
• CRAN archive
• GitHub
• RForge
• Bioconductor
• External (PKfit)
Consequences: versions may differ; different ways of reporting issues.
32. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
As long as somebody uses just the basic tools, problems may never occur.
And this scope may be sufficient for quite a lot of scenarios!
• „group by” summaries with N, %, mean, median, SD, Q1, Q3, min, max…
• aov()
• kruskal.test(), wilcox.test(), t.test()…
• lmer(post_value ~ treatment * time + baseline + baseline:time + (1|PatID))
• plot(survfit(Surv(time, status) ~ treatment))
BTW, did you see median()?
Is it equal to quantile()["50%"]? Always?
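The catch: R’s quantile() implements nine different definitions, and other software defaults to different ones. A small illustration (the SAS mapping in the comments is my understanding, not from the deck):

```r
x <- 1:4

median(x)                    # 2.5
quantile(x, 0.5)             # 2.5 - default type = 7, agrees with median() here
quantile(x, 0.5, type = 1)   # 2   - inverse empirical CDF, a discontinuous definition
quantile(x, 0.5, type = 2)   # 2.5 - averaging at discontinuities; matches SAS's
                             #       default percentile definition (PCTLDEF=5), AFAIK
```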
34. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
Like it or not – the fact is that SAS® IS the industry standard in clinical trials, and people will use it to re-create your analyses – and will NATURALLY ask if the numbers don’t agree.
[Diagram: who may re-run your work in SAS® – a regulatory agency, a journal, the sponsor-side biostat team, your colleague, a validator]
- it’s not about a “crusade”: “R is better! No! SAS® is better! No! Excel is better!”
- it’s not about favoring anyone (“you think it’s better because it’s expensive!?”)
- it’s about the reality.
35. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
If they ask you about the discrepancies, you can:
1. ignore it (can you?),
2. say „I don’t know, it just happened, but R is right!”,
3. investigate it and respond:
1. both are right, just a different approach 🤷
2. well, R is wrong; I’m gonna fix it or message the authors.
But to respond – you need to know what happened.
A much worse situation: NOBODY found a difference, and you just published the results with errors.
36. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
We do not care whether a package has „good marketing”. It must work well.
Has vignettes! Has an active community! 5 in the rankings. YouTube tutorials. Top popular download on GitHub.
…and has unfixed errors that nobody cares about…
37. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
• nlme – Priority: recommended; linear mixed models with almost all the stuff SAS has
• MCPMod – design and analysis of dose-finding studies
• PMCMRplus – lots of popular non-parametric stuff, dose–response findings, Williams test
• MASS – Priority: recommended; lots of stuff, including glmmPQL!
• boot – Priority: recommended
• nparcomp – lots of non-parametric methods
• frailtySurv – shared frailty models
• rms – strategies for regression modeling, by Prof. Harrell
• geesmv – small-sample Morel’s correction for the GEE sandwich SEs
• ipw – inverse-probability weighting – for GEE under MAR
• multxpert – common multiple testing and gatekeeping procedures, by Prof. Dmitrienko
• PropCIs – a must-have: CIs for proportions
• PKfit – one of the most important tools for PK; not even on CRAN
• bear – one of the best available tools for PK; not even on CRAN
• cmprsk – survival with competing risks
These packages have no marketing. Would you exclude them from your toolkit?
38. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7327187/
39. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
And the real problem is that R is discrepant not only with SAS, but even… with itself.
40. Challenges ► Numerical validation – R vs. SAS®, vs. SPSS®, vs. Stata®, vs… R
After failing several times, we finally decided to validate as much as possible.
This consumes a lot of time and effort, but makes us sleep better.
Package | Function | Version | Dataset   | Test | Completed | Soft1  | Soft2  | Soft3  | Discrep. | Decision | Justif.
pkg1    | fn1()    | 0.6.2   | Trial 1   | #23  | OK        | OK     | OK     | FAILED | ………….    | OK       | ………
pkg1    | fn1()    | 0.6.2   | Journal 2 | #23  | OK        | OK     | FAILED | OK     | …………     | FAILED   | ………
pkg1    | fn1()    | 0.6.2   | Journal 2 | #24  | OK        | OK     | OK     | OK     | …………     | OK       | ………
pkg1    | fn2()    | 0.6.2   | Journal 2 | #25  | OK        | FAILED | FAILED | FAILED | …………     | FAILED   | ………
Validation sources: reference software • textbook formulas (by hand) • other trusted packages • published results (journals/books) • published results (manuals) • code inspection
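A minimal sketch of what one such check can look like (the reference numbers below are computed from the textbook Wilson-score formula, not taken from the deck’s validation suite):

```r
# Validate PropCIs::scoreci() (Wilson score CI for a proportion) against
# hand-computed textbook values; fail loudly on any discrepancy.
library(PropCIs)

ci <- scoreci(x = 15, n = 50, conf.level = 0.95)

reference <- c(0.1910, 0.4375)   # Wilson score interval for 15/50, by hand
stopifnot(all(abs(ci$conf.int - reference) < 1e-3))
```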
41. Challenges ► Ocean of possibilities. But be careful! It’s deep!
Open Source gives you an ocean of possibilities (for doing THE SAME thing)!
OK! Diversity is overall good, but without overdoing it! Let’s imagine I want procedure ABC. R has 10 functions in 5 packages to do ABC in 8 ways. My day has only 24 hours, and I have my work – and a life after hours.
42. Challenges ► Documentation
Documentation quality varies a lot: from dedicated web-books with numerous examples ( https://ardata-fr.github.io/flextable-book/ ) to just a raw manual with no formulas and with references to a paid article or a rare book.
SAS, NCSS, SPSS, Stata have awesome tutorials and manuals – almost courses in statistics. NCSS even provides the input data and the results!
[Side-by-side examples: „Just a basic manual” vs. „You can do a PhD with it!”]
43. Challenges ► Why cannot things be simple?
SAS®: PROC MIXED EMPIRICAL… REPEATED … CS … KR … LSMEANS
R: Kenward-Roger? … a-ha! Use lme4! But wait, I want a marginal model with CS. Random-intercept ≠ CS for negative within-subject correlations! I could use glmmTMB for this, but pbkrtest doesn’t support it.
But there’s nlme! Take nlme::gls(). But pbkrtest doesn’t support nlme::gls(). OK then, let’s use Satterthwaite!
OK: nlme::gls() + emmeans (for LS-means + Satterthwaite). Now I want the robust HC0 („sandwich”) estimator. Get clubSandwich and let emmeans use the adjusted var-cov. Follow it with emmeans::joint_tests(). Double-check the DF, as car::Anova() may have a problem here.
Done! Sigh! … Did you check GitHub for open issues?
Statistics UX
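For concreteness, a hedged sketch of that final pipeline (the data set, variable names and option choices are illustrative; the slide only names the ingredients):

```r
# Marginal model with compound-symmetry correlation, LS-means with
# Satterthwaite DF, and a cluster-robust ("sandwich") var-cov matrix.
library(nlme)
library(emmeans)
library(clubSandwich)

fit <- gls(post_value ~ treatment * time + baseline,
           correlation = corCompSymm(form = ~ 1 | PatID),
           data = trial_data)                 # hypothetical data set

V <- vcovCR(fit, type = "CR0")                # CR0 ~ the HC0-style estimator

em <- emmeans(fit, ~ treatment | time,
              vcov. = V, mode = "satterthwaite")
joint_tests(em)                               # F-tests built from the adjusted vcov
```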
47. Challenges ► Let’s combine it together!
[Flowchart of the typical loop: START! → a package is removed from CRAN → search for a replacement → learn the new package („What does this thing do?!”) → something is wrong! → email the author („Sorry, I’m busy.”) or create a new issue → FIXED? („It works!” / „No, it doesn’t” / „Partially managed…”) → what now!? → another package is needed → … → it works!]
48. Future plans
We plan:
• To research a couple of new tools:
• for work: MMRM (!)
• for CDISC: admiral, sassy, definer, metacore
• for RTF: rtftables, gt
• out of curiosity: tplyr
• for technical work: box
• To focus on CDISC and the preparation for the first big submission.
• To extend the numerical validation of packages.
49. Overall impression
Employing Open Source means accepting the consequences.
The efforts, costs, and extra work cannot be taken lightly in a small CRO.
But it is definitely worth the effort.
In moments of doubt, it’s good to remember that no big deal comes easy.
R is and will be our friend. Even if a demanding one.