Journal for Clinical Studies: Close Cooperation Between Data Management and Biostatistics Benefits Data Quality

Journal for Clinical Studies, Volume 8 Issue 5
Technology
Introduction
Every clinical trial is a source of multidimensional data,
analysed in order to answer questions presented in
hypotheses on safety, efficacy and other topics. For the
analysis to be reliable and successful, the recorded data
must be of sufficient quality, i.e. complete, correct and
integral. Keeping invalid or incomplete data in a database
may cause incorrect calculation results, leading to invalid
conclusions and wrong decisions. This is not only a matter
of potential consequences for the sponsor but also of
ethics. As there are living humans behind the numbers
generated, this issue must not be taken lightly.
Thus, the process of data validation becomes a key
aspect of every trial. Although the process of checking
and cleaning data is usually performed by the data
management team, a close cooperation with biostatistics
may significantly improve the results by introducing both
statistical knowledge and the ability to create specialised,
programmatic tools and advanced queries giving a good
foundation for deeper and faster data investigations.
Reasons and Types of Invalid Data
Invalid data is usually caused by human error.
EDC forms containing fields insufficiently protected by
edit checks increase the chance of errors. Obviously,
prevention is better than cure, and EDC
forms should always be made resistant to errors. Reality,
however, often involves compromises. Fields that allow
free text to be entered are a good example.
Sometimes one has to deal with forms that are already
poorly designed. Things get even worse if the EDC software
does not prevent the entry of incorrect values, but merely
displays alerts to the user. This is not uncommon.
Invalid Results
Results of laboratory examinations are a good
example of what can go wrong. Possible issues include
typos, invalid decimal separators, textual results mixed
with numerical ones, results mixed with manual comments
or messages generated by the system/machine (“sample
hemolysis”, “below assay range”), units entered in many
forms, incorrectly assigned units (e.g. “G/L” confused with
“g/L”), missing lower or upper limits of reference ranges,
switched lower and upper limits of reference ranges,
incorrect assignments between reference ranges and
gender or age, incorrectly assigned flags (high, low,
abnormal), and dates and times entered in a wrong format,
to name just a few. Even data transferred automatically
from a laboratory into EDC software
through transfer files or a programmatic API can be
invalid due to technical issues.
Multiple units make results incomparable and without
the process of unification, they cannot simply be included
in the analysis. Simple group-by analysis enumerates all
the entered units and helps to prepare a list of conversion
factors. It is generally a good idea to make all units SI
compliant.
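The group-by enumeration and the resulting list of conversion factors can be sketched as follows. This is only an illustration: the article's own tools are written in R, Python is used here for brevity, and the records, the target unit (10^9/L) and the names `records`, `to_target` and `unify` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical laboratory records: (test code, reported value, reported unit).
records = [
    ("WBC", 6.1, "10^9/L"),
    ("WBC", 5.4, "G/L"),
    ("WBC", 7200.0, "/uL"),
    ("WBC", 4.9, "x10^3/uL"),
    ("WBC", 6.8, "10^9/L"),
    ("WBC", 6.0, "%"),
]

# Group-by step: count how often each (test, unit) pair occurs.
unit_counts = Counter((test, unit) for test, _, unit in records)

# Conversion factors to the assumed target unit 10^9/L, prepared by hand
# after reviewing the enumeration above.
to_target = {
    ("WBC", "10^9/L"): 1.0,
    ("WBC", "G/L"): 1.0,        # giga per litre equals 10^9/L
    ("WBC", "x10^3/uL"): 1.0,   # 10^3 per microlitre equals 10^9/L
    ("WBC", "/uL"): 0.001,      # 1 per microlitre equals 10^6/L
}

# Units that were entered but have no conversion factor need a manual query.
unmapped = sorted(key for key in unit_counts if key not in to_target)


def unify(test, value, unit):
    """Return the value in the target unit, or None if no factor is known."""
    factor = to_target.get((test, unit))
    return None if factor is None else value * factor


unified = [unify(*record) for record in records]
```

Here the "%" entry has no meaningful conversion for this test, so it ends up in `unmapped` and is left as `None` rather than being silently included in the analysis.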
Missing Observations
While the harmful impact of invalid data is obvious, probably
not everyone realises that missing data may affect
statistical computations to no lesser degree. Things get
worse if the missingness is not random but
follows a pattern. A smaller sample size may increase
dispersion in the data, affecting the values of descriptive
statistics and the estimation of errors. Statistical tests
lose their power.
Bias in parameter estimation may be introduced as well.
The design of a trial may become unbalanced, which often
leads to confounding. Missing observations may
distort distribution shapes. Assumptions of statistical
methods may be violated, which makes statistical
inference unreliable. Missing classes of observations may
make the analysis impossible to perform or interpret.
Advanced imputation techniques are commonly in
use; however, they are still only an attempt to fight the
fire. One should not forget that they introduce artificial
data, even if a statistical model says the values are plausible.
Moreover, misguided data imputation may completely distort
the picture of the situation.
Suspicious Observations
Suspicious observations form the next category of issues
that significantly lower the quality of data. An observation
can be considered suspicious for many reasons. Its value
may be too high or too low, acting as an outlier and
significantly affecting the results of an analysis or causing
the analysis to fail entirely. Such values may be
typical of a specific disease (ESR, AlAT) or indicate a
human mistake, so they should be investigated carefully.
Values that look normal and lie inside a
normal range may also reveal worrying patterns, indicating
a potentially artificial nature of the entered data and
possibly fraud. Investigations entailed by this class of
problems are particularly challenging and subtle.
It is not easy to cope with these problems in a
transparent and formalised world of clinical trials when
they happen. Suspicious observations, rich in outliers,
can really damage calculations, distort results and lead
to wrong conclusions. Even if a solution in the form of a
robust statistical method exists, it is challenging to apply,
due to the fact that hypotheses are usually stated a priori
along with a corresponding and closed set of statistical
methods that will be used.
Fraud and Misconduct
Fraud and misconduct, caused intentionally or by
insufficient training, can result in damages which are
often impossible to fix and are very expensive in the end.
One would say that it is far better to have missing rather
than incorrect data. Inappropriate IMP management,
handling or administration procedures, including
accidental switching of drug, placebo or comparator as
well as incorrect examination techniques applied can
damage the data in an unrecoverable manner. This is
because what is done cannot be undone. The sooner it
is detected and eliminated, the better, all the more for
the fact that it often requires long-lasting and difficult
investigation in order to collect all the evidence.
Solutions
After a statistical analysis plan and protocol are prepared
and signed, one cannot simply alter things, especially the
set of statistical methods and procedures, without being
accused of manipulation. This clearly shows how
extremely important it is to ensure data completeness
and correctness long before the database is finally locked
and the analysis starts. Because data validation
and correction take time, involve
a lot of additional communication and consume
resources, postponing them until shortly before
the lock is very risky.
At KCR we make every effort to minimise the risk of
having to deal with invalid and incomplete data later on,
and of work being performed by poorly trained staff. For
this purpose
we have introduced a close cooperation between data
management and biostatistics. While data management
personnel are typically responsible for preparing
well-designed, CDISC-compliant EDC forms and performing
periodic data reviews, the biostatistics department
provides both statistical support and programmatic tools
for advanced data checking and transformation.
The following kinds of support are currently applied
at KCR: preparation-stage analysis; assisted data
validation; creating tools for unassisted data validation;
writing screening programs for unsolicited, ad-hoc data
review; providing solutions for automated scour analysis;
programming solutions for data exchange between
information systems, and last but not least – training and
mentoring.
Preparation-stage Analysis
Every trial starts with a set of common preliminary steps
that have a critical impact on the data quality. One of
the most important prerequisites is to properly design the
EDC forms. The key thing is to ensure their compliance with
the CDISC CDASH specification. The second step is to secure
input fields with appropriate edit checks to prevent the
user from entering nonsense data. In addition, text inputs
should be encoded with dictionaries whenever possible.
This refers not only to fields intended to be medically
encoded (MedDRA, LOINC, ICD, etc.) but to any field
whose content can be organised in a dictionary
to avoid multiple names for a single thing. For encoded
fields, the option allowing users to enter their own text
should be avoided if possible, as it defeats the purpose of
encoding.
All these actions are mostly performed by data
management; however, the programming
skills offered by biostatistics make an
excellent opportunity to improve the process
by preparing scripts querying the database
in search of missing rules, checks and
violations of certain naming conventions.
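A script of the kind described above could look roughly like the following sketch. It is written in Python purely for illustration (the article's tools use R), and everything specific in it is an assumption: the structure of the `fields` metadata, the naming rule in `NAME_PATTERN` and the two rules applied in `audit_field` are hypothetical and not taken from any particular EDC system.

```python
import re

# Assumed naming convention: upper-case letters, digits or underscores,
# starting with a letter, at most 8 characters.
NAME_PATTERN = re.compile(r"^[A-Z][A-Z0-9_]{0,7}$")

# Hypothetical form metadata exported from an EDC system.
fields = [
    {"name": "LBORRES", "type": "numeric", "edit_checks": ["range"], "dictionary": None},
    {"name": "LBORRESU", "type": "text", "edit_checks": [], "dictionary": "UNITS"},
    {"name": "weight kg", "type": "numeric", "edit_checks": [], "dictionary": None},
    {"name": "AETERM", "type": "text", "edit_checks": [], "dictionary": None},
]


def audit_field(field):
    """Return a list of findings for one field definition."""
    findings = []
    if not NAME_PATTERN.match(field["name"]):
        findings.append("name violates convention")
    if field["type"] == "numeric" and not field["edit_checks"]:
        findings.append("no edit check")
    if field["type"] == "text" and field["dictionary"] is None:
        findings.append("no dictionary")
    return findings


# Keep only the fields that have at least one finding.
report = {}
for field in fields:
    findings = audit_field(field)
    if findings:
        report[field["name"]] = findings
```

The resulting `report` lists only problematic fields, which keeps the output short enough to be reviewed by data management before the forms go live.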
Assisted Data Validation
This kind of support covers analyses done on
request and usually together with personnel
from other departments, like clinicians,
administrators and managers. It is mainly
used for deeper investigations which cover
various aspects of a trial and involve much
more advanced methods than usual.
Various statistical methods are in use, for
example:
• an extended set of descriptive statistics, including
robust, both classic and positional measures
• graphical analyses using various combinations of
scatterplots, boxplots, mosaic plots, histograms
and various types of density plots, as well as custom
plots revealing specific patterns in data
• analysis of possible outliers done both graphically
and mathematically
• analysis of suspicious data by looking for patterns
in coexisting values in view of surrounding
circumstances, involving graphical and mathematical
methods, like decision trees
• analysis of randomness in data samples
• analysis of patterns in missing data by using
specialised graphs
[Chart 1: An exemplary diagram revealing typical issues found in laboratory data:
missing values, incomplete and missing reference ranges, incorrect units assigned.
X-axis: result index (0 to 100); y-axis: log result [x10^9/L] (1x10^-3 to 1x10^+2);
legend: clunit (%, 10^3, 10^9/L, 1000/uL, G/L, x10^3/ul, x10^6/uL), RefRange
(lower, upper), missing reference range end (both, lower, upper).]
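As an illustration of the mathematical side of outlier analysis with robust, positional measures, the sketch below flags values by their modified z-score, which is based on the median and the median absolute deviation (MAD). This is one common technique chosen for the example, not necessarily the method used by the author; the cut-off of 3.5 is a conventional default, and the function name and sample values are hypothetical. Python is used for illustration only.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a robust counterpart of the standard deviation.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are approximately normal.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical ESR results (mm/h) with one suspicious entry.
esr = [12, 14, 13, 15, 11, 13, 140]
suspicious = mad_outliers(esr)
```

Because the median and MAD are barely affected by the extreme value itself, the entry 140 is flagged, whereas a mean and standard deviation computed on the same sample would be inflated by it.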
We have found that graphical methods are especially
useful in communication with clinicians and managers.
Well-designed graphics immediately reveal patterns
and let the user grasp a lot of information at once. They
work particularly well when searching for patterns in missing
data, investigating possible fraud and reviewing
laboratory data.
A good example of such activity is a process of
reviewing results of laboratory tests expressed in various
units. By applying a set of conversion factors between
units, it is possible to unify all values and show them on
a common chart along with reference ranges and other
information. This shows immediately which units were
chosen and if they are valid, whether observations have
incorrect values or if a corresponding reference range
(or one of its ends) is missing. This message is easy to
understand and reduces the need to get through long
tables of numbers.
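The checks that such a chart makes visible can also be expressed as simple rules applied to each unified result. The following sketch is hypothetical (function name, labels and example values are assumptions) and uses Python for illustration only; it classifies a single result against its reference range.

```python
def check_result(value, low, high):
    """Return a list of issue labels for one result in the unified unit."""
    issues = []
    if value is None:
        issues.append("missing value")
    if low is None and high is None:
        issues.append("missing reference range")
    elif low is None:
        issues.append("missing lower limit")
    elif high is None:
        issues.append("missing upper limit")
    elif low > high:
        issues.append("switched limits")
    elif value is not None and not (low <= value <= high):
        # Often a sign of a wrongly assigned unit rather than a true finding.
        issues.append("outside reference range")
    return issues


# Hypothetical WBC results in 10^9/L with a reference range of 4.0 to 10.0.
examples = [
    check_result(7.2, 4.0, 10.0),
    check_result(7200.0, 4.0, 10.0),
    check_result(5.0, None, 10.0),
    check_result(5.0, 10.0, 4.0),
]
```

A value such as 7200 against a range of 4.0 to 10.0 is more likely a unit problem (a count per microlitre entered as 10^9/L) than a real result, which is exactly the kind of pattern that stands out on a logarithmic chart.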
Assisted Data Validation – Fraud and Misconduct
The detection of potential fraud and misconduct involves
both graphical and statistical methods. At the first stage,
the biostatistics team tries to picture the situation with
simple plots, which are then discussed in a team of
clinicians, managers and other specialists. All doubtful
patterns are examined by statisticians using various simple
and advanced, multidimensional methods. In the end,
the statisticians present findings and recommendations
for decision-making. Such investigation can reveal
intentional, harmful activity as well as showing certain
weaknesses of procedures and deficits in training.
Abnormally low or high dispersion in data,
relationships between means and dispersions, highly
skewed distributions (when not expected), departures
from shapes of distribution characterised in a protocol,
unexpected patterns in data like “steps” and “clusters”,
strange relationships between variables, unexpected
patterns in missing data, periodicity in occurrences of
specific issues and many other things can be detected
by well-trained biostatisticians and revealed before
clinicians and managers.
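One of the signals listed above, abnormally low or high dispersion, can be screened for with a very simple comparison between sites. The sketch below is only a first-pass illustration in Python, not the author's method: the site data are invented, and the ratio limits of 0.25 and 4.0 are arbitrary assumptions that would need to be justified for a real trial.

```python
from statistics import median, stdev

# Hypothetical systolic blood pressure readings (mmHg) per site.
site_values = {
    "site_1": [118, 126, 131, 122, 140, 135],
    "site_2": [121, 133, 128, 117, 138, 125],
    "site_3": [120, 120, 121, 120, 121, 120],
    "site_4": [115, 129, 137, 124, 119, 132],
}


def flag_dispersion(groups, low=0.25, high=4.0):
    """Flag sites whose standard deviation is far from the typical site SD."""
    sds = {site: stdev(values) for site, values in groups.items()}
    typical = median(sds.values())
    if typical == 0:
        return {}
    flags = {}
    for site, sd in sds.items():
        ratio = sd / typical
        if ratio < low:
            flags[site] = "abnormally low dispersion"
        elif ratio > high:
            flags[site] = "abnormally high dispersion"
    return flags


flags = flag_dispersion(site_values)
```

A flag is a reason to look closer, not evidence of fraud: a site with nearly identical readings may be fabricating data, but it may equally be rounding values or using a different measurement procedure.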
Creation of Tools for Unassisted, Repeatable Data
Validation
The key to success is to perform data checking as
often as possible. Daily checking is not unusual. On
the other hand, it may become a very time-consuming
process, and frequently involving the biostatistics team
in running the required analyses does not seem to be the
best option. The fact that many valuable analyses do not
require any statistical advice has helped us to develop a
reporting tool that can be used by the data management
staff alone.
The first step is to create a list of required analyses,
where items are prioritised and grouped by predefined
categories. For each report, a set of parameters and
their default values are determined as well. The next
step refers to technical matters, like the selection of the
technology to be used, choice of a method of accessing
the database, description of a user authorisation process,
shape of a graphical user interface, selection of the
desired output formats, etc. Since long-lasting analyses
slow down the database, its content should be replicated
to another instance or exported to an intermediate
file (XML, CSV, etc.) before the analysis. In order to
save money, the chosen technology should allow the
utilisation of already existing resources, i.e. hardware,
software, statistical programmers and administrators. For
example, if R programmers are already on board, R
should be considered first as the default development
platform, ahead of other technologies (.NET, Java,
PHP, etc.) which would require the hiring of additional
programmers.
We decided to create the tool as a self-contained,
windows-based application hosted entirely within R.
GNU R is a well-known, powerful, acclaimed
and free statistical package, as well as a high-level
programming language. It is a strong SAS competitor,
used worldwide by millions of users, huge corporations
and organisations, including the FDA. R is an open-source
project, developed by the R Core Team, and supported
by the R Consortium, which consists of companies like
Microsoft, Oracle, IBM and Google.
The contents of the R library address practically every
topic in biostatistics, including clinical research. R is
capable of reading data and producing output in various
formats, including SAS datasets, Microsoft Office and
PDF documents. Extensive support for querying numerous
kinds of data sources (also via SQL), implementation of the
reproducible research paradigm, three advanced charting
systems, the ability to host embedded user interfaces and
web applications, full portability (understood as the ability
to run without installation on almost every operating
system) and a huge, dynamic community of users make R a
good candidate for a reliable programmatic environment.
The created tool is capable of running a wide range of
laboratory data reconciliation and trial-specific
analyses. The implemented set of analyses allows for
detection of: missing visits, empty mandatory fields,
inconsistencies in certain data domains, various kinds
of misconduct, discrepancies between the database and
specification in units, normal ranges and flags, missing
laboratory examinations, departures from a schedule
described in the protocol and invalid results, to name
only a few. It has proven its usefulness in everyday
practice. The full set of analyses now takes only a few
minutes, and a single report just a few seconds, whereas
previously it took long hours to create a corresponding
Excel report manually. By using the tool we were able to
detect serious issues and apply remedies before
the situation got serious.
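A report on missing laboratory examinations of the kind listed above can be derived by comparing the protocol schedule with what is actually recorded. The sketch below is a minimal illustration in Python (the tool described in the article is written in R); the schedule, the recorded results and the function names are all hypothetical.

```python
VISITS = ["Screening", "Day 1", "Day 2"]

# Hypothetical protocol schedule: visits at which each test is required.
schedule = {
    "RBC": {"Screening", "Day 1", "Day 2"},
    "WBC": {"Screening", "Day 1"},
}

# Hypothetical recorded results as (subject, test, visit) triples.
recorded = {
    (3, "RBC", "Screening"), (3, "RBC", "Day 1"), (3, "RBC", "Day 2"),
    (3, "WBC", "Screening"),
}


def status(subject, test, visit):
    """Classify one cell of the report as OK, MISSING or N/A."""
    if visit not in schedule[test]:
        return "N/A"  # the protocol does not require this test at this visit
    return "OK" if (subject, test, visit) in recorded else "MISSING"


def build_report(subjects):
    """Return one row per subject and test with a status for every visit."""
    rows = []
    for subject in subjects:
        for test in sorted(schedule):
            cells = [status(subject, test, visit) for visit in VISITS]
            rows.append((subject, test, *cells))
    return rows


rows = build_report([3])
```

Distinguishing "N/A" from "MISSING" matters: only the latter should generate a query to the site.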
Screening Ad Hoc Analyses
The process of writing programs for the final statistical
report is a perfect moment for assessing the quality of
the collected data long before analysing it. We write
dedicated “screening programs” for this purpose and use
them to check whether the data
is clean enough to perform a certain part of the analysis.
Screening analyses are valuable due to the nature
of their creation: while writing the statistical analysis
program, the statistician plays a lot with the data by
writing a number of queries and checking the content
of a database in many ways. This often results in useful
queries, which normally might have never been requested.
By the use of the reproducible research paradigm
implementation available in R, it is possible to embed
these analyses directly into the main statistical analysis
program.
Automated Scour Analysis
This is an automated enhancement of the screening data
validation, working in the background, and has more of an
“alerting” nature. A program scours the database content
periodically in search of specific issues and reports findings
via email or stores them in an HTML log. The time
required to complete such an analysis is
of low importance, there is no direct, intended interaction
between ordinary users and the system, and R is not
resource-consuming and can be deployed on a machine
with any architecture. Together, these facts make it
possible to implement
the tool on simplified minicomputers like the Raspberry Pi.
This eliminates the need to buy a new machine or install
new software on an existing, stable server. An additional
small (3.7”) breadboard with an LCD touchscreen enables
limited interaction with the script.
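The core of such a background program is a set of checks that are run over the current data, with the findings written to an HTML log. The sketch below shows that core in Python for illustration only; the check, the row layout and the function names are hypothetical, and scheduling (for example by a system timer) and email delivery are deliberately left out.

```python
import html
from datetime import datetime


def empty_mandatory_fields(rows):
    """Report every empty field; all fields are assumed to be mandatory."""
    return [
        f"subject {row['subject']}: empty mandatory field '{key}'"
        for row in rows
        for key, value in row.items()
        if value in (None, "")
    ]


def scour(rows, checks):
    """Run every check over the data and collect the findings."""
    findings = []
    for check in checks:
        findings.extend(check(rows))
    return findings


def to_html_log(findings, when):
    """Render the findings of one run as a small HTML fragment."""
    items = "".join(f"<li>{html.escape(finding)}</li>" for finding in findings)
    return f"<h1>Scour run {when:%Y-%m-%d %H:%M}</h1><ul>{items}</ul>"


# Hypothetical snapshot of database content.
rows = [
    {"subject": 3, "visit_date": "2016-09-01", "result": "6.1"},
    {"subject": 5, "visit_date": "", "result": None},
]
findings = scour(rows, [empty_mandatory_fields])
log = to_html_log(findings, datetime(2016, 9, 29, 16, 0))
```

Keeping each check as a separate function makes it easy to add trial-specific checks without touching the reporting code.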
[Scheme 1: An overall architecture of a typical reporting system. Components shown:
EDC software; direct access via database interfaces (ODBC/JDBC, SQL); access via
export (CSV, XML); scripts, queries, templates and dictionaries; user interface;
simple data inspector; output formats (HTML with CSS, PDF). Example report shown
in the scheme:]
Site ID  SubjID  Lab Test  Screening  Day 1    Day 2
1        3       RBC       OK         OK       OK
1        3       WBC       OK         MISSING  N/A
1        4       ESR       OK         MISSING  MISSING
1        5       Hb        OK         OK       N/A
2        5       B-HCG     N/A        OK       MISSING

Data Converters
A data converter is a kind of program which transforms
data from one form to another. Its sole task is to
eliminate the human factor during the process of data
transformation as much as possible.
Transferring results of clinical examinations from an
external laboratory into an EDC database, followed by
additional data integrity checks, makes a good example
of such a process. At KCR we build data converters
every time the format of received data needs to be
adjusted. As before, R is
used for that purpose, which significantly facilitates
complicated operations on data spread over multiple,
heterogeneous sources. Advanced querying capabilities
together with the availability of interfaces to numerous
database engines make transferring data
much simpler than in traditional, general-purpose
programming languages; it can often be done in very few
lines of code.
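A minimal converter of this kind might look like the sketch below. It is an illustration in Python rather than the R code used at KCR, and the input layout (semicolon-separated, decimal comma, day.month.year dates) and the output column names are assumptions made for the example. Rows that fail the integrity checks are set aside for review instead of being loaded.

```python
import csv
import io
from datetime import datetime

# Hypothetical transfer file received from an external laboratory.
LAB_FILE = """subject;test;result;unit;date
3;WBC;6,1;G/L;01.09.2016
5;WBC;hemolysis;G/L;02.09.2016
"""


def convert(text):
    """Convert laboratory rows to the target layout; collect rejected rows."""
    converted, rejected = [], []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        try:
            subject = int(row["subject"])
            # Decimal comma to decimal point; textual results raise ValueError.
            value = float(row["result"].replace(",", "."))
            date = datetime.strptime(row["date"], "%d.%m.%Y").date()
        except ValueError:
            rejected.append(row)
            continue
        converted.append({
            "SUBJID": subject,
            "LBTESTCD": row["test"],
            "LBORRES": value,
            "LBORRESU": row["unit"],
            "LBDTC": date.isoformat(),
        })
    return converted, rejected


converted, rejected = convert(LAB_FILE)
```

The second row carries a machine message instead of a number, so it lands in `rejected`, where it can be turned into a query rather than corrupting the numeric column.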
Training and Mentoring
Sharing knowledge about possible issues that can happen
to data as well as emphasising their impact on the
analysis results is no less important than the analytical
support itself. If people understand why certain matters
are so important, they are more cooperative and follow
the rules more willingly. In order to raise a better, more
general awareness in these matters, we decided to
organise a series of courses for non-statisticians. The
audience has demonstrated high interest, which confirms
that our efforts and direction were right.
Summary
Data validation is a process of great importance, having
significant implications for the reliability of the final
data analysis. There are many possible sources of issues,
which makes it really difficult to identify them all and
react quickly enough. From the early stages of a trial to
its very end, at every turn, this is where the programmatic
and statistical support provided by the biostatistics team
comes to the rescue. At KCR, both departments closely
cooperate with each other and have been organised in a
common biometrics unit in order to facilitate the flow of
information.
References
1. Oracle Corporation, “Scaling R to the Enterprise. Using R for Enterprise-level Performance, Scalability, Ease of Production Deployment, and Security”, An Oracle White Paper, July 2016, http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/bringing-r-to-the-enterprise-1956618.pdf
2. Olszewski Adrian, “Is R suitable enough for biostatisticians involved in clinical research and evidence-based medicine?”, June 15th 2015, http://r-clinical-research.com
3. Smith David, Microsoft Corporation (formerly Revolution Analytics), “FDA: R OK for drug trials”, June 21st 2012, http://blog.revolutionanalytics.com/2012/06/fda-r-ok.html
4. Smith David, Microsoft Corporation (formerly Revolution Analytics), “Companies using R in 2014”, May 23rd 2014, http://blog.revolutionanalytics.com/2014/05/companies-using-r-in-2014.html
Adrian Olszewski is a Biostatistician in the
Biometrics & Clinical Trial Data Execution
Systems Department at KCR, a contract
research organisation (CRO). Adrian is
involved in delivering informatics and
analytical solutions for medicine, pharmacy
and clinical laboratory diagnostics. He has a
profound knowledge in statistics in the field of evidence-
based medicine, especially in clinical research. Adrian is
responsible for providing comprehensive support for trials
from the early design considerations through the data
analysis – including interim evaluations – to the final
report. Adrian is also involved in various external projects
on widely understood data analysis and applications of
the R statistical package. Mr Olszewski holds a Master of
Science (MSc) degree in Computer Science.
Email: info@kcrcro.com
