An overview of big data in clinical research. Discussion of big data related to real world evidence (RWE), wearable sensor data (IoT), and clinical genomics. Introduces the use of map-reduce infrastructure for big data in biomedicine.
Big Data in Clinical Research
1. Big Data in Clinical Research
Michael Hogarth, MD, FACMI, FACP
Clinical Research Information Officer, UC San Diego Health System
Professor, Dept of Medicine, UCSD Health
2. What is Big Data?
https://searchdatamanagement.techtarget.com/definition/big-data
3. The genesis of the “big data” movement
• Google began the new database development revolution: relational databases could not handle the volume or type of data efficiently enough to power Google search
• What Google built – “BigTable”
– Innovation 1 – column-oriented storage, with each row representing a web page
– Innovation 2 – data is stored across multiple machines (nodes) using the “Google File System”
• The entire web table is split into mini-tables (“tablets”), each a few gigabytes – roughly 100,000 of them
– Innovation 3 – a “map-reduce” computing engine
– Scales to petabytes!
5. Importance of “big data” in healthcare
https://www.incoutlook.com/2019/08/02/big-data-a-game-changer-in-healthcare-industry/
6. Big Data in Clinical Research
Clinical trial efficiency
• Predicting ‘feasibility’ of a trial and reducing uncertainty
• Improving accrual through “smart matching” of trials to potential participants
Real world data and Real-World Evidence (RWE)
• Hypothesis generation
• Pragmatic/large-scale trials – real world “evidence” (RWE)
• Uncovering patterns
Healthcare AI/ML models
• Predictive algorithms
• Assistive systems (image analysis/enhancement)
Drug design
• Pharmacokinetic simulation (in-silico drug discovery)
8. Using “Real World Evidence” (RWE)
Adoption of health IT has resulted in large-scale (massive) amounts of biomedical “digital” data.
“RWE will not replace the need for data from traditional trials; however, technologies supporting RWD are enabling far richer and more diverse information to be collected during drug development.”
Swift et al. “Innovation at the Intersection of Clinical Trials and Real-World Data Science to Advance Patient Care.” Clin Transl Sci (2018) 00, 1-11. https://www.ncbi.nlm.nih.gov/pubmed/29768712
(subtext: the randomized clinical trial is still here and is not dead – but is perhaps becoming an endangered species under pressure from new trial designs and RWE)
9. An Example of an RWE Trial
• ADAPTABLE – Aspirin Dosing: A Patient-Centric Trial Assessing Benefits and Long-Term Effectiveness
• Compares two aspirin doses (81 mg vs. 325 mg)
• Randomizes 20,000 patients with cardiovascular disease to one of the two doses
• Currently underway through the National Patient-Centered Clinical Research Network (PCORnet) – had 600,000 eligible patients
14. RNA-seq – a key advancement for studying transcriptional and post-transcriptional processes
15. Comparing genomic data types in
terms of “big data”
Nature Reviews, Genetics, Volume 19, April 2018
16. Biomedical research often requires you to become a “data wrangler”
Obtaining and preparing “real world data” (RWD) from an EHR
Privacy and your data stewardship
Challenges with healthcare data
Your Data Wrangler Toolbelt
17. RCHSD Clinical Research Seminar Series -
07/28/2020
Types of Biomedical Research Data
– Pre-Clinical Experiments
• a broad range of data from wet-lab experiments, animal models, etc.
• typically created/stored in
lab/bioinformatics systems
– Clinical Trial data
• specific trial-related data collection
• typically collected using electronic case
report forms (eCRFs)
– Clinical Practice data
• generated during routine clinical practice
events
• typically created/stored in an EHR
• clinical practice data = real world data
(RWD)
What is real world data (RWD)?
– “any health record information not
collected as part of a randomized
controlled trial”
– “With RWD, we mean data that are not collected under experimental conditions, but data generated in routine care”
19. Privacy and Healthcare Data --
Being a responsible steward of patient data
Covered entities may disclose PHI for research with “individual authorization”
Circumstances under which research use can proceed without authorization:
• IRB (or privacy board) issues a “waiver of HIPAA authorization”
• must satisfy 3 criteria:
• use/disclosure involves no more than minimal
risk to privacy of the individuals
• the research could not be conducted without the
waiver
• the research could not be conducted without
access to PHI.
• minimal risk means:
• a plan to protect identifiers from disclosure
• a plan to destroy the identifiers at the earliest
opportunity
• written assurance the PHI will not be reused or
disclosed to others
21. Obtaining and Managing UCSDH Data
• Data Stewards - honest brokers
– UCSDH requires that you use a
‘data steward’ as an honest
broker
• You cannot access the warehouse
yourself – HIPAA principle of
“minimum necessary”
– Data stewards are allowed to query “all records” to find a specific cohort
• ACTRI Data Extraction Concierge
Service (DECS)
• The “federated data stewards
program”
– Ophthalmology
– Family Medicine
– O’Brien Center (Nephrology)
– Etc...
• UCSD Health Nightingale data sets – coming in early
2022 – LDS under DUA
– AKI
– Cancer
– COVID
UCSD Health Virtual Research Desktop (VRD)
Shin SY, et al. Healthc Inform Res. 2014;20(2):109-116.
26. Common challenges with healthcare
data
The nature of EHR data (“dirta”)
• Harmonizing to a common set of codes and values
• EHR data is often a proxy for what really happened
• Missing values are common – charting by exception
• Data can be in multiple places and in different forms
Unstructured narrative text contains many of the desired data points
• Need to ‘find’ information in narrative text
• Often requires some form of text mining or natural language processing
Comorbidities
• Elixhauser
• Charlson
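The text-mining bullet above can be sketched in a few lines: pulling ejection-fraction values out of a narrative note with a regular expression. The note text and the pattern are purely illustrative; real pipelines need proper NLP for negation, section detection, and context.

```python
import re

# Hypothetical narrative note -- not real patient data.
NOTE = "Echo today. LVEF 35%. Continue lisinopril. EF improved from 25%."

def find_ef(text):
    """Return all ejection-fraction percentages mentioned in the text.

    Matches 'EF' or 'LVEF', allows a short run of filler words before
    the number, and requires a trailing percent sign.
    """
    return [int(v) for v in
            re.findall(r"(?:LV)?EF[^\d%]{0,15}(\d{1,2})\s*%", text)]

values = find_ef(NOTE)   # picks up both the LVEF and the prior EF
```

A single regex like this is brittle by design here; it only demonstrates why structured fields are preferable to mining free text after the fact.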
27. “Dirta” and why we “harmonize” data
Blood Pressure – what do we really mean ?
If we do an analysis, are we comparing the same things?
Blood pressure stored in system #1 (free text): John S | BP | 1-14-2015 → “150/70 mmHg”
Blood pressure stored in system #2 (structured): Mary P | BP | 1-14-2015 → Systolic BP: 150, Diastolic BP: 70, Type: Arterial Line, Encounter: Inpatient
What happens when we want to compute with data from both systems?
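One answer is a harmonization step that converts the free-text reading from system #1 into the structured systolic/diastolic form of system #2. A minimal sketch, assuming a hypothetical “NNN/NN mmHg” format (the function name and format rules are illustrative, not a real interface):

```python
import re

def harmonize_bp(raw):
    """Parse a free-text reading like '150/70 mmHg' into structured
    systolic/diastolic integers; return None when the text does not
    match the assumed format (flagging it for manual review)."""
    m = re.match(r"\s*(\d{2,3})\s*/\s*(\d{2,3})\s*(?:mmHg)?\s*$", raw)
    if not m:
        return None
    return {"systolic": int(m.group(1)), "diastolic": int(m.group(2))}

record = harmonize_bp("150/70 mmHg")
```

Note what the parse cannot recover: the measurement type (arterial line vs. cuff) and the encounter context were never recorded in system #1, which is exactly the comparability problem the slide raises.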
28. Standard coding systems used with harmonized EHR data
Conditions: ICD-10-CM / SNOMED CT
Labs, vitals, reports (observations): LOINC
Medical procedures: CPT
Medications: RxNorm
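In practice, harmonization means translating each site’s local codes into one of these standard systems. A minimal sketch with a flat lookup table (the local codes are hypothetical; real harmonization relies on curated terminology services and mapping tables, not a hand-built dictionary):

```python
# Hypothetical site-specific lab codes mapped to standard LOINC codes.
LOCAL_TO_LOINC = {
    "GLU": "2345-7",   # Glucose [Mass/volume] in Serum or Plasma
    "HGB": "718-7",    # Hemoglobin [Mass/volume] in Blood
    "NA":  "2951-2",   # Sodium [Moles/volume] in Serum or Plasma
}

def to_loinc(local_code):
    """Translate a site-specific code to LOINC; None flags an
    unmapped code that needs manual curation."""
    return LOCAL_TO_LOINC.get(local_code.upper())
```

The None return path matters: unmapped local codes are one of the main sources of missingness when data from multiple systems are pooled.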
29. How many blood pressure measurement types could there possibly be?
477 types of blood pressure measurements
31. The University of California Clinical Data Warehouse – one of the largest OMOP CDM repositories today
The UCDW has harmonized data for 15M patients in OMOP
The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM)
32. The UC COVID Research Data Set (UC-CORDS)
• How Much Data? (Dec 2021)
– 687,239 patients
– 1,061,471,489 (>1.06 billion) observations
(lab results, vital signs)
– 23,243,659 clinical encounters
• Provided to UC Health researchers
through UC research informatics
units/groups in their respective health
systems
• Cohort
– All COVID tested patients (positive or
negative)
• De-identified to HIPAA LDS - no
personal identifiers
• Data
– Demographics, Diagnoses, Medications,
Laboratory results, Encounters (if seen by
our doctors or hospitalized)
• Refreshed
– Every 2 months
34. Google BigTable (2004) – managing “big data”
• The web circa 2000:
– 2 billion web pages
– 45 terabytes of data
– Contemporary databases (RDBMS) were not able to cope
• Needed to invent a new database
– Adam Bosworth and Jeff Dean
• Google BigTable
– Google distributed file system – “GFS”
– Virtualized a single database “table” across thousands of computers
– Stores data in a “column-oriented” database design
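The “column-oriented” idea can be shown in miniature (a sketch only; BigTable’s actual layout adds tablets, timestamps, and distribution across nodes):

```python
# Row-oriented layout: each record kept together, as in a typical RDBMS.
rows = [
    {"url": "a.com", "title": "A", "pagerank": 0.9},
    {"url": "b.com", "title": "B", "pagerank": 0.4},
]

# Column-oriented layout: each attribute stored contiguously, so a scan
# that needs only one attribute reads only that column.
columns = {
    "url":      ["a.com", "b.com"],
    "title":    ["A", "B"],
    "pagerank": [0.9, 0.4],
}

# Finding high-pagerank pages touches two lists, not every full record.
high_rank = [u for u, r in zip(columns["url"], columns["pagerank"])
             if r > 0.5]
```

At web scale the difference matters: an analytic scan over one column avoids reading the (much larger) page contents stored in the other columns.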
35. “Map-Reduce” (Hadoop/Spark) – analyzing large data sets efficiently and at scale
• Google needed to process large amounts of raw data, such as “crawled” (acquired) documents, web request logs, etc.
• Created a simple computational model called ‘map-reduce’ which can process very large data sets
• Hadoop = an open-source implementation of a map-reduce execution engine
https://www.guru99.com/introduction-to-mapreduce.html
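The canonical map-reduce example is word count. In Hadoop or Spark the map and reduce phases run in parallel across many nodes; the sketch below runs the same three phases serially on one machine to show the shape of the model:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit (key, value) pairs -- one ("word", 1) per token
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # shuffle: group all emitted values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: combine each key's values into a single result
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big compute", "big tables"]
counts = reduce_phase(
    shuffle(chain.from_iterable(map_phase(d) for d in docs)))
# counts["big"] == 3
```

The point of the model is that map calls are independent per document and reduce calls are independent per key, which is what lets the engine spread them across thousands of machines.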
36. Machine Learning
A form of artificial intelligence
• Identifies correlated patterns in order to create predictions
• Traditional “machine learning” – linear regression, logistic regression
• Today’s “machine learning” – leverages artificial neural networks (ANNs)
• “Deep learning” – a deep neural network with many nodal connections within its hidden layers
– designed to work like the visual cortex... most mature in image analysis
– A convolutional neural network (CNN) is a typical deep learning network
Rashidi, et al. Academic Pathology. 2019
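A sketch of the “traditional machine learning” bullet: logistic regression fitted by gradient descent on a toy one-feature problem. The data, learning rate, and iteration count are invented for illustration; real work would use a library such as scikit-learn.

```python
import math

# Hypothetical data: predict an event (1) from a single risk score.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0,   0,   0,   1,   1,   1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit weight w and bias b by gradient descent on the log-loss.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    grad_w = sum((sigmoid(w * x + b) - y) * x
                 for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y)
                 for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

def predict(x):
    return sigmoid(w * x + b) > 0.5
```

Deep learning replaces this single linear unit with many stacked layers of such units, but the fit-by-gradient-descent loop is the same idea.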
37. “Machine Learning” and AI in Healthcare
• AI is not new to healthcare, but “machine learning” and “deep learning” are
new AI techniques enabled by large data sets.
https://mc.ai/awesome-ai-the-guide-to-master-artificial-intelligence/
38. Deep Networks – “Convolutional” Neural Networks (CNNs) began as an approach to image recognition
• a type of multi-layered (deep) ANN designed for recognition of visual features – a “feedforward neural network whose architectures are tailored for minimizing the sensitivity to translations, local rotations and distortions of the input image”
• originally devised in 1989 by Yann LeCun at AT&T Bell Labs to recognize handwritten digits
• inspired by the connectivity pattern seen between neurons in the visual cortex
• A convolution operation emulates the response of an individual neuron to visual stimuli; pooling coalesces outputs from one layer into a single neuron in the next layer
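The two operations named above can be shown in miniature on a tiny 4x4 “image”. A real CNN learns many filters from data; the single fixed filter here is a hand-written vertical-edge detector, chosen only to make the behavior visible:

```python
# A 4x4 image with a vertical edge between the dark and bright halves.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
kernel = [[-1, 1],
          [-1, 1]]  # responds to a left-to-right intensity increase

def convolve(img, k):
    """Valid 2D convolution (no padding, stride 1)."""
    kh, kw = len(k), len(k[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(img[i + di][j + dj] * k[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: each size x size block -> one value."""
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

feature_map = convolve(image, kernel)  # strongest response at the edge
pooled = max_pool(feature_map)         # pooling keeps the peak response
```

The feature map peaks exactly where the edge sits, and pooling preserves that peak while shrinking the map: the translation-tolerance the slide describes.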
39. Example: Deep Learning and Interpretation
of Retinal Images for diabetic retinopathy
• trained an ML system with 494,661 retinal images
• validated (tested) using a dataset of 71,896 images from 14,880 patients
• detection of vision-threatening diabetic retinopathy: AUC 0.958 (95% CI)
• detection of referable diabetic retinopathy: AUC 0.936 (95% CI)
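The AUC reported above has a concrete interpretation: the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. A small sketch on invented scores (not the study’s data):

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of positive-negative pairs ranked
    correctly, counting ties as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

# Hypothetical labels and model scores; one of the nine
# positive-negative pairs is mis-ordered (0.4 vs 0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
```

So an AUC of 0.958 means roughly 96% of positive-negative pairs were ranked correctly, independent of any particular decision threshold.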
41. Did it make a difference?
1 out of 5 felt it had an impact
30% felt it affected the treatment plan
42. Big Data in Drug Design:
“in-silico” drug design through structural
bioinformatics and computer-based simulation
http://www.vls3d.com/index.php/virtual-screening/comments-about-virtuel-screening?start=3
44. Supercomputers - massive parallel processing
• supercomputers split problems into pieces and execute each at the same time – massive parallel processing
• designed for mathematical computation, which occurs in simulations and optimization problems – where multiple co-dependencies exist
Current top supercomputers
Uses of supercomputers 1970-2020
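The split-and-run-simultaneously idea in miniature, using a Python thread pool on one machine (a sketch only; real supercomputers distribute work across thousands of nodes with MPI and related tools):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(lo, hi):
    # one independent piece of the overall problem
    return sum(range(lo, hi))

# Partition sum(range(N)) into four chunks and run them concurrently.
N = 1_000_000
bounds = [(i, min(i + 250_000, N)) for i in range(0, N, 250_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = pool.map(lambda b: chunk_sum(*b), bounds)
total = sum(partials)   # equals sum(range(N))
```

The pattern only pays off when the pieces are truly independent; the co-dependencies mentioned above are what make simulation workloads hard to parallelize.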
45. Computer Simulation and COVID
The SUMMIT supercomputer was used to predict pharmacological therapeutic targets to interfere with SARS-CoV-2 spike protein binding to the ACE2 receptor in type 2 pneumocytes
47. Going beyond “classical” computers
• Only having two states (1 or 0) for a single “bit” creates some limitations:
– The larger the number of input states you want (i.e., a larger number, or a higher number of simultaneous bits to transmit), the higher the number of “wires” you need
– A higher number of wires means more power and heat, unless you can make it all smaller and more compact
• The Apple M1 CPU (8-core, 64-bit) has 16 billion transistors and implements a 5 nanometer (nm) “transistor gate length” (the smallest yet achieved)
– At some point, you reach a limit in terms of what you can compute with a “classical computer” (binary computer)
49. Have we reached “quantum supremacy”?
• A Google-designed quantum computer was used in an experiment to perform a calculation that would require a classical supercomputer 10,000 years to complete
• The calculation was to predict the likelihood of outcomes from a random-number generator (a problem first crafted by Google physicists in 2016)
• The 53-qubit quantum computer performed the calculation in 200 seconds
• Was it a contrived experiment?