Big Data in Clinical Research
Michael Hogarth, MD, FACMI, FACP
Clinical Research Information Officer, UC San Diego Health System
Professor, Dept of Medicine, UCSD Health
What is Big Data?
https://searchdatamanagement.techtarget.com/definition/big-data
The genesis of the “big data” movement
• Google sparked a new wave of database development because relational databases could not efficiently handle the volume or variety of data behind Google Search
• What Google built – “BigTable”
– Innovation 1 – column-oriented storage, with each row representing a web page
– Innovation 2 – data is distributed across multiple machines (nodes) using the “Google File System”
• The entire web table is split into roughly 100,000 mini-tables (“tablets”), each a few gigabytes in size
– Innovation 3 – a “map-reduce” computing engine
– Scales to petabytes!
Data Today
https://twitter.com/nafisalam/status/867359592006733824/photo/1
Importance of “big data” in healthcare
https://www.incoutlook.com/2019/08/02/big-data-a-game-changer-in-healthcare-industry/
Big Data in Clinical Research
Clinical trial efficiency
• Predicting ‘feasibility’ of a trial and reducing uncertainty
• Improving accrual through “smart matching” of trials to potential participants
Real-world data and real-world evidence (RWE)
• Hypothesis generation
• Pragmatic/large-scale trials – real world “evidence” (RWE)
• Uncovering patterns
Healthcare AI/ML models
• Predictive algorithms
• Assistive systems (image analysis/enhancement)
Drug design
• Pharmacokinetic simulation (in-silico drug discovery)
Big Data and Real-World Evidence (RWE)
Using “Real World Evidence” (RWE)
Adoption of health IT has resulted in
large scale (massive) amounts of
biomedical “digital” data
“RWE will not replace the need for data from traditional trials; however, technologies supporting RWD are enabling far richer and more diverse information to be collected during drug development.”
Swift et al. “Innovation at the Intersection of Clinical Trials and Real-World Data Science to Advance Patient Care.” Clin Transl Sci (2018) 00, 1-11. https://www.ncbi.nlm.nih.gov/pubmed/29768712
(subtext: the randomized clinical trial is still
here and is not dead - but is perhaps
becoming an endangered species under
pressure from new trial designs and RWE)
An Example of an RWE Trial
• ADAPTABLE – Aspirin Dosing: A
Patient-Centric Trial Assessing
Benefits and Long-Term
Effectiveness
• Compares two aspirin doses
(81mg vs. 325mg)
• Randomizes 20,000 patients with cardiovascular disease (CVD) to one of the two doses
• Currently underway through the
National Patient-Centered Clinical
Research Network (PCORNet) –
had 600,000 eligible patients
Wearable Sensors – Billions of data points
Mobile Sensors
RWE Trial Using a Wearable Sensor
Clinical Genomics
https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
RNA-seq – a key advance for measuring gene expression and post-transcriptional processes
Comparing genomic data types in
terms of “big data”
Nature Reviews, Genetics, Volume 19, April 2018
Biomedical research often requires you to become
a ”data wrangler”
Obtaining and preparing
“real world data” (RWD) from an
EHR
Privacy and your data stewardship
Challenges with healthcare data
Your Data Wrangler Toolbelt
RCHSD Clinical Research Seminar Series -
07/28/2020
Types of Biomedical Research Data
– Pre-Clinical Experiments
• a broad range of data from wet-lab experiments, animal models, etc.
• typically created/stored in
lab/bioinformatics systems
– Clinical Trial data
• specific trial-related data collection
• typically collected using electronic case
report forms (eCRFs)
– Clinical Practice data
• generated during routine clinical practice
events
• typically created/stored in an EHR
• clinical practice data = real world data
(RWD)
What is real world data (RWD)?
– “any health record information not
collected as part of a randomized
controlled trial”
– “With RWD, we mean data that are not
collected under experimental conditions,
but data generated in routine care”
Real-World Data in an EHR
Privacy and Healthcare Data --
Being a responsible steward of patient data
Covered entities may disclose PHI for research with “individual authorization”
Circumstances under which research
use can proceed without authorization
• IRB (or privacy board) issues a ”waiver of
HIPAA authorization”
• must satisfy 3 criteria:
• use/disclosure involves no more than minimal
risk to privacy of the individuals
• the research could not practicably be conducted without the waiver
• the research could not practicably be conducted without access to PHI
• minimal risk means:
• a plan to protect identifiers from disclosure
• a plan to destroy the identifiers at the earliest
opportunity
• written assurance the PHI will not be reused or
disclosed to others
https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#protected
What is a “HIPAA Limited Data Set (LDS)”?
Obtaining and Managing UCSDH Data
• Data Stewards - honest brokers
– UCSDH requires that you use a
‘data steward’ as an honest
broker
• You cannot access the warehouse
yourself – HIPAA principle of
“minimum necessary”
– Data stewards are allowed to query “all records” to find a specific cohort
• ACTRI Data Extraction Concierge
Service (DECS)
• The “federated data stewards
program”
– Ophthalmology
– Family Medicine
– O’Brien Center (Nephrology)
– Etc...
• UCSD Health Nightingale data sets – coming in early
2022 – LDS under DUA
– AKI
– Cancer
– COVID
UCSD Health Virtual Research Desktop (VRD)
Shin SY, et al. Healthc Inform Res. 2014;20(2):109-116.
The ACTRI Data Extraction Concierge Service (DECS)
Epic slicer dicer and cohort discovery
Using Slicer-Dicer to facilitate DECS exports
The UCSD Health Virtual Research Desktop
What is a VRD?
Common challenges with healthcare
data
The nature of EHR data
(“dirta”)
• Harmonizing to a
common set of codes
and values
• EHR data is often a
proxy for what really
happened
• Missing values are
common -- charting by
exception
• Data can be in multiple
places and in different
forms
Unstructured narrative text
contains many of the
desired data points
• Need to ‘find’ information
in narrative text
• Often requires some
form of text mining or
natural language
processing
Comorbidities (a scoring sketch follows below)
• Elixhauser
• Charlson
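For illustration, here is a minimal Python sketch of computing a Charlson comorbidity score from a patient's condition list. The weights shown are a small, illustrative subset of the original Charlson weights; a real implementation would map ICD-10-CM codes to the full set of Charlson categories.

```python
# Minimal sketch: Charlson comorbidity score from a set of conditions.
# The weights below are an illustrative subset of the original Charlson
# weights, not the complete index.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "diabetes_uncomplicated": 1,
    "moderate_severe_renal_disease": 2,
    "metastatic_solid_tumor": 6,
}

def charlson_score(conditions: set[str]) -> int:
    """Sum the weights of the Charlson categories the patient has."""
    return sum(CHARLSON_WEIGHTS.get(c, 0) for c in conditions)

print(charlson_score({"congestive_heart_failure", "metastatic_solid_tumor"}))  # 7
```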
“Dirta” and why we “harmonize” data
Blood Pressure – what do we really mean?
If we do an analysis, are we comparing the same things?
Blood pressure stored in system #1 (free text):
John S | BP | 1-14-2015 | “150/70 mmHg”
Blood pressure stored in system #2 (structured fields):
Mary P | BP | 1-14-2015
Systolic BP: 150 | Diastolic BP: 70 | Type: Arterial Line | Encounter: Inpatient
What happens when we want to compute with data from both systems? (A harmonization sketch follows below.)
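To compute across both systems, a data wrangler first maps each record onto one common schema. The Python sketch below assumes the two record shapes shown above; the field names are hypothetical, while LOINC 8480-6 (systolic) and 8462-4 (diastolic) are the standard observation codes.

```python
import re

# Minimal sketch: harmonize BP readings from two systems into one schema.
# Input record shapes are assumptions for illustration; LOINC 8480-6
# (systolic BP) and 8462-4 (diastolic BP) are standard codes.

def from_system1(record: dict) -> dict:
    """System #1 stores BP as free text, e.g. '150/70 mmHg'."""
    m = re.match(r"(\d+)\s*/\s*(\d+)", record["bp_text"])
    return {"patient": record["patient"], "date": record["date"],
            "8480-6": int(m.group(1)), "8462-4": int(m.group(2)),
            "method": None}  # measurement method not captured in system #1

def from_system2(record: dict) -> dict:
    """System #2 stores BP as structured fields."""
    return {"patient": record["patient"], "date": record["date"],
            "8480-6": record["systolic"], "8462-4": record["diastolic"],
            "method": record["type"]}  # e.g., 'Arterial Line'

harmonized = [
    from_system1({"patient": "John S", "date": "2015-01-14",
                  "bp_text": "150/70 mmHg"}),
    from_system2({"patient": "Mary P", "date": "2015-01-14",
                  "systolic": 150, "diastolic": 70, "type": "Arterial Line"}),
]
print(harmonized)
```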
Standard coding systems used with harmonized EHR data
Conditions:
ICD 10 CM /
SNOMED CT
Labs, vitals,
reports -
observations:
LOINC
Medical
procedures:
CPT
Medications:
RxNORM
How many blood pressure measurement
types could there possibly be?
477 types of blood
pressure
measurements
The challenge of using “drug class”
The University of California Clinical Data Warehouse –
one of the largest OMOP CDM repositories today
The UCDW
has harmonized
data for
15M patients
in OMOP
The Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). A query sketch follows below.
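As a sketch of what querying harmonized OMOP data can look like: `person` and `measurement` (and the columns used) are standard OMOP CDM names, but the database file and the concept ID here are placeholders to verify against your own vocabulary tables.

```python
import sqlite3  # stand-in; real warehouses are typically Postgres/SQL Server

# Sketch: pull systolic BP measurements from OMOP CDM tables.
# The concept ID and database path below are placeholders for illustration.
SYSTOLIC_BP_CONCEPT_ID = 3004249  # example concept; verify in your vocabulary

query = """
SELECT p.person_id,
       p.year_of_birth,
       m.measurement_date,
       m.value_as_number AS systolic_bp
FROM person p
JOIN measurement m ON m.person_id = p.person_id
WHERE m.measurement_concept_id = ?
"""

conn = sqlite3.connect("omop_cdm.db")  # hypothetical local extract
for row in conn.execute(query, (SYSTOLIC_BP_CONCEPT_ID,)):
    print(row)
```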
The UC COVID Research Data Set (UC-CORDS)
• How Much Data? (Dec 2021)
– 687,239 patients
– 1,061,471,489 (>1.06 billion) observations
(lab results, vital signs)
– 23,243,659 clinical encounters
• Provided to UC Health researchers
through UC research informatics
units/groups in their respective health
systems
• Cohort
– All COVID tested patients (positive or
negative)
• De-identified to HIPAA LDS - no
personal identifiers
• Data
– Demographics, Diagnoses, Medications,
Laboratory results, Encounters (if seen by
our doctors or hospitalized)
• Refreshed
– Every 2 months
Learning to analyze very large data sets
Google BigTable (2004) -- managing “big data”
• The web circa 2000:
– 2 billion web pages
– 45 terabytes of data
– Contemporary relational databases (RDBMS) could not cope
• Needed to invent a new database
– Adam Bosworth and Jeff Dean
• Google BigTable
– Google File System (“GFS”) – a distributed file system
– Virtualizes a single database “table” across thousands of computers
– Stores data in a “column-oriented” database design (a data-model sketch follows below)
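A minimal sketch of BigTable's data model, which its designers describe as a sparse, sorted map from (row key, column, timestamp) to an uninterpreted byte value; the row key and cell values below are illustrative.

```python
from collections import defaultdict

# Sketch of BigTable's data model: a sparse map from
# (row key, column family:qualifier, timestamp) -> value.
# Rows are web pages keyed by reversed URL so pages from one site sort together.
table = defaultdict(dict)  # {row_key: {(column, timestamp): value}}

row_key = "com.example.www/index.html"  # reversed-domain row key (illustrative)
table[row_key][("contents:", 1041120000)] = b"<html>...</html>"
table[row_key][("anchor:cnnsi.com", 1041120000)] = b"CNN"

# Reading a cell: fetch the latest version of one column in one row.
versions = {ts: v for (col, ts), v in table[row_key].items() if col == "contents:"}
print(versions[max(versions)])
```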
“Map-Reduce” (Hadoop/Spark)
– analyzing large data sets efficiently and at scale
• Google needed to process large amounts of raw data, such as “crawled” (acquired) documents, web request logs, etc.
• Created a simple computational model called “map-reduce” which can process very large data sets
• Hadoop = an open-source implementation of a map-reduce execution engine (a word-count sketch follows below)
https://www.guru99.com/introduction-to-mapreduce.html
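The canonical word-count example below sketches the two phases on a single machine; in Hadoop or Spark, the map calls, the shuffle, and the reduce calls are distributed across many nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc: str):
    """Map: emit (word, 1) for every word in a document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(word: str, counts: list) -> tuple:
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

docs = ["big data in clinical research", "real world data"]

# Shuffle: group every emitted pair by its key (the word).
grouped = defaultdict(list)
for word, n in chain.from_iterable(map_phase(d) for d in docs):
    grouped[word].append(n)

print(dict(reduce_phase(w, ns) for w, ns in grouped.items()))
# e.g. {'big': 1, 'data': 2, 'in': 1, ...}
```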
Machine Learning
A form of artificial intelligence
• Identifies correlated patterns in order to make predictions
• Traditional “machine learning” – linear regression, logistic regression (a sketch follows below)
• Today’s “machine learning” – leverages artificial neural networks (ANNs)
• “Deep learning” – a neural network with many hidden layers and many nodal connections within them
– designed to work like the visual cortex... most mature in image analysis
– A convolutional neural network (CNN) is a typical deep learning network
Rashidi, et al. Academic Pathology. 2019
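A minimal sketch of “traditional” machine learning: a scikit-learn logistic regression fit on synthetic data (the features and outcome here are entirely made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two made-up features and a binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Fit the model and produce predicted probabilities for new patients.
model = LogisticRegression().fit(X, y)
print(model.predict_proba(X[:3]))  # probabilities for the first 3 rows
```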
“Machine Learning” and AI in Healthcare
• AI is not new to healthcare, but “machine learning” and “deep learning” are
new AI techniques enabled by large data sets.
https://mc.ai/awesome-ai-the-guide-to-master-artificial-intelligence/
Deep Networks -- “Convolutional” Neural Networks (CNN) began
as an approach to image recognition
• a type of multi-layered (deep) ANN designed for recognition of visual features – a “feedforward neural network whose architectures are tailored for minimizing the sensitivity to translations, local rotations and distortions of the input image”
• originally devised in 1989 by Yann LeCun at AT&T Bell Labs to recognize handwritten digits
• inspired by the connectivity pattern seen between neurons in the visual cortex
• The convolution operation emulates the response of an individual neuron to visual stimuli; pooling coalesces outputs from one layer into a single neuron in the next layer (a minimal sketch follows below)
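A minimal PyTorch sketch of the convolution/pooling pattern described above, sized for 28x28 grayscale images as in handwritten-digit recognition; the layer sizes are arbitrary choices for illustration, not LeCun's original architecture.

```python
import torch
import torch.nn as nn

# Convolution extracts local visual features, pooling coalesces neighboring
# outputs, and a final linear layer classifies into 10 digit classes.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1 input channel -> 8 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # 10 digit classes
)

logits = model(torch.randn(1, 1, 28, 28))  # one fake grayscale image
print(logits.shape)  # torch.Size([1, 10])
```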
Example: Deep Learning and Interpretation
of Retinal Images for diabetic retinopathy
• trained an ML system with 494,661 retinal images
• validated (tested) using a dataset of 71,896 images from 14,880 patients
• detection of vision-threatening diabetic retinopathy: AUC 0.958 (95% CI)
• detection of referable diabetic retinopathy: AUC 0.936 (95% CI)
(an AUC computation sketch follows below)
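For context on the AUC figures above, this is how an AUC is computed from model output: predicted probabilities are compared against reference-standard labels. The numbers below are invented for illustration and are not from the retinopathy study.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                # reference-standard labels (made up)
y_score = [0.1, 0.6, 0.8, 0.9, 0.2, 0.4]   # model probabilities (made up)
print(roc_auc_score(y_true, y_score))      # ~0.89; 1.0 = perfect discrimination
```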
ML-based Chest X-Ray Assistive Interpretation in COVID
Did it make a difference?
1 out of 5 felt it had an impact
30% felt it affected the treatment plan
Big Data in Drug Design:
“in-silico” drug design through structural
bioinformatics and computer-based simulation
http://www.vls3d.com/index.php/virtual-screening/comments-about-virtuel-screening?start=3
Computational Drug Discovery: in-silico simulation of drug, genome, and proteome interactions
Requires large-scale computing!
Supercomputers - massive parallel processing
• supercomputers split problems into pieces and execute them at the same time – massive parallel processing (a toy sketch follows the figures below)
• designed for mathematical computation, as found in simulations and optimization problems where many interdependencies exist
Current top supercomputers
Uses of supercomputers 1970-2020
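A toy illustration of the parallel-processing idea using Python's multiprocessing module: the per-candidate scoring function is a hypothetical stand-in for the expensive computation (e.g., a docking simulation) a supercomputer would spread across many thousands of cores.

```python
from multiprocessing import Pool

def score_candidate(candidate_id: int) -> float:
    """Placeholder for an expensive per-candidate computation."""
    return sum((candidate_id * i) % 97 for i in range(10_000)) / 10_000

if __name__ == "__main__":
    # Split the problem into pieces and score them simultaneously.
    with Pool(processes=8) as pool:
        scores = pool.map(score_candidate, range(100))
    print(max(scores))
```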
Computer Simulation and COVID
The Summit supercomputer was used to predict pharmacological therapeutic targets that interfere with SARS-CoV-2 spike protein binding to the ACE2 receptor on type 2 pneumocytes
Limits of classical computers in
simulation
Going beyond “classical” computers
• Having only two states (1 or 0) per “bit” creates limitations:
– The more states you want to represent at once (i.e., the more bits transmitted simultaneously), the more “wires” you need
– More wires mean more power and more heat, unless everything can be made smaller and more compact
• The Apple M1 CPU (8-core, 64-bit) has 16 billion transistors fabricated at a 5-nanometer (nm) “transistor gate length” (the smallest yet achieved)
– At some point, you reach a limit on what you can compute with a “classical” (binary) computer – simulation is where this bites first, as the sketch below shows
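A quick back-of-the-envelope sketch of why simulation hits this limit: exactly simulating n qubits on a classical machine requires storing 2^n complex amplitudes, so memory doubles with every added qubit.

```python
# Memory needed to store the full state vector of an n-qubit system,
# at 16 bytes per complex amplitude.
for n in (20, 30, 40, 50, 53):
    amplitudes = 2 ** n
    gib = amplitudes * 16 / 2**30
    print(f"{n} qubits -> {amplitudes:.3e} amplitudes, {gib:,.0f} GiB")
# 53 qubits (the size of the experiment below) needs ~128 PiB,
# far beyond any classical machine's memory.
```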
Investment in Quantum Computing
Have we reached “quantum supremacy”?
• A Google-designed quantum computer
was used in an experiment to perform
a calculation that would require a
classical supercomputer 10,000 years
to complete
• The calculation was to predict the
likelihood of outcomes from a random-
number generator (a problem first
crafted by Google physicists in 2016)
• The 53-qubit quantum computer
performed the calculation in 200
seconds
• Was it a contrived experiment?
Questions?
Mt Whitney - 14,505ft
Lone Pine, California (Feb 2021)
