Prof Mendel Singer Big Data Meets Public Health and Medicine 2018 12-22

Mendel E. Singer, PhD MPH
Associate Professor
Vice Chair for Education
Dept. of Population and Quantitative Health Sciences
Case School of Medicine
Case Western Reserve University
Cleveland, Ohio, USA

 PhD in Operations Research
• (Applied Math/Stat)
 Biostatistics
 Health Services Research
 Community and Public Health
• Did MPH program
 Work has ranged from mathematical modeling studies and
biostatistics, to claims data and EMR, to semi-structured
interviews and focus groups
 Heavy in education administration,
transitioning to more research

 Big data is a lot easier to use with consumer
behavior
• Tons of data, all usable and well categorized
▪ Manufacturer, cost, color, size, type of goods, etc…
• Data is mostly objective and easily measured and
tracked
• Interest drives purchasing
• Distance from data to profit is short, and easily and
quickly tested
 Methods well developed because of
business applications

 Reportable diseases
• Infectious disease tracking
 Surveillance, Epidemiology and End Results (SEER)
Program
• Details about the cancer (e.g. tumor size, location, histology)
• Details about treatment and mortality
• Linked to Medicare claims data
 Justice system
 Census data – sociodemographics by neighborhood
 Death certificates
 Emergency department visits for drug overdose
 Medical Claims Data (Medicare/Medicaid,
commercial)

 If the rate of fatal opioid overdose in my
home county was applied to the USA, it
would be the 3rd leading cause of death.
 Medical Examiner investigates each
accidental overdose.
• Toxicology – what drugs in system
• Data on injury (e.g. others present), history
(rehabilitation, incarceration, medical reason for
drug) and sociodemographics
 Public data – but only for those who died
 How do we identify those at risk?

 State prescription database
• Need identifiers; download one patient at a time
 State emergency dept. data for overdoses
• Need identifiers; download one patient at a time
• Where do identifiers come from?
 Incarceration data
• Access difficult for people alive
 Medicare/Medicaid claims data
• Long lag time; access issues to data
 Emergency medical services (EMS) data updated every 15
minutes
• Not available for researchers
 Toxicology data from medical examiner’s lab (untapped)
• Fatalities, police lab
• Potential for de-identified data
 Police records
• Also difficult for people who are alive

 ClearPath is the merging of electronic
health records from the 3 main
hospital/health systems in Northeast Ohio
• Cleveland Clinic
• University Hospitals
• MetroHealth Medical Center
 Nearly all hospitals in Northeast Ohio
(43/45)
 All integrated health care delivery systems
• Full service outpatient, urgent care, emergency,
hospital

Almost 5 years in ….
Data from one system is almost ready
Many complications in actually getting
the data, process dragging
• Lots of delays
• Many levels of approval (lots of legal folks
involved)
• Turnover at hospitals
• Concern over being compared to other hospitals
• Benefit to hospital not sufficiently clear

All hospitalizations for most of the
patients
• And lots of outpatient data
Out of system use
• Not all physicians in one of these systems
• Data on Rx’s written, but not Rx’s filled
Different electronic health record
systems means data not coded
consistently across the health systems

 Social Media
• Individual postings, blogs, organization page
 Internet Searching
• Tracking outbreaks
• Side-effects/complications of treatment
 Grocery store purchases
• Effectiveness of programs to increase fruit and vegetable
consumption
▪ E.g. changes in displays, in-store marketing
▪ Prescription produce programs
 Wearable electronic devices, e.g. FitBit
 Sensors – detect movement and electricity
 Continuous data often results in substantial
computing and storage issues

 Insurance billing has been the driving force
behind computerized medical records in the U.S.
• Ease of filing
• Maximizing reimbursement
• Minimizing time to reimbursement (e.g. minimizing
rejection of claims)
 Track utilization of services
 Link outpatient, inpatient, prescription drug data
 Diagnosis and procedure codes
 Reimbursement (proxy for cost)
 Data available for purchase

 Advantages – Know what is being done
• Record of service utilization, including type, location,
reimbursement
• Diagnosis codes – for creating and following cohorts
• Hospital discharge and DRG codes
 Disadvantages – Don’t know details, limited
outcomes
• Tests are done – don’t know the result
• No knowledge of physician exam
• Don’t know about symptoms
• Crude proxies for severity (e.g. hospitalization,
multi-drug therapy)
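
A minimal sketch of how diagnosis codes in claims data might be used to create a cohort and follow its utilization. The record layout (`patient_id`, `dx_code`, `service_type`, `paid`) and the sample rows are hypothetical; N17 is the ICD-10 category for acute kidney failure and is used here only as an illustration.

```python
from collections import defaultdict

# Hypothetical claim lines; real claims files carry many more fields.
claims = [
    {"patient_id": "A", "dx_code": "N17.9", "service_type": "inpatient", "paid": 12000.0},
    {"patient_id": "A", "dx_code": "I10", "service_type": "outpatient", "paid": 150.0},
    {"patient_id": "B", "dx_code": "I10", "service_type": "outpatient", "paid": 90.0},
    {"patient_id": "C", "dx_code": "N17.0", "service_type": "inpatient", "paid": 8000.0},
    {"patient_id": "C", "dx_code": "N17.0", "service_type": "outpatient", "paid": 200.0},
]


def build_cohort(claims, dx_prefix):
    """Patients with at least one claim whose diagnosis code starts with dx_prefix."""
    return {c["patient_id"] for c in claims if c["dx_code"].startswith(dx_prefix)}


def utilization(claims, cohort):
    """Total reimbursement (a proxy for cost) per cohort patient and service type."""
    totals = defaultdict(float)
    for c in claims:
        if c["patient_id"] in cohort:
            totals[(c["patient_id"], c["service_type"])] += c["paid"]
    return dict(totals)


cohort = build_cohort(claims, "N17")
summary = utilization(claims, cohort)
```

The output shows what was done and what was paid, but nothing about test results, symptoms, or the physician exam, which is the limitation listed above.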

 Benefits
• Utilization
• Test results, not just those taken
• Tests ordered but not taken
• Private as well as public insurance
• Many hospitals use the same system
▪ E.g. Epic is a very popular system in US Hospitals and
health care systems
• MIMIC database – free to the public
▪ De-identified intensive care unit (ICU) records
▪ 50,000 ICU stays over 12 years
▪ Has all chart events, test results
▪ Other data also available – physionet.org

Challenges
• Physician notes and text reports – other free text
fields
▪ Fields often avoided – but that’s the largely untapped
potential
• Information on Rx’s written, but not those filled
• Few HMOs left in US
▪ Kaiser downsized tremendously
▪ HMOs in Israel, but lots of inpatient out-of-system use
• Lack of connectedness – out of system use
▪ Doctors in different health care systems (or private
practice)

 Difficulty connecting better care to profits
• Value-based care a step in the right direction (fee for
service is bad business)
• Profit drives the software and analysis
• How to document/market better care and outcomes?
▪ Public hasn’t differentiated well between systems
▪ Failure of Report Cards
 Each clinical problem needs to be addressed
separately
• Can’t use a single algorithm with unique parameters,
like internet store
• Best data mining will be specific to institution
 The insurer is the one with the largest profit motive
▪ Better for HMO where insurer=provider

Data access
Variability across institutions
Inconsistency within systems

 Clinicians wary of computer guys treating their patients
 Americans and their docs don’t want anyone else telling
them how to practice
• “You need evidence, but I know” (a doc once told me this)
 Fear of being replaced by computers, not just aided by them
• Some specialties are at risk of being downsized
• Image recognition very advanced - already being used for cancer
detection
 Many data scientists have no concept of clinical impact
• Get excited about incredibly small increase in accuracy
• May not make the effort to understand clinicians
 Data scientists’ tendency to believe you really don’t have
to know anything about the problem
• Data will tell us, without us getting in the way
• IBM Watson Health – algorithm is great, but need specific knowledge
for most applications. Layoffs.
• Algorithms without clinical input work for some things
• Often requires clinician conceptualization of variables
▪ Based on how they think in practice

 Categorize data
• Effects often aren’t linear
• Easier to interpret and apply
▪ Buckets of patients by severity, risk
• Labels aligned to treatment guidelines
• Consistency across studies
 Categories arbitrary, then evidence fixes the
categories (nearly forever)
• Data can guide the creation of categories
 Continuous data – no loss of information
• Think more in terms of probabilities, not risk group
• Ways of modeling non-linear effects
• Software has handled this easily for years
▪ Done more in biological applications
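
A small sketch of the trade-off above, contrasting a bucketed variable with a continuous one. The cut points and labels are hypothetical and do not come from any treatment guideline. The linear spline terms are one common way to model a non-linear effect without discarding information: the slope is allowed to change at each knot.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, chosen only for illustration.
CUTS = [120, 140, 160]
LABELS = ["low", "moderate", "high", "very high"]


def categorize(value):
    """Map a continuous measurement to a bucket label."""
    return LABELS[bisect_right(CUTS, value)]


def linear_spline_terms(value, knots=CUTS):
    """Keep the continuous value and add one hinge term per knot.

    A regression on these terms is piecewise linear, so the effect can
    bend at each knot instead of jumping between categories.
    """
    return [value] + [max(0, value - k) for k in knots]


values = [139, 141, 159]
buckets = [categorize(v) for v in values]
terms = [linear_spline_terms(v) for v in values]
```

With these cut points, 139 and 141 land in different buckets although they differ by 2 units, while 141 and 159 share a bucket although they differ by 18. The spline terms keep those distances.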

Clinicians think binary
• Have disease or not
Continuous variables

 Physician note fields largely useless
• Free text – not categorized
• Using it would be subjective
• Not practical – labor intensive
• Wouldn’t be reproducible
• Symptom not present OR not checked
• Wide practice variation in the visit
 Natural Language Processing (NLP)
• Despite above problems, it works great on
physician notes.
• Validates reasonably well across institutions
• Very well developed in English
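
A toy sketch of one thing NLP on physician notes has to handle: telling a negated mention from a positive one, and recognizing that an unmentioned symptom may be "not present OR not checked". The cue list and the 30-character window are assumptions made for illustration; real clinical NLP systems use far richer rules or trained models.

```python
import re

# Hypothetical negation cues, for illustration only.
NEGATION_CUES = r"(?:no|denies|without|negative for)"


def symptom_status(note, symptom):
    """Return 'present', 'absent', or None when the note never mentions the symptom."""
    text = note.lower()
    term = re.escape(symptom.lower())
    if not re.search(rf"\b{term}\b", text):
        # Not mentioned: cannot tell "not present" from "not checked".
        return None
    # Negated when a cue appears shortly before the symptom in the same sentence.
    negated = re.search(rf"\b{NEGATION_CUES}\b[^.]{{0,30}}?\b{term}\b", text)
    return "absent" if negated else "present"


note = "Patient denies chest pain. Reports nausea since Tuesday."
statuses = {s: symptom_status(note, s) for s in ["chest pain", "nausea", "fever"]}
```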

 Large number of missing values
makes variable useless
 Might be systematically missing
 Imputation methods help to a point
 Machine learning can deal with
missing values
 It can work in practice
 If variable reliably predicts bad
outcomes, who cares if it’s missing for
most?
• E.g. Rovsing’s sign in appendicitis
▪ If present, always appendicitis. But found in only 1/10,000

Concern with overfitting models
Need many observations per
variable in the model
• # Observations >> # Variables
Machine learning can work when
• # Variables > # Observations

Natural Language Processing for:
• Analyzing open-ended questions
• Analyzing text reports (police, criminal justice,
medical records)
Voice Assistant (Siri, Alexa) for
conducting interviews
• Trained to respond to voice commands
• Conduct semi-structured interviews!

Current Value
Mean
Minimum
Maximum
Range
Standard Deviation
Change from 1st value
# consecutive increases, decreases
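
A minimal sketch of how the summary measures listed above could be computed from one patient's repeated measurements. The readings are hypothetical, and "# consecutive increases, decreases" is interpreted here as the longest run of each, which is an assumption; the original may have counted them differently.

```python
from statistics import mean, stdev


def longest_run(values, direction):
    """Longest run of consecutive increases (direction=1) or decreases (direction=-1)."""
    best = current = 0
    for prev, nxt in zip(values, values[1:]):
        if (nxt - prev) * direction > 0:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


def summarize(values):
    """Collapse a series of measurements into fixed summary features."""
    return {
        "current": values[-1],
        "mean": mean(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "std_dev": stdev(values) if len(values) > 1 else 0.0,
        "change_from_first": values[-1] - values[0],
        "consecutive_increases": longest_run(values, 1),
        "consecutive_decreases": longest_run(values, -1),
    }


# Hypothetical readings for one clinical variable, oldest first.
readings = [10, 12, 15, 14, 18]
features = summarize(readings)
```

Summaries like these turn a variable-length series into a fixed set of columns, which is what most classifiers expect as input.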

MIMIC III – De-Identified ICU EMR
Patients admitted to hospital for kidney
injury/kidney failure
Data at 12 hours after admission to the
intensive care unit (ICU)
4 Clinical variables, age and sex
Outcomes:
• Mortality
• Dialysis
• Ventilator

 Compared Traditional Logistic Regression
with:
• Logistic with LASSO
• Classification Trees
• Random Forest
 Random Forest did much better than the
others – quite well with few measures
• Area Under Curve ~ .9
 Most common summary measures:
• Current Values
• # consecutive increases, decreases
 Some models used range, mean, std dev.,
change
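
The comparison above is reported as Area Under Curve. As a sketch of what that number measures: AUC is the probability that a randomly chosen patient with the outcome gets a higher predicted risk than a randomly chosen patient without it, with ties counted as half. The labels and scores below are hypothetical and are not taken from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = event) and predicted risks.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auc(labels, scores)
```

An AUC of 0.5 corresponds to guessing and 1.0 to perfect ranking, so a value near .9 means the model ranks about nine of every ten such pairs correctly.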

What data sources can work?
How do you identify drug addicts?
How do we know if they have mental
health challenges?
How can you determine if they have
received mental health treatment?
What are appropriate outcomes?
How can you determine their outcomes?
What are the obstacles and limitations?

How do you identify drug addicts?
• Police/criminal justice, rehabilitation center
How do we know if they have mental
health challenges?
• Electronic medical records, Prescriptions
• Psychiatric hospitals
How can you determine if they have
received mental health treatment?
• Health care claims data
• Electronic medical records, Prescriptions

 What are appropriate outcomes?
• Mental health visits
• Adherence with medication (Rx refills)
• Psychiatric hospitalization
• Suicide attempts
 How can you determine their outcomes?
• Electronic medical records
• Rx refills
 What are the obstacles and limitations?
• Access to various data sets
▪ Any data, getting identifiers
▪ Linking to HMO data
▪ Funding for study – including EMR data
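
A minimal sketch of one way medication adherence could be estimated from Rx refill records, using proportion of days covered (PDC). PDC is a measure chosen here for illustration; the slides do not name a specific one. The fill dates and days of supply are hypothetical, and this simple version does not shift overlapping fills forward as some PDC definitions do.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, start, end):
    """Share of days in [start, end] on which the patient had medication on hand.

    fills: list of (fill_date, days_supply) tuples.
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day <= end:
                covered.add(day)  # a set, so overlapping fills are not double counted
    total_days = (end - start).days + 1
    return len(covered) / total_days


# Hypothetical refill history: two 30-day fills with a gap between them.
fills = [(date(2018, 1, 1), 30), (date(2018, 2, 15), 30)]
pdc = proportion_of_days_covered(fills, date(2018, 1, 1), date(2018, 3, 31))
```

This needs data on prescriptions filled, not just written, which is the gap in EMR data noted earlier.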

Questions?
Mendel Singer, PhD MPH
mendel@case.edu
