Mendel E. Singer, PhD MPH
Associate Professor
Vice Chair for Education
Dept. of Population and Quantitative Health Sciences
Case School of Medicine
Case Western Reserve University
Cleveland, Ohio, USA
 PhD in Operations Research
• (Applied Math/Stat)
 Biostatistics
 Health Services Research
 Community and Public Health
• Did MPH program
 Have done everything from mathematical modeling studies and biostatistics on claims data and EMR, to semi-structured interviews and focus groups
 Heavily involved in education administration, transitioning to more research
 Big data is a lot easier to use with consumer behavior
• Tons of data, all usable and well categorized
 Manufacturer, cost, color, size, type of goods, etc…
• Data is mostly objective and easily measured and
tracked
• Interest drives purchasing
• Distance from data to profit is short, and easily and
quickly tested
 Methods well developed because of
business applications
 Reportable diseases
• Infectious disease tracking
 Surveillance, Epidemiology and End Results (SEER)
Program
• Details about the cancer (e.g. tumor size, location, histology)
• Details about treatment and mortality
• Linked to Medicare claims data
 Justice system
 Census data – sociodemographics by neighborhood
 Death certificates
 Emergency department visits for drug overdose
 Medical Claims Data (Medicare/Medicaid,
commercial)
 If the rate of fatal opioid overdose in my home county were applied to the USA, it would be the 3rd leading cause of death.
 Medical Examiner investigates each
accidental overdose.
• Toxicology – what drugs in system
• Data on injury (e.g. others present), history
(rehabilitation, incarceration, medical reason for
drug) and sociodemographics
 Public data – but only for those who died
 How do we identify those at risk?
 State prescription database
• Need identifiers; download one patient at a time
 State emergency dept. data for overdoses
• Need identifiers; download one patient at a time
• Where do identifiers come from?
 Incarceration data
• Access difficult for people who are alive
 Medicare/Medicaid claims data
• Long lag time; access issues to data
 Emergency medical services (EMS) data updated every 15
minutes
• Not available for researchers
 Toxicology data from medical examiner’s lab (untapped)
• Fatalities, police lab
• Potential for de-identified data
 Police records
• Also difficult for people who are alive
 ClearPath is the merging of electronic
health records from the 3 main
hospital/health systems in Northeast Ohio
• Cleveland Clinic
• University Hospitals
• MetroHealth Medical Center
 Nearly all hospitals in Northeast Ohio
(43/45)
 All integrated health care delivery systems
• Full service outpatient, urgent care, emergency,
hospital
Almost 5 years in ….
Data from one system is almost ready
Many complications in actually getting
the data, process dragging
• Lots of delays
• Many levels of approval (lots of legal folks
involved)
• Turnover at hospitals
• Concern over being compared to other hospitals
• Benefit to hospital not sufficiently clear
All hospitalizations for most of the
patients
• And lots of outpatient data
Out of system use
• Not all physicians in one of these systems
• Data on Rx’s written, but not Rx’s filled
Different electronic health record systems mean data are not coded consistently across the health systems
 Social Media
• Individual postings, blogs, organization page
 Internet Searching
• Tracking outbreaks
• Side-effects/complications of treatment
 Grocery store purchases
• Effectiveness of programs to increase fruit and vegetable
consumption
 E.g. changes in displays, in-store marketing
 Prescription produce programs
 Wearable electronic devices, e.g. FitBit
 Sensors – detect movement and electricity
 Continuous data often results in substantial
computing and storage issues
 Insurance billing has been the driving force
behind computerized medical records in the U.S.
• Ease of filing
• Maximizing reimbursement
• Minimizing time to reimbursement (e.g. minimizing
rejection of claims)
 Track utilization of services
 Link outpatient, inpatient, prescription drug data
 Diagnosis and procedure codes
 Reimbursement (proxy for cost)
 Data available for purchase
 Advantages – Know what is being done
• Record of service utilization, including type, location,
reimbursement
• Diagnosis codes – for creating and following cohorts (see the sketch after this list)
• Hospital discharge and DRG codes
 Disadvantages – Don’t know details, limited
outcomes
• Tests are done – don’t know the result
• No knowledge of physician exam
• Don’t know about symptoms
• Crude proxies for severity (e.g. hospitalization,
multi-drug therapy)
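A minimal sketch of the cohort idea above, assuming a hypothetical claims extract with patient_id, service_date, and icd10_code columns (the layout, column names, and code prefix are illustrative, not from any specific payer):

```python
import pandas as pd

# Hypothetical claims extract; the column names are illustrative only.
claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 4],
    "service_date": pd.to_datetime(
        ["2018-01-05", "2018-03-10", "2018-02-01",
         "2018-01-20", "2018-06-15", "2018-04-02"]),
    "icd10_code": ["E11.9", "I10", "E11.65", "J45.909", "E11.9", "I10"],
})

# Define the cohort by diagnosis code prefix (here, ICD-10 E11.* = type 2 diabetes).
is_t2dm = claims["icd10_code"].str.startswith("E11")

# The first qualifying claim per patient becomes the cohort entry (index) date.
cohort = (claims[is_t2dm]
          .groupby("patient_id", as_index=False)["service_date"]
          .min()
          .rename(columns={"service_date": "index_date"}))
print(cohort)
```

The same cohort table can then be joined back to later claims to follow utilization over time.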
 Benefits
• Utilization
• Test results, not just a record that tests were done
• Tests ordered but not completed
• Private as well as public insurance
• Many hospitals use the same system
 E.g. Epic is a very popular system in US hospitals and health care systems
• MIMIC database – free to the public
 De-identified intensive care unit (ICU) records
 50,000 ICU stays over 12 years
 Has all chart events, test results
 Other data also available – physionet.org
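A rough sketch of getting started with the MIMIC data mentioned above; it assumes you have completed the credentialing on physionet.org and downloaded the MIMIC-III CSV tables locally (the ICUSTAYS table and its INTIME/OUTTIME columns follow the MIMIC-III documentation, but check them against your download):

```python
import pandas as pd

# Assumes the MIMIC-III CSVs were downloaded after credentialing on physionet.org.
icustays = pd.read_csv("ICUSTAYS.csv", parse_dates=["INTIME", "OUTTIME"])

# Length of stay per ICU admission, in days.
icustays["los_days"] = (
    (icustays["OUTTIME"] - icustays["INTIME"]).dt.total_seconds() / 86400
)

print("ICU stays:", len(icustays))
print(icustays["los_days"].describe())
```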
Challenges
• Physician notes and text reports – other free text
fields
 Fields often avoided – but that’s the largely untapped
potential
• Information on Rx’s written, but not those filled
• Few HMOs left in the US
 Kaiser downsized tremendously
 HMOs in Israel, but lots of inpatient out-of-system use
• Lack of connectedness – out of system use
 Doctors in different health care systems (or private
practice)
 Difficulty connecting better care to profits
• Value-based care a step in the right direction (fee for
service is bad business)
• Profit drives the software and analysis
• How to document/market better care and outcomes?
 Public hasn’t differentiated well between systems
 Failure of Report Cards
 Each clinical problem needs to be addressed
separately
• Can’t use a single algorithm with problem-specific parameters, the way an internet store can
• Best data mining will be specific to institution
 The insurer is the one with the largest profit motive
 Better for HMO where insurer=provider
Data access
Variability across institutions
Inconsistency within systems
 Clinicians wary of computer guys treating their patients
 Americans and their docs don’t want anyone else telling
them how to practice
• “You need evidence, but I know” (a doc once told me this)
 Fear of being replaced by computers, not just aided by them
• Some specialties are at risk of being downsized
• Image recognition is very advanced – already being used for cancer detection
 Many data scientists have no concept of clinical impact
• Get excited about incredibly small increase in accuracy
• May not make the effort to understand clinicians
 Data scientists’ tendency to believe you really don’t have to know anything about the problem
• Data will tell us, without us getting in the way
• IBM Watson Health – algorithm is great, but need specific knowledge
for most applications. Layoffs.
• Algorithms without clinical input work for some things
• Often requires clinician conceptualization of variables
 Based on how they think in practice
 Categorize data
• Effects often aren’t linear
• Easier to interpret and apply
 Buckets of patients by severity, risk
• Labels aligned to treatment guidelines
• Consistency across studies
 Categories start out arbitrary; then the evidence built on them fixes the categories in place (nearly forever)
• Data can guide the creation of categories
 Continuous data – no loss of information
• Think more in terms of probabilities, not risk group
• Ways of modeling non-linear effects
• Software has handled this easily for years
 Done more in biological applications
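A small sketch contrasting the two approaches above on synthetic data: binning a continuous predictor into categories versus keeping it continuous and letting splines capture the non-linear effect (scikit-learn ≥ 1.0 is assumed for SplineTransformer; the data and the U-shaped age effect are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, SplineTransformer

rng = np.random.default_rng(0)
age = rng.uniform(20, 90, size=2000).reshape(-1, 1)
# Synthetic outcome with a non-linear (U-shaped) effect of age.
logit = 0.002 * (age.ravel() - 60) ** 2 - 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Option 1: categorize age into buckets, then run logistic regression.
binned = make_pipeline(KBinsDiscretizer(n_bins=4, encode="onehot-dense"),
                       LogisticRegression(max_iter=1000))

# Option 2: keep age continuous and model the non-linearity with splines.
splined = make_pipeline(SplineTransformer(degree=3, n_knots=5),
                        LogisticRegression(max_iter=1000))

for name, model in [("binned", binned), ("spline", splined)]:
    auc = cross_val_score(model, age, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:7s} cross-validated AUC = {auc:.3f}")
```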
Clinicians think binary
• Have disease or not
Continuous variables
 Physician note fields largely useless
• Free text – not categorized
• Using it would be subjective
• Not practical – labor intensive
• Wouldn’t be reproducible
• Symptom not present OR not checked
• Wide practice variation in the visit
 Natural Language Processing (NLP)
• Despite the above problems, it works great on physician notes
• Validates reasonably well across institutions
• Very well developed in English
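A minimal sketch of the kind of NLP that can be pointed at free-text notes: a bag-of-words classifier over short text snippets. The snippets and labels below are invented for illustration; real physician notes would require de-identified data and a far more careful pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented note snippets; label = note indicates current smoking (1) or not (0).
notes = [
    "patient reports smoking one pack per day",
    "denies tobacco use, no alcohol",
    "former smoker, quit 10 years ago",
    "continues to smoke despite counseling",
    "no history of tobacco or drug use",
    "smokes occasionally on weekends",
]
labels = [1, 0, 0, 1, 0, 1]

# TF-IDF features over unigrams and bigrams, then a logistic classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(notes, labels)

# Classify a new snippet.
print(clf.predict(["patient is a current smoker, half a pack daily"]))
```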
 Large number of missing values
makes variable useless
 Might be systematically missing
 Imputation methods help to a point
 Machine learning can deal with
missing values
 It can work in practice
 If a variable reliably predicts bad outcomes, who cares if it’s missing for most?
• E.g. Rovsing’s sign in appendicitis
 If present, always appendicitis – but seen in only 1 in 10,000
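A brief sketch of the two routes above (imputation versus machine-learning methods that tolerate missingness), on synthetic data where a strong predictor is missing for ~90% of patients: explicit mean imputation before logistic regression, versus a gradient-boosted tree model (scikit-learn's HistGradientBoostingClassifier) that accepts NaN values directly. The data are simulated; nothing here reproduces a real study.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n = 3000
x1 = rng.normal(size=n)            # fully observed predictor
x2 = rng.normal(size=n)            # strong predictor that will be mostly missing
y = rng.binomial(1, 1 / (1 + np.exp(-(x1 + 2 * x2))))

# Make x2 missing for ~90% of patients (like a rarely documented sign).
x2 = np.where(rng.random(n) < 0.9, np.nan, x2)
X = np.column_stack([x1, x2])

# Route 1: mean-impute the gaps, then fit logistic regression.
imputed_lr = make_pipeline(SimpleImputer(strategy="mean"),
                           LogisticRegression(max_iter=1000))

# Route 2: boosted trees handle NaN natively, no imputation step needed.
boosted = HistGradientBoostingClassifier(random_state=0)

for name, model in [("impute + logistic", imputed_lr), ("boosted trees", boosted)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:18s} cross-validated AUC = {auc:.3f}")
```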
Concern with overfitting models
Need many observations per
variable in the model
• # Observations >> # Variables
Machine learning can work when
• #Variables > # Observations
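A small sketch of the #Variables > #Observations point: with simulated data (100 observations, 500 variables, only 5 of them informative), an L1-penalized ("LASSO-style") logistic regression still yields a usable, sparse model where an unpenalized fit would simply overfit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_obs, n_vars = 100, 500                  # far more variables than observations
X = rng.normal(size=(n_obs, n_vars))
# Only the first 5 variables actually influence the outcome.
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, :5].sum(axis=1))))

# L1 penalty shrinks most coefficients to exactly zero.
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
auc = cross_val_score(lasso_logit, X, y, cv=5, scoring="roc_auc").mean()

lasso_logit.fit(X, y)
n_selected = int(np.sum(lasso_logit.coef_ != 0))
print(f"cross-validated AUC = {auc:.3f}, "
      f"non-zero coefficients = {n_selected} of {n_vars}")
```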
Natural Language Processing for:
• Analyzing open-ended questions
• Analyzing text reports (police, criminal justice,
medical records)
Voice Assistant (Siri, Alexa) for
conducting interviews
• Trained to respond to voice commands
• Conduct semi-structured interviews!
Current Value
Mean
Minimum
Maximum
Range
Standard Deviation
Change from 1st value
# consecutive increases, decreases
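The list above reads as a feature-engineering recipe for repeated measurements (vitals, labs, wearable readings). A minimal pandas sketch for a single patient's series follows; interpreting "# consecutive increases/decreases" as the length of the run ending at the most recent value is my assumption.

```python
import numpy as np
import pandas as pd

# One patient's repeated measurements (e.g. a lab value), in time order.
values = pd.Series([1.1, 1.3, 1.2, 1.6, 1.9, 1.8])
diffs = values.diff().dropna().to_numpy()

def trailing_run(steps, sign):
    # Length of the run of consecutive increases (sign=+1) or decreases (sign=-1)
    # ending at the most recent measurement.
    run = 0
    for step in steps[::-1]:
        if np.sign(step) == sign:
            run += 1
        else:
            break
    return run

features = {
    "current":          values.iloc[-1],
    "mean":             values.mean(),
    "minimum":          values.min(),
    "maximum":          values.max(),
    "range":            values.max() - values.min(),
    "std_dev":          values.std(),
    "change_from_1st":  values.iloc[-1] - values.iloc[0],
    "consec_increases": trailing_run(diffs, +1),
    "consec_decreases": trailing_run(diffs, -1),
}
print(features)
```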
MIMIC III – De-Identified ICU EMR
Patients admitted to hospital for kidney
injury/kidney failure
Data at 12 hours after admission to the
intensive care unit (ICU)
4 clinical variables, plus age and sex
Outcomes:
• Mortality
• Dialysis
• Ventilator
 Compared Traditional Logistic Regression
with:
• Logistic with LASSO
• Classification Trees
• Random Forest
 Random Forest did much better than the
others – quite well with few measures
• Area Under the Curve (AUC) ≈ 0.9
 Most common summary measures:
• Current Values
• # consecutive increases, decreases
 Some models used range, mean, std dev.,
change
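A sketch of that kind of model comparison, using simulated data in place of the MIMIC extract (the variables, sample, and AUC values here do not reproduce the study; they only show the mechanics of scoring each model by cross-validated AUC):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 6))       # stand-ins for a few clinical variables + age, sex
# Simulated outcome with an interaction, which tends to favor tree-based methods.
logit = X[:, 0] + X[:, 1] - 1.5 * (X[:, 2] > 0) * (X[:, 3] > 0)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "logistic + LASSO":    LogisticRegression(penalty="l1", solver="liblinear"),
    "classification tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest":       RandomForestClassifier(n_estimators=300, random_state=0),
}

for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:20s} cross-validated AUC = {auc:.3f}")
```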
Clearly Not Israelis
What data sources can work?
How do you identify drug addicts?
How do we know if they have mental
health challenges?
How can you determine if they have
received mental health treatment?
What are appropriate outcomes?
How can you determine their outcomes?
What are the obstacles and limitations?
What data sources can work?
How do you identify drug addicts?
• Police/criminal justice, rehabilitation center
How do we know if they have mental
health challenges?
• Electronic medical records, Prescriptions
• Psychiatric hospitals
How can you determine if they have
received mental health treatment?
• Health care claims data
• Electronic medical records, Prescriptions
What data sources can work?
 What are appropriate outcomes?
• Mental health visits
• Adherence with medication (Rx refills) – see the PDC sketch after this list
• Psychiatric hospitalization
• Suicide attempts
 How can you determine their outcomes?
• Electronic medical records
• Rx refills
 What are the obstacles and limitations?
• Access to various data sets
 For any data, getting identifiers
 Linking to HMO data
 Funding for study – including EMR data
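As one concrete example of the outcomes above, adherence with medication can be estimated from refill records as the proportion of days covered (PDC). A small sketch with made-up refill data; the column names and the 180-day window are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up refill records for one patient.
fills = pd.DataFrame({
    "fill_date": pd.to_datetime(
        ["2018-01-01", "2018-02-05", "2018-03-20", "2018-05-15"]),
    "days_supply": [30, 30, 30, 30],
})

start = fills["fill_date"].min()
window_days = 180                      # observation window (illustrative)

# Mark each calendar day in the window covered by at least one fill.
covered = np.zeros(window_days, dtype=bool)
for _, row in fills.iterrows():
    first = (row["fill_date"] - start).days
    covered[first:first + row["days_supply"]] = True

pdc = covered.mean()
print(f"Proportion of days covered (PDC) = {pdc:.2f}")
```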
Out of time ….
Questions?
Mendel Singer, PhD MPH
mendel@case.edu
