Apache Spark (architecture diagram: HDFS, a master and slave nodes, a driver holding an RDD, and in-memory data partitions...)
Weighted k-NN for Regression (diagram: each node runs k-NN on its own data partition, giving kNN1 and kNN2; the 2k candidates are reduced to a final k-NN list that yields the predicted cost...)
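The partitioned k-NN diagram can be read as a two-stage search. The following is a minimal Python sketch under that reading, not the authors' Spark implementation. The function names, the Euclidean distance, and the inverse-distance weighting are assumptions, since the slide names only the method.

```python
import heapq
import math


def local_knn(partition, query, k):
    """Return the k nearest (distance, cost) pairs found in one data partition."""
    return heapq.nsmallest(k, ((math.dist(x, query), cost) for x, cost in partition))


def weighted_knn_predict(partitions, query, k, eps=1e-9):
    """Merge the local candidate lists, keep the k nearest overall, and
    return the inverse-distance-weighted mean of their costs."""
    candidates = [pair for part in partitions for pair in local_knn(part, query, k)]
    nearest = heapq.nsmallest(k, candidates)
    weights = [1.0 / (distance + eps) for distance, _ in nearest]
    weighted_sum = sum(w * cost for w, (_, cost) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Hypothetical data: each record is (feature vector, cost in dollars).
partitions = [
    [((0.0,), 10.0), ((10.0,), 100.0)],
    [((1.0,), 20.0), ((11.0,), 200.0)],
]
prediction = weighted_knn_predict(partitions, (0.5,), k=2)
```

Each partition only ever ships k candidates, so the merge step handles at most k times the number of partitions, regardless of how large the partitions are.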
Rough Set
• Rough set theory is an ML framework that is especially suitable for information systems with inconsistencies....
Fuzzy Rough Set
• Uses fuzzy logic to handle continuous attributes.
• Similarity matrix contains values betw...
Fuzzy Rough Set
• Let $r_{j,i}$ be the degree of similarity of instances i and j.
• Let $c_i$ be the degree to which instance i ...
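Fuzzy Rough Set: upper approximation of instance j
$\max\{\min(r_{j,i}, c_i) \mid i = 1,\dots,n\}$
Fuzzy Rough Set: lower approximation of instance j
$\min\{\max(1 - r_{j,i}, c_i) \mid i = 1,\dots,n\}$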
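The two max/min expressions on these slides can be evaluated directly. This is a small Python sketch, assuming the similarity values are held in a dense n-by-n list of lists and the class memberships in a list. The function names and the toy data are illustrative and do not come from the talk.

```python
def upper_approximation(similarity, membership):
    """For each instance j: max over i of min(r[j][i], c[i])."""
    return [max(min(r, c) for r, c in zip(row, membership)) for row in similarity]


def lower_approximation(similarity, membership):
    """For each instance j: min over i of max(1 - r[j][i], c[i])."""
    return [min(max(1.0 - r, c) for r, c in zip(row, membership)) for row in similarity]


# Hypothetical two-instance example: instance 0 belongs to the class, instance 1 does not.
similarity = [[1.0, 0.5],
              [0.5, 1.0]]
membership = [1.0, 0.0]

upper = upper_approximation(similarity, membership)
lower = lower_approximation(similarity, membership)
```

Each row of the result depends only on one row of the similarity matrix, which is what makes the computation easy to split across compute nodes.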
Implementation
• The construction of the similarity matrix can be done in a parallel manner, making each of K compute node...
Implementation: Lower Approximation and Upper Approximation, Spark vs MPI (Fuzzy Rough Set)
Readmission  Application
• Android
• Windows  Phone
• Patient  View
• what  is  my  risk
• Doctor  View  
• who  are  my  ...
http://healthscope.cloudapp.net/hscope-dev/aco/
Healthcare Scalable COst Prediction Engine (HealthSCOPE)
0.6  AUC
Yale  Model
(Baseline)
76
Milestones:  Readmission  Risk
0.64  AUC
UW  2012  
Result
Ensemble  
method,  
Hierarc...
Problem Exploration
Milestones:  Cost  Prediction
H-­‐SCOPE  I
SID  Data
June  2014
H-­‐SCOPE  IV
SID  +  MEPS  
dat...
78
AUC  – Accuracy  measure  
(Area  Under  Curve)
2012
78
Milestones:  Merging  Threads
2016  and  beyond2013 2014 2015
R...
Our  Sincere  Thanks for  Your  Support!

Societal Impact of Applied Data Science on the Big Data Stack


Data availability should ideally improve accountability and decision processes. With evidence of data science working across multiple domains, from healthcare analytics to internet advertising, big data is enabling changes in society, one application at a time. This talk has two parts. We first present a data scientist's overview of the different technologies in use today and their utility.

We then do a deep dive on the specific implementations and challenges we addressed while working with multiple partners in the healthcare industry on real-world healthcare data. We discuss and demonstrate prototypes of our solutions for cost prediction and risk-of-readmission care management, and how we leveraged big data machine learning frameworks. We end with an open conversation about challenges in verticals other than healthcare and provide an overview of ongoing efforts for social good at the University of Washington Center for Data Science, each a story in its own right.

Published in: Data & Analytics


  1. 1. Product  Decisions through Big  Data Center  for  Data  Science Ankur  Teredesai University  of  Washington  Tacoma 1 March  14th,  2015
  2. 2. • Bioinformatics • Health  and   Wellness • Predictive  Analytics Health   Informatics • Distributed  Systems • Databases • Geo-­‐Spatial • Embedded  Systems Geo-­‐Spatial  Data   Management • Machine  Learning • Data  Mining • Computation   Intelligence • Computer  Vision Intelligent   Systems • Web • Devices • Mobile  Networks • UX  /  UI Social  Computing • Cryptology • Secure  Machine   Learning Big  Data  Security • Engineering • Dev-­‐Ops Big  Data   Infrastructure Center  for  Data  Science:  Societal  Impact
  3. 3. A Big Data Project Blueprint: Machine Learning, Analytics, Engineering. Features, Algorithm, Scalability, ELT, Integrate Sources, Constraints, Deploy Models, APIs, Apps, Data Struggles
  4. 4. Data  Mining:  1989  -­‐ 2010   • Data  Science  and   Applications  move  and   transform  sizeable  amounts   of  data  out  of  the  native   database  or  file  systems. Applications SQL/ODBC/JDBC  Data  Access Distributed  Database Multi-­Core,  Columnar,   Key-­Value Distributed  Database Multi-­Core,  Columnar,   Key-­Value Distributed  Database Multi-­Core,  Columnar,   Key-­Value Distributed  Database Multi-­Core,  Columnar,   Key-­Value Data  Science  using  R,   SAS,  SPSS,  Weka,  MAHOUT H I G H V O L U M E H I G H L A T E N C Y H I G H V O L U M E Application  Ecosystem  Integration
  5. 5. Data  Science  uses  native  data   representation  and  inherent  distribution   and  parallelism Minimal  data  movement Rapid  Application  development  using   data  science  constructs 5 Big  Data  Science Application  Ecosystem  Integration Applications SQL/ODBC/JDBC  Data  Access Data  Science •Internal  Algorithms  for  clustering,   •classification,    regression Distributed  Database Multi-­Core,  Columnar,  Key-­Value L O W E R V O L U M E L O W E R L A T E N C Y H I G H V O L U M E L O W L A T E N C YBig  Data  Science  Components
  6. 6. A Short History of (Big) Data Technology
      • 1970: Codd publishes "A Relational Model of Data for Large Shared Data Banks"
      • 1985: Copeland, Decomposition Storage Model (essentially the first columnar store)
      • 1989: Shared-Nothing Architecture
      • 2004: Google, MapReduce
      • 2005: C-Store (eventually Vertica), layers WS/RS
      • 2007: Materialization optimizations in columnar stores and Hadoop implementation
      • 2005-07: Star-Schema Benchmark + Hadoop
      • 2008: Attempts to backport columnar advances to row storage, not very effective
      • Today: BIG DATA
  7. 7. Technology  Decisions 7 Columnar  Vs Relational  Storage   Technologies Infinite  scale  using  commodity   hardware Private  or  Public  Cloud Massively  Distributed  and   Parallel  Architecture:  Hadoop Stream  Query  Processing  for   trillions  of  events  and  petabytes  of   data Real-­time  classification and   clustering:  Approximate  scoring   and  segmentation  +  Reporting   and  Data  Visualization
  8. 8. Flat  Files  CSV Claims  X12 Clinical    HL7 Distance  Compute  Library Instance  Selection   RNGE Drop  3 Fuzzy  Rough  Set   Approximation CHF  Risk  of   Readmission Geo   Routing Random  Forests KNN Industry  Partners  and  Domain  Experts Other   Solutions HDFS NUMA MPI Grappa Census  US  Gov Unstructured  CCD Bayesian   Networks Support  Vector Machines 8 Cost  of  Chronic   Interventions Age/Gender   Prediction Malware   Analytics Personalized   Cancer  Therapy ETL  Tools Raw  Data  from  Sources    (SID,  OSHPD,  HCUP,  Edifecs,  MHS,  CMS,  LINCS,  Industry) Sqoop
  9. 9. iTornado: Routing Service With Real World Severe Weather. Demo paper in ACM SIGSPATIAL 2014 (Best Demo Paper award). Figure: fatality statistics by weather-related hazard (source: http://www.nws.noaa.gov, June 2014).
  10. 10. COMA Road  Network  Compression  For  Map   Matching ACM  SigSpatial IWGS  2014
  11. 11. PreGo Dynamic  Multi-­‐Preference  Routing Single   Attribute Multiple Attribute Time-­‐ Homogenous Dijkstra,  A* Stewart  et  al  91 Time-­‐Variant Betsy  et al  07 ? <3,4> <2,2> <5,7> <0,0> a s b e T=[1,2,3,4,5] R=[1,2,3,4,5] T=[1,2,3,4,5] R=[1,2,3,4,5] d c g f h T=[1,2,3,4,5] R=[1,2,3,4,5] T=[1,2,3,4,5] R=[1,2,3,4,5] T=[5,1,3,4,5] R=[7,1,2,4,5] T=[1,1,3,4,5] R=[1,2,3,4,5] T=[2,1,3,4,5] R=[2,1,3,4,5] T=[1,2,2,4,3] R=[2,1,5,4,3] T=[1,2,3,1,1] R=[1,2,3,0,1] <1,1> <4,4> T=[4,2,1,3,5] R=[3,2,1,4,5]
  12. 12. Special Needs Education: Teacher Trainer Effectiveness Analysis Customized Surveys Training Registration Survey Management To  support  streamlined  data  collection  and   performance  evaluation  across  the  State  Needs   Projects. Project Stakeholders Office of the Superintendent of Public Instruction Center for Data Science Data Dashboard Purpose Report Generation Geographic Distribution Maps Demographic Reports Brad Porter, Aniruddha Desai, Yitao Li, David Hazel, Michelle Maike, Greg Benner, Ankur Teredesai, Leslie Pyper, Vickie Green
  13. 13. Systems  Biology 13 Predictive  Models   and  software Applications:  Personalized   medicine,  drug  discovery Focus:  Develop  machine  learning   methods  and  tools  to  effectively   integrate  multiple  big  data  sources  in   biology.
  14. 14. A  Flying  Hadoop Cluster 14
  15. 15. Detecting  Malware  Activity  based  on   Automatically  Generated  Domains Command  &  Control   xyz.com xyz.com Infected  node Partnering  with  NIARA  we  obtained  a  large  dataset  of  Automatically  Generated  Domains.   Based    on  the  intercepted  domain  features  we   are  able  to  identify  the  malware  infecting  a   network.  
  16. 16. (March  2012) • Will  this  Heart  Failure  patient   get  readmitted  within  30  days? • Yes  or  No  (Binary  Classification) 16 Reduce  CHF   Readmission Readmission  ? Machine  Learning? Joint  NSF  /  NIH  Solicitation  on  Health  Care  and  Big  Data
  17. 17. Affordable  Care  Act  =>  Avoidable  Costs Readmissions  are  AVOIDABLE 20% 32% 30  days 60  days 75% 25% Non  CHF CHF • Readmissions  national  cost  $17  billion   annually • 76  %  considered  avoidable   17 Readmissions Congestive  Heart  Failure  (CHF) Source:  www.presidency.ucsb.edu,  cdc.gov,  tmz.com
  18. 18. Patient Class Labels No   readmission Readmission CHF  ROR:  30-­‐Day  Hospital  Readmission  Risk   Prediction Machine   Learning     Algorithms 18 Building   the   model Scoring   the   tuple Features Vector Features Vectors New  patient No  readmission Readmission
  19. 19. 19 Some of the Steps Data   Understanding And  Integration Data   Cleaning Data   Transformation Extracting    data  from  Epic  -­‐ 16  data  marts  and  200  views: Heart Failure  Inpatient  Summary Encounter.Flowsheet PatientEncounterHospital vs  
  20. 20. Public Data: State Inpatient Dataset 2009-2012

      AGE  ZIP    RACE  ATYPE  NCHRONIC  LOS  FEMALE  DXCCS1  PRCCS1  TOTCHG
      52   98122  1     3      12        3    0       153     212     56,511
      87   98109  1     3      7         1    1       162     -       12,687
      26   98028  4     3      1         30   1       139     195     127,300

      • Washington State Inpatient Data
      • Admission-level claims
      • ~400 attributes: demographics, ICD9 diagnosis codes, ICD9 procedure codes, charges
      • Admissions by year: 2009: 652702; 2010: 651783; 2011: 648079; 2012: 648092
  21. 21. Variety  and  Volume  (2/3  V’s  of  Big  Data) Pre  Admission Post  Admission Pre-­‐ Discharge Discharge -­‐ Demographics -­‐ Vital  Sign -­‐Prior  Hospitalization Pulse  rate             Blood  pressure   Respiration  rate   BMI Number  of    prior  admissions Prior  length  of  stay + Demographics Sodium  level Glucose  level Hemoglobin  level Creatinine  level Hematocrit  level Neutrophils  level Ejection  Fraction   BUN  level + Vital  Sign + Prior  Hospitalization -­‐ Lab  Test + Vital  Sign + Prior  Hospitalization + Demographics +  Lab  Test -­‐ Diagnosis  Information Number  of  secondary  diagnosis Chronic  systolic  heart  failure   Acute  kidney  failure     Chest  pain Hyper  potassemia   Bronchopneumonia Other  chronic  pulmonary  heart  diseases   Syncope  and  collapse        … + Prior  Hospitalization + Demographics -­‐ Comorbidities Acute  coronary  syndrome    Asthma COPD    Ulcer    Dialysis    Dementia Arrhythmias    Mal  Nutrition   Vascular    Depression -­‐ Discharge/Admit  codes Admit  /Discharge  type Severity  Of  illness    Risk  Of  Mortality   -­‐ Utilization  Information Operating  room  CTSCAN Emergency  Room        CCU Marital  status          Age Racial  group       Gender
  22. 22. (Dec 2012) Initial Models. Data integration; Feature Construction; Predictive modeling: • Logistic Regression • Naïve Bayes • Support Vector Machines. Bar chart of Area Under the Curve (AUC): Yale Model (Comparative …) 0.6; Amarasingham et al. 0.72; Our current Result 0.64. Several Rejects (2012): KDD Industry Track 2013, AMIA 2013, JAMIA 2013
  23. 23. (July  2013)  (much  better)   &  Some  Papers § Improved  data  exploration § S.-­‐C. Chin, K. Zolfaghar, S. Basu Roy, A. Teredesai, and P. Amoroso, "Divide-­‐n-­‐ Discover -­‐-­‐ Discretization based Data Exploration Framework for Healthcare Analytics," 7th International Conference on Health Informatics (HEALTHINF Short Paper), Angers, France, 2014 § N. Meadem, N. Verbiest, K. Zolfaghar, J. Agarwal, S.-­‐C. Chin, S. Basu Roy, A. Teredesai, D. Hazel, P. Amoroso, and L. Reed, "Predicting Risk of Readmission for Congestive Heart Failure Patients," Workshop on Data Mining for Healthcare (DMH), Chicago, IL, 2013 23 0.6 0.72 0.64 0.74 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Yale  Model   (Comparative   Baseline) Amarasingham   et  al.   Our  2012  Result Our  current   Result Area  Under  the  Curve  (AUC) §Improved  Modeling Effort
  24. 24. (Dec  2013)  Prototype  or  a  possible  Product?   &  yes,  More  Papers § Successful  Deployment 24 §K. Zolfaghar, J. Agarwal, D. Sistla, S.-­‐C. Chin, S. Basu Roy, and N. Verbiest, "Risk-­‐O-­‐Meter: An Intelligent Clinical Risk Calculator," 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, 2013 §Kiyana Zolfaghar, Naren Meadem, Ankur Teredesai, Senjuti Basu Roy, Si-­‐Chi Chin, Brian Muckian: Big data solutions for predicting risk-­‐of-­‐readmission for congestive heart failure patients. BigData Conference 2013: 64-­‐71
  25. 25. 25 Multi  Layer  Classifier  :  Automatically  Detecting   Classification  Windows Will  patient ever readmit? Will  patient readmit within 30  days? YES NO YES NO KNN LR NB SVM KNN 32%  of  all  data Only 5%  of  patients that return within 30  days is  filtered out
  26. 26. Generalizing  the  30,60,90  Day  readmission § Automatic  design  of  time  prediction  hierarchy § Feature  selection  and  factor  analysis  at  each  layer § Different  classification  algorithms  in  each  layer  and  satisfying  different   quality  metrics 26
  27. 27. Automatic  design  of  prediction  hierarchy 27
  28. 28. Simple 3 Layer Example
      • Stage 1: Design a predictive model for the patients who are likely to come back within a time window of (X, K), where X is the maximum number of days until next readmission
      • Stage 2: Design a predictive model for time window of (K, 30)
      • Stage 3: Design a predictive model for time window of <30 days of readmission
      How to automatically detect the middle cutpoint K?
  29. 29. Hill Climbing Algorithm to Detect K
      § Generate a random number K between X and 30
      § Compute C1 = Centroid(X, K), C2 = Centroid(K+1, 30)
      § Compute KLCurrent = KLDiv(C1, C2)
      § K' = K + i, K'' = K - i
      § Find a point K2 between (K', K''), and check
      § If KLDiv(Centroid(X, K2), Centroid(K2, 30)) > KLCurrent
      § If the above condition is satisfied, then K = K2
      § KLCurrent = KLDiv(Centroid(X, K2), Centroid(K2, 30))
      § Repeat the above steps until no further check is possible
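The hill-climbing steps on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: patients are hypothetical `(days_to_readmission, feature_vector)` pairs, a centroid is the mean feature vector of a day range, centroids are normalised to probability vectors before the KL divergence is taken, and the search moves to a neighbouring cutpoint K - i or K + i rather than to an arbitrary point between them. The slide's bounds X and 30 appear here as `lo` and `hi`; their ordering is also an assumption.

```python
import math
import random


def centroid(patients, first_day, last_day):
    """Mean feature vector of patients readmitted in [first_day, last_day]."""
    rows = [features for days, features in patients if first_day <= days <= last_day]
    if not rows:
        return None
    return [sum(column) / len(rows) for column in zip(*rows)]


def kl_div(p, q, eps=1e-9):
    """KL divergence of two non-negative vectors, each normalised to sum to 1."""
    p_total = sum(p) + eps * len(p)
    q_total = sum(q) + eps * len(q)
    total = 0.0
    for a, b in zip(p, q):
        pa, qb = (a + eps) / p_total, (b + eps) / q_total
        total += pa * math.log(pa / qb)
    return total


def split_score(patients, lo, k, hi):
    """KL divergence between the centroids on either side of cutpoint k."""
    c1, c2 = centroid(patients, lo, k), centroid(patients, k + 1, hi)
    if c1 is None or c2 is None:
        return float("-inf")
    return kl_div(c1, c2)


def find_cutpoint(patients, lo, hi, step=1, seed=0):
    """Hill-climb from a random cutpoint until no neighbour improves the score."""
    rng = random.Random(seed)
    k = rng.randint(lo, hi - 1)
    best = split_score(patients, lo, k, hi)
    improved = True
    while improved:
        improved = False
        for candidate in (k - step, k + step):
            if lo <= candidate < hi:
                score = split_score(patients, lo, candidate, hi)
                if score > best:
                    k, best, improved = candidate, score, True
    return k, best


# Hypothetical cohort: early returners (days 1-10) look different from later ones.
patients = [(d, [9.0, 1.0]) if d <= 10 else (d, [1.0, 9.0]) for d in range(1, 31)]
cutpoint, divergence = find_cutpoint(patients, lo=1, hi=30)
```

Like any hill climber, this stops at a local maximum; restarting from several seeds is a common way to reduce that risk.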
  30. 30. Calculating the Probability of 30 day RoR: $P(\text{readmit} \le 30) = P(\le 30 \mid \le K) \times P(\le K \mid Y) \times P(Y)$
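As a worked illustration of the product rule on this slide, the sketch below multiplies the outputs of the three layers. It assumes that Y denotes the event that the patient returns at all, following the "Will patient ever readmit?" layer of the multi-layer classifier. The function name and the probability values are hypothetical.

```python
def prob_readmit_within_30(p_returns, p_within_k_given_returns, p_within_30_given_within_k):
    """P(readmit <= 30) = P(<= 30 | <= K) * P(<= K | Y) * P(Y)."""
    return p_within_30_given_within_k * p_within_k_given_returns * p_returns


# Hypothetical layer outputs for one patient.
risk = prob_readmit_within_30(
    p_returns=0.5,
    p_within_k_given_returns=0.6,
    p_within_30_given_within_k=0.4,
)
```

With these made-up inputs the overall 30-day risk is 0.4 × 0.6 × 0.5 = 0.12.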
  31. 31. Risk-­‐O-­‐Meter Distinguishing  Features 31 Risk-­‐O-­‐Meter Users Current  Systems Healthcare  provider and  Patients Only   healthcare  providers Result  explanation and  exploration Need  deep  domain   Knowledge Handle  incomplete  patient   input
  32. 32. All  in  one  Package  – Risk-­‐O-­‐Meter  (KDD  2013) 32
  33. 33. Pre  Admission Post  Admission Pre  -­‐ Discharge Discharge Post-­‐Discharge   Care   Management   Pipeline “White  Gap”PCP HF  Service Care   Management Payer ChroniRisk Continuous  Readmission  Risk  Assessment  Across  Continuum  of  Care* 78%* 42%* Service  Line  EMRPCP  Tools Psycho-­‐social  risk   scoring 2013  HF  Readmission  Statistics • 7.1  M  Readmits • 5.3  M  Avoidable • $13,000  each • $13  B  opportunity  cost Patient  Encounters  Scored +18,000 (HF  cohort)
  34. 34. Risk  – Done Cost  – Done Next?   Actionable  Interventions If  we  can  predict  can  we  recommend? 34 A  Framework  to  Recommend  Interventions  for  30-­‐Day  Heart  Failure  Readmission  Risk,  Rui Liu,  Kiyana Zolfaghar,  SC  Chin,  Senjuti Basu Roy,  Ankur  Teredesai,  Data  Mining  (ICDM),  2014  IEEE  International  Conference   on  DOI:  10.1109/ICDM.2014.89  Publication  Year:  2014  ,  Page(s):  911  -­‐ 916
  35. 35. A  real  and common Chronic  Readmission 75-­‐year  old,  female Chronic  pulmonary  disease,   depression,  hypertension and  diastolic  heart  failure   High Risk Medium Risk Low Risk 35 Readmit! Intervention  Plan  1 Major  Operating  Room,  Chest  X-­‐ray  and  others Intervention  Plan  2 Echocardiology,  CCU  and  others Intervention  Plan  3 Emergency  Room  and  others
  36. 36. Intervention Rule Generation. Risk will be lower when the interventions are performed; the patient is not readmitted. Attributes shown: Readmission, Age, Gender, Pneumonia (DX486), Acute respiratory failure (DX51881), CHF (DX4280), Cont inv mec ven <96 hrs (PR9671), Venous cath NEC (PR3893), Packed cell transfusion (PR9904). Rule Repository: Valid Rule 1: Female, Diabetes, Major Operating Room, Chest X-ray and others. Valid Rule 2: Male, Hypertension, Echocardiology, CCU and others. Invalid Rule 3: Female, Depression, Emergency Room and others. Invalid Rule 4: Male, COPD, Emergency Room and others. Pipeline: Bayesian Network Construction, Intervention Rule Generation, Intervention Recommendation, Evaluation. Steps: compute patient risk using only non-procedural attributes; compute patient risk using procedural attributes; compare the difference between the two probabilities; store the rules where the risk is reduced after introducing the procedures.
  37. 37. Recommendation  for  New  Patient Intervention  Plan  1 Major  Operating  Room,  Chest  X-­‐ray  and  others Intervention  Plan  2 Echocardiology,  CCU  and  others Intervention  Plan  3 Emergency  Room  and  others Top 3 intervention plans Rule  Repository New  Patient  Attributes Summarized  Intervention  Plan Major  Operating  Room,  Echocardiology ,  Chest   X-­‐ray  and  others 37 Summarize The Rule Repository is  HUGE!  (over   30k  rules) Parallel Solution! Bayesian Network Construction Intervention  Rule   Generation Intervention   Recommendation Evaluation Compute similarity between established attribute profile and a given patient profile Identify rules where the established attribute is most similar to the patient input Recommend interventions extracted from the established rules
  38. 38. Validation  – Data  Highlights • State  Inpatient  Database  (SID) of  Washington  State  heart  failure  cohort  in  year  2010   (67967  patients) for training and 2011 (52021 patients)  for  testing • 3908  diagnosis  and  2049  procedure  codes  are  involved. • Feature  Selection  is  performed  using  chi-­‐square  test. Demographics Age,  Gender,  Race Comorbidity  &  Diagnosis 21  comorbidities  and  90  diagnosis Utilization  &  Interventions 21 health  service  utilization  flags  and  70  interventions Others Length of  Stay,  #  of  diagnosis  and  interventions 38 High Dimensional Bayesian Network Construction Intervention  Rule   Generation Intervention   Recommendation Evaluation Extract patients from the test set who were not readmitted within 30 days Compute the evaluation metrics between the recommended interventions and the actual interventions
  39. 39. Validation – Experiment Results. Four bar charts (Hits, Jaccard Index, Accuracy, True Positive Rate), each comparing Linear Regression, Hill-Climbing, Grow-Shrink and Hybrid. Pipeline: Bayesian Network Construction, Intervention Rule Generation, Intervention Recommendation, Evaluation
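One of the reported metrics, the Jaccard index, compares the recommended interventions with the interventions a patient actually received. A minimal Python sketch follows. The function name, the convention of returning 1.0 for two empty sets, and the intervention labels are assumptions made for illustration.

```python
def jaccard_index(recommended, actual):
    """Size of the intersection divided by the size of the union of two sets."""
    recommended, actual = set(recommended), set(actual)
    union = recommended | actual
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(recommended & actual) / len(union)


# Hypothetical example with three recommended and three actual interventions.
score = jaccard_index(
    {"chest x-ray", "echocardiology", "ccu"},
    {"echocardiology", "ccu", "emergency room"},
)
```

Here two interventions are shared out of four distinct ones, giving a Jaccard index of 0.5.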
  40. 40. Back  to  the  Chronic  Readmission  Case 75-­‐year  old,  female Chronic  pulmonary  disease,   depression,  hypertension and  diastolic  heart  failure   40 No-­‐readmit! Cardiac  catheterization  lab,  CT  scan,  echo-­‐ cardiology,  echo-­‐cardiogram,   Cardiac  catheterization  lab,  CT  scan,  echo-­‐ cardiology,  echo-­‐cardiogram
  41. 41. Accountable  Care  Organizations Cost/Charge  Prediction 41 HealthSCOPE:  An  Interactive  Distributed  Data  Mining  Framework  for  Scalable  Prediction  of  Healthcare  Costs  ,  Marquardt  James,  Newman  Stacey, Hattarki Deepa,  Srinivasan Rajagopalan,  Sushmita Shanu,  Ram  Prabhu,  Prasad  Viren,  Hazel  David,  Ramesh  Archana,  De  Cock  Martine,  Teredesai  Ankur,   IEEE  Data  Mining  Conference  Demo  Track,  2014  IEEE  International  Conference  on  DOI:  10.1109/ICDMW.2014.45  Publication  Year:  2014  ,  Page(s):  1227  -­‐ 1230
  42. 42. 42 What  are  healthcare   costs  for  assigned   population  in  2015  ? Why  is  the  cost  so   high  or  low  ? How  does  the  cost   distribute  across   demographics  ? QUESTIONS DATA   SCIENCE DATA APPLICATIONS Motivation:   ACO  Cost  Prediction Demographics Diagnosis   Codes Procedure   Codes Drugs Lab  Results Clinical Claims Sources  :  SID,  OSHPD,  MEPS Source  :  MultiCare  Collaboration Charges Vitals Population Predictive   Modeling Feature  Prioritization Health  Prediction Care  Management Individual Predictive   Modeling Chandola et.  al,  KDD  2013  
  43. 43. Cost/Charge Prediction: Problem Description
      • Goal → predict the future healthcare cost of individuals based on their past medical and cost information.
      • Supervised machine learning problem.
      • Input: previous health information (e.g. diagnosis, comorbidities, etc.); general demographics (age, gender, race); previous healthcare cost; $X = (x_1, x_2, x_3, \dots, x_p)$
      • Output: $Y$ = future healthcare cost
  44. 44. Four Scenarios for predicting cost
      • Three months of historical data (medical, demographic and cost) → cost of following nine months (1Q)
      • Six months of historical data (medical, demographic and cost) → cost of following six months (2Q)
      • Nine months of historical data (medical, demographic and cost) → cost of following three months (3Q)
      • Twelve months of historical data (medical, demographic and cost) → cost of following twelve months (4Q)
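The four scenarios differ only in how a cost history is cut into an observation window and a prediction window. The sketch below shows that split for a list of monthly costs. The `SCENARIOS` table mirrors the slide, while the function name and the flat monthly-cost representation are assumptions, and the medical and demographic features are left out.

```python
# (history months, following months) for each scenario on the slide.
SCENARIOS = {"1Q": (3, 9), "2Q": (6, 6), "3Q": (9, 3), "4Q": (12, 12)}


def split_history(monthly_costs, history_months, horizon_months):
    """Return (previous cost, future cost) for one individual.

    The previous cost is a model input; the future cost is the target Y.
    """
    needed = history_months + horizon_months
    if len(monthly_costs) < needed:
        raise ValueError(f"need {needed} months of data, got {len(monthly_costs)}")
    previous = sum(monthly_costs[:history_months])
    future = sum(monthly_costs[history_months:needed])
    return previous, future


# Hypothetical individual with a flat $100 monthly cost over two years.
costs = [100.0] * 24
targets = {name: split_history(costs, *windows) for name, windows in SCENARIOS.items()}
```

Note that the 4Q scenario needs 24 months of data per individual, whereas the other three fit inside a single year.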
  45. 45. Non-­‐ Gaussian  Distribution  of  Healthcare  Costs foo 45 Makes  it  challenging  and  interesting  problem  for  research
  46. 46. Existing Cost Prediction Methods
      • Limited to rule-based or Multiple Linear Regression methods
      • Rule-based methods: require domain knowledge; expensive
      • Multiple Linear Regression: multicollinearity issue; sensitive to extreme values (outliers)
      • Evaluation: estimate the mean cost of the given sampling distribution; often in-sample data is used to report predictive performance; R² evaluation metric (not a true indicator)
  47. 47. Our Contributions
      • Investigate the utility of state-of-the-art machine learning algorithms for the cost prediction problem.
      • We empirically evaluate three algorithms: Regression Trees, M5 Model Trees, Random Forest
  48. 48. Regression  Tree 48 Age  >  60? Has   Asthma? Gender  =   Female? 21,00046,00062,00085,000 Yes Yes Yes No No No
  49. 49. M5  Model  Tree foo 49 Has   Asthma? Gender  =   Female? Yes Yes Yes No No No Age  >  60?
  50. 50. Random  Forest 50 Had   Procedure   X? Age  >  18? Gender  =   Male? 21,00046,00062,00085,000 Yes Yes Yes #  Admits   >  3? No No Race  =   White? Has  CHF? 21,00046,00062,00085,000 Yes Yes YesNo No No NoAge  >   60? Has   Asthma? 21,000 Gender  =   Female? 46,00062,00085,000 Yes Yes YesNo No No
  51. 51. 51 Evaluation  Metrics • Mean  Absolute  Error  (MAE) • Root  Mean  Squared  Error  (RMSE)
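Both evaluation metrics have standard definitions: MAE is the mean of the absolute errors, and RMSE is the square root of the mean squared error. The Python sketch below computes them on hypothetical dollar costs; the function names and numbers are illustrative only.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all individuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalises large misses more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical costs in dollars.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

RMSE is never smaller than MAE, and the gap widens when a few individuals have very large errors, which matters for skewed cost data.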
  52. 52. 52 MAE  Results  – SID  Data  (3Q  Scenario) 0 5,000 10,000 15,000 20,000 25,000 30,000 Average   Baseline Previous   Cost   Regression Multiple   Linear   Regression Regression   tree Random   Forest Model  Tree MAE  ($) Baselines Advanced  Models
  53. 53. 53 MAE  Results  – MEPS  Data 0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 Average   Baseline Previous   Cost   Regression Multiple   Linear   Regression Regression   tree Random   Forest Model  Tree MAE  ($) Baselines Advanced  Models
  54. 54. Prediction Error Results – M5 Model Trees [Chart: Error ($), axis from 0 to 90,000, showing MAE and RMSE for the 1Q, 2Q, 3Q, and 4Q scenarios]
  55. 55. Error Distribution: WA State SID Data • For a large fraction of the population (75%), we were able to predict with higher accuracy using these algorithms [Chart: Maximum Prediction Error ($), axis from 0 to 18,000, against Portion of Population (0% to 75%) for Multiple Linear Regression, Regression Tree, Random Forest, and Model Tree]
  56. 56. Sub-Population Cost Prediction [Diagram: future healthcare cost is predicted for the population and for each sub-population] • Congestive heart failure (CHF) • Diabetes • COPD • Asthma • Coronary artery disease (CAD) • Age 65+
  57. 57. Most difficult cohort to predict [Bar chart: MAE ($), axis from 0 to 35,000, for the Asthma, Diabetes, CHF, COPD, Coronary, and Over 65 cohorts, comparing model trees with linear regression]
  58. 58. Engineering the Solutions: Risk-O-Readmission & Cost-as-a-Service
  59. 59. Thu, Nov 7, 2013 at 10:50 AM ---------- Forwarded message ---------- From: Windows Azure Pass System Admin <wapadmin@microsoft.com> Date: Thu, Nov 7, 2013 at 10:50 AM Subject: Gifting Letter for Windows Azure Research Pass To: "Ankur M. Teredesai" <ankurt@uw.edu> Cc: "Azure4Research (RFP External)" <azurerfp@microsoft.com> Dear Ankur M. Teredesai, We have approved your application for a Windows Azure Research Pass Grant. In order to receive your pass, download the Microsoft gifting letter from the following link:
  60. 60. Risk-of-Readmission as a Service
  61. 61. Scale Issues: Cost Prediction as a Service [Architecture diagram] • Data sources: WA-SID claims / MEPS survey (for training) • Big data stack: R, Spark • Cost prediction engine: model bank deployed on ADAPA with (A) linear regression, (B) regression trees, (C) M5 model trees, and a model selector • Cost Prediction API, steps ① to ④: beneficiary claims, individual beneficiary feature vector, individual beneficiary predicted cost, returned results • Web app for ACOs: sends beneficiary claims (population batch or individual); receives predicted, previous-year, and historic population costs plus population statistics • Web app for individuals: sends beneficiary claims for the individual (①); receives the predicted cost for the individual (④)
  62. 62. Cost Prediction as a Service [Architecture diagram repeated from slide 61, without the "Scale Issues" label]
  63. 63. Apache Spark [Diagram: Apache Spark running on top of HDFS. A master node hosts the driver; each slave node runs Spark and holds one partition of an RDD as in-memory data (Partition 1, Partition 2); HDFS stores the data partitions together with their replicas]
  64. 64. Weighted k-NN for Regression [Diagram: a test instance is sent to Node 1 (Data Partition 1) and Node 2 (Data Partition 2); each node computes its local k nearest neighbors (kNN1, kNN2); the resulting 2k candidates are grouped and sorted, the top k are kept, and their weighted average is the predicted cost]
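The flow on this slide (local k-NN per partition, merge to the global top k, weighted average) can be sketched in plain Python. This is an illustration under stated assumptions, not the deck's Spark implementation: the partitions are simulated as in-process lists, the distance is Euclidean, and the inverse-distance weighting is an assumption, since the slide does not say which weights are used. The function names and data are hypothetical.

```python
import heapq
import math


def local_knn(partition, query, k):
    """Per-node step: the k nearest (distance, cost) pairs within one partition."""
    scored = [(math.dist(features, query), cost) for features, cost in partition]
    return heapq.nsmallest(k, scored)


def predict_cost(partitions, query, k, eps=1e-9):
    """Merge local candidates, keep the global top k, and average their costs."""
    candidates = []
    for partition in partitions:  # each iteration stands in for one compute node
        candidates.extend(local_knn(partition, query, k))
    top_k = heapq.nsmallest(k, candidates)  # "group & sort", then "top k"
    # Assumed weighting: inverse distance (eps avoids division by zero).
    weights = [1.0 / (distance + eps) for distance, _ in top_k]
    weighted_sum = sum(w * cost for w, (_, cost) in zip(weights, top_k))
    return weighted_sum / sum(weights)


# Hypothetical one-feature training data split over two partitions.
partitions = [
    [((0.0,), 100.0), ((10.0,), 1000.0)],
    [((1.0,), 200.0), ((20.0,), 5000.0)],
]
prediction = predict_cost(partitions, query=(0.5,), k=2)
print(prediction)
```

Merging only k candidates per partition is enough because the global k nearest neighbors must each be among the k nearest within their own partition.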
  65. 65. Rough Set • Rough set theory is an ML framework that is especially suitable for information systems with inconsistencies. • Rough set theory handles discrete attributes. • Lower approximation: instances that necessarily belong to the class • Upper approximation: instances that possibly belong to the class • Similar patients can belong to different classes (P3 and P4 below).
      Patient   Age ≥ 50   Alcohol Disorder Visit   Cost
      P1        Yes        Yes                      High
      P2        Yes        Yes                      High
      P3        Yes        No                       Low
      P4        Yes        No                       High
      P5        No         No                       Low
      P6        No         Yes                      High
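The lower and upper approximations can be computed directly for the six patients on this slide. The sketch below groups patients with identical attribute values and checks each group against the class "High"; the function and variable names are illustrative, not from the deck.

```python
from collections import defaultdict

# (Age >= 50, Alcohol Disorder Visit) -> Cost class, as listed on the slide.
patients = {
    "P1": (("Yes", "Yes"), "High"),
    "P2": (("Yes", "Yes"), "High"),
    "P3": (("Yes", "No"), "Low"),
    "P4": (("Yes", "No"), "High"),
    "P5": (("No", "No"), "Low"),
    "P6": (("No", "Yes"), "High"),
}


def approximations(data, target):
    """Return (lower, upper) approximations of the class `target`."""
    # Patients with identical attribute values are indiscernible.
    blocks = defaultdict(set)
    for name, (attributes, _) in data.items():
        blocks[attributes].add(name)
    members = {name for name, (_, label) in data.items() if label == target}

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= members:   # every patient in the block is in the class
            lower |= block
        if block & members:    # at least one patient in the block is in the class
            upper |= block
    return lower, upper


lower, upper = approximations(patients, "High")
print(sorted(lower))  # necessarily High
print(sorted(upper))  # possibly High
```

P3 and P4 share the same attribute values but differ in cost class, so they fall in the upper approximation only.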
  66. 66. Fuzzy Rough Set • Uses fuzzy logic to handle continuous attributes. • The similarity matrix contains values between 0 and 1. • Inconsistent instances are highly related but have a different class.
      Patient   Age   Alcohol Disorder Visits   Cost
      P1        52    1                         $13335
      P2        59    4                         $277966
      P3        55    0                         $8139
      P4        50    0                         $66058
      P5        34    0                         $5815
      P6        26    1                         $38526
      Similarity matrix:
            P1     P2     P3     P4     P5     P6
      P1    1.00   0.52   0.83   0.84   0.60   0.61
      P2    0.52   1.00   0.44   0.36   0.12   0.13
      P3    0.83   0.44   1.00   0.92   0.68   0.44
      P4    0.84   0.36   0.92   1.00   0.76   0.51
      P5    0.60   0.12   0.68   0.76   1.00   0.75
      P6    0.61   0.13   0.44   0.51   0.75   1.00
  67. 67. Fuzzy Rough Set • Let $r_{j,i}$ be the degree of similarity of instances $i$ and $j$. • Let $c_i$ be the degree to which instance $i$ belongs to the class. • Then the degree to which instance $j$ belongs to the: • Lower approximation of the class is $\min\{\max(1 - r_{j,i},\, c_i) \mid i = 1, \dots, n\}$ • Upper approximation of the class is $\max\{\min(r_{j,i},\, c_i) \mid i = 1, \dots, n\}$ • Current implementations can handle only up to 100,000 instances because they keep the similarity matrix in memory.
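The two formulas on this slide translate almost literally into code. The sketch below evaluates them on a small hypothetical 3 x 3 similarity matrix with hypothetical class memberships; none of these numbers come from the deck.

```python
def fuzzy_lower(r, c, j):
    """Degree to which instance j belongs to the lower approximation:
    min over i of max(1 - r[j][i], c[i])."""
    return min(max(1.0 - r[j][i], c[i]) for i in range(len(c)))


def fuzzy_upper(r, c, j):
    """Degree to which instance j belongs to the upper approximation:
    max over i of min(r[j][i], c[i])."""
    return max(min(r[j][i], c[i]) for i in range(len(c)))


# Hypothetical similarity matrix r and class memberships c (illustration only).
r = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
c = [1.0, 0.0, 0.0]

lower = [fuzzy_lower(r, c, j) for j in range(len(c))]
upper = [fuzzy_upper(r, c, j) for j in range(len(c))]
print(lower, upper)
```

Instance 0 is in the class but is highly similar to instance 1, which is not, so its lower-approximation degree drops to about 0.2 while its upper-approximation degree stays at 1.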
  68. 68. Fuzzy Rough Set • Upper approximation: $\max\{\min(r_{j,i},\, c_i) \mid i = 1, \dots, n\}$
  69. 69. Fuzzy Rough Set • Lower approximation: $\min\{\max(1 - r_{j,i},\, c_i) \mid i = 1, \dots, n\}$
  70. 70. Implementation • The similarity matrix can be constructed in parallel: each of K compute nodes calculates n/K columns of the matrix. • There is no need to store the similarity matrix as a whole. • Construction of the similarity matrix does not have to finish before (partial) computation of the lower and upper approximations can begin. [Diagram: columns of the similarity matrix divided between Node 1 and Node 2]
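The column-wise scheme described on this slide can be sketched as follows. Each simulated node handles a subset of columns, computes similarities on the fly without storing the matrix, and keeps only a running min (lower) and max (upper) per instance; the partial results are then combined element-wise. This is a single-process illustration under assumptions: the similarity function, the strided column assignment, and the data are hypothetical, and in a real deployment each chunk would run on its own node.

```python
def similarity(x, y, value_range):
    """Hypothetical similarity in [0, 1]: 1 minus the normalized absolute difference."""
    return 1.0 - abs(x - y) / value_range


def partial_bounds(values, memberships, columns, value_range):
    """Work of one node: fold its columns i into running bounds for every instance j."""
    n = len(values)
    lower = [1.0] * n  # identity element for min
    upper = [0.0] * n  # identity element for max
    for i in columns:
        for j in range(n):
            r = similarity(values[j], values[i], value_range)  # never stored
            lower[j] = min(lower[j], max(1.0 - r, memberships[i]))
            upper[j] = max(upper[j], min(r, memberships[i]))
    return lower, upper


def combine(partials):
    """Merge per-node results: element-wise min of lowers, max of uppers."""
    lowers, uppers = zip(*partials)
    lower = [min(column) for column in zip(*lowers)]
    upper = [max(column) for column in zip(*uppers)]
    return lower, upper


def column_chunks(n, k):
    """Assign roughly n/k columns to each of k nodes (strided assignment)."""
    return [range(start, n, k) for start in range(k)]


# Illustrative single-attribute instances and hypothetical class memberships.
values = [52.0, 59.0, 55.0, 50.0, 34.0, 26.0]
memberships = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
value_range = max(values) - min(values)

partials = [
    partial_bounds(values, memberships, columns, value_range)
    for columns in column_chunks(len(values), 2)  # K = 2 simulated nodes
]
lower, upper = combine(partials)
print(lower, upper)
```

Because min and max are associative, the combined result matches what a single node would compute over all columns, and memory per node stays linear in n rather than quadratic.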
  71. 71. Implementation [Figure panels: Lower Approximation, Upper Approximation]
  72. 72. Spark vs. MPI: Fuzzy Rough Set
  73. 73. Cost Prediction as a Service [Architecture diagram repeated from slide 62]
  74. 74. Readmission Application • Android • Windows Phone • Patient View • What is my risk? • Doctor View • Who are my risky patients? • Alerts • Interventions
  75. 75. Healthcare Scalable COst Prediction Engine (HealthSCOPE): http://healthscope.cloudapp.net/hscope-dev/aco/
  76. 76. Milestones: Readmission Risk (AUC: accuracy measure, Area Under Curve) • Baseline: Yale model, 0.6 AUC • Dec 2012: UW 2012 result, 0.64 AUC (ensemble method, hierarchical classification) • June 2013: Risk-o-Meter development + big data efforts • Dec 2013: QlikView readmission app • Feb 2014: UW 2014 result, 0.74 AUC (lab results + new algorithm, AdaBoost) • March 2014: Integrating care pathway • July 2014: Real-time care factors & pathways, with EPIC • Aug 2014 -> moving forward: machine learning process to target new chronic diseases • Other timeline labels: Bayesian network learning; Pre-Admission (clinical data); Post-Admission (clinical data); Post-Discharge (clinical data); Post-Discharge (claim data) • Publications: IEEE Big Data (REF #3); KDD (REF #1 & 2); HEALTHINF (REF #4 & 5); KDD (REF #6); ICDM 2014 (REF #6)
  77. 77. Milestones: Cost Prediction • Problem exploration • June 2014: H-SCOPE I (SID data) • August 2014: H-SCOPE II (population view, ACO) • Sept. 2014: H-SCOPE III (Adapa scoring engine, Spark framework) • Nov. 2014: H-SCOPE IV (SID + MEPS data) • Dec. 2014: H-SCOPE V (five cohorts) • July 2015: HealthSCOPE VI • H-SCOPE VII (AHRQ private data) • Aug 2015 -> moving forward • Other timeline labels: M5 model trees, random forest, regression trees; admit level; beneficiary level; beneficiary view; OSHPD data application; four future scenarios; sub-population; deep learning; time & cost of hospital readmission; time, cost and illness (alignment) prediction • Venues: ICDM 2014; KDD-2015; AMIA-2015; WWW-Digital Health-2015
  78. 78. Milestones: Merging Threads [Timeline from 2012 to 2016 and beyond] • Risk of Readmission (clinical, sociological & claims): 2012, 2013, 2014, 2015 • Cost Prediction (claims and secondary data sources): 2014, 2015 • Risk & Cost Convergence: from 2015 • AUC: accuracy measure (Area Under Curve)
  79. 79. [Stack diagram] • Applications: CHF Risk of Readmission; Geo Routing; Cost of Chronic Interventions; Age/Gender Prediction; Malware Analytics; Personalized Cancer Therapy; Other Solutions • Industry partners and domain experts • Algorithms and libraries: Distance Compute Library; Instance Selection (RNGE, Drop 3); Fuzzy Rough Set Approximation; Random Forests; KNN; Bayesian Networks; Support Vector Machines • Platforms: HDFS; NUMA; MPI; Grappa • Data formats: Flat Files CSV; Claims X12; Clinical HL7; Census US Gov; Unstructured CCD • Ingestion: ETL tools; Sqoop • Raw data from sources (SID, OSHPD, HCUP, Edifecs, MHS, CMS, LINCS, Industry)
  80. 80. [Stack diagram repeated from slide 79, with Personalized Cancer Therapy and CHF Risk of Readmission swapped]
  81. 81. Our Sincere Thanks for Your Support!
