Real-time Analytics for the Healthcare Industry: Arrhythmia Detection - Impetus Article

It is time for the healthcare industry to move from the era of “analyzing our health history” to the age of “managing the future of our health.” An article by Impetus Big Data experts illustrates the importance of real-time analytics across the healthcare industry by providing a generic mechanism to reengineer traditional analytics expressed in the R programming language into Storm-based real-time analytics code.

The article was featured in the inaugural issue of the Big Data Journal launched at Strata Conference 2013, CA.

Further details:

Authors: Vijay Srinivas Agneeswaran, Joydeb Mukherjee, Ashutosh Gupta, Pranay Tonpay, Jayati Tiwari, and Nitin Agarwal

Published in: Technology, Business

Transcript

INDUSTRY EXPERIENCE

REAL-TIME ANALYTICS FOR THE HEALTHCARE INDUSTRY: ARRHYTHMIA DETECTION

Vijay Srinivas Agneeswaran, Joydeb Mukherjee, Ashutosh Gupta, Pranay Tonpay, Jayati Tiwari, and Nitin Agarwal
Impetus Infotech Private Limited, Bangalore, Karnataka, India

Big Data, Vol. 1, No. 3, September 2013, pp. 176-182 (Mary Ann Liebert, Inc.). DOI: 10.1089/big.2013.0018

Editor's Note: Impetus supports multiple venues for dialogue in big data, providing thought leadership and services to create new ways to analyze data to gain key opportunities in business and industry across enterprises. The following is a description of one potential application of their expertise in machine learning within the healthcare space.

Abstract

It is time for the healthcare industry to move from the era of “analyzing our health history” to the age of “managing the future of our health.” In this article, we illustrate the importance of real-time analytics across the healthcare industry by providing a generic mechanism to reengineer traditional analytics expressed in the R programming language into Storm-based real-time analytics code. This is a powerful abstraction, since most data scientists write their analytics in R and are not clear on how to make them work in real time and on high-velocity data. Our paper focuses on the applications necessary to a healthcare analytics scenario, specifically the importance of electrocardiogram (ECG) monitoring. A physician can use our framework to compare ECG reports by categorization and consequently detect arrhythmia. The framework reads the ECG signals and uses a machine learning-based categorizer that runs within a Storm environment to compare different ECG signals. The paper also presents some performance studies of the framework to illustrate the trade-off between throughput and accuracy in real-time analytics.

Introduction

The healthcare industry is undergoing a major transformation. The old days of using paper records of patients' data are gone with the digitization of healthcare information, starting with the use of electronic health records (EHRs). The use of EHRs is becoming widespread, partly dictated by financial stimulus and partly by governmental regulations. The healthcare industry is now turning to the use of data analytics.
The pace is likely to pick up with the advent of the Affordable Care Act (ACA), or “Obamacare,” which promises to transform the healthcare industry from fee-for-service to fee-for-value. Moreover, because eligibility requirements are being widened and care is becoming more affordable, more people will come into the healthcare system. This implies the need for big-data analytics, especially for the mandated health exchanges.

The Affordable Care Act has also spurred many innovations in healthcare. This is evident in the number of healthcare startups funded recently, such as the following (this list is only indicative, not intended to be complete, and is biased toward health analytics):

1. Health Catalyst, which provides an analytics suite to analyze EHRs.
2. xG Health Solutions, which provides analytics of population health as well as reporting and interpretation.
3. Lumeris, which uses real-time analytics of healthcare data to improve patient care, essentially focused on making the ACA work for all players, including health systems, payers, and providers.
4. Eviti, which provides physicians with actionable information using analytics for cancer-related decision making.
5. Humedica, which uses data from multiple sources, including EHRs, claims data, etc., to help healthcare providers analyze patient data as well as population data.
6. HealthTap, which provides a social platform for physicians and patients to share information as well as build a peer reputation.

A number of startup accelerators are also active, including Nanthealth, Rockhealth, Healthbox, and Blueprint Health Services, among others.

This article presents a different scenario requiring real-time analytics of big data and, as an example, applies cutting-edge big data technologies to historical data. The electrocardiogram (ECG) signal provides critical information about the heart activity of a patient. Continuous monitoring of ECG is important when a patient is ambulatory or at the bedside. It is very important to treat arrhythmic patients on time, as delays can lead to potentially fatal complications.¹

Arrhythmia detection from ECG signals is a well-studied problem. For instance, Gao et al.¹ solve it by using an artificial neural network approach based on a Bayesian framework. Rothman and colleagues² compare two common devices, the loop event monitor and the mobile cardiac outpatient telemetry system, and their effectiveness in detecting arrhythmias.

Machine Learning-Based Classification of ECG Data

The classifier we have developed works in two modes: the training mode (or learning mode) and the operational mode (or advisory mode). In the training mode, we extract features (i.e., variables or transformed variables) in terms of which arrhythmia types, including the absence of arrhythmia, can be represented, and we learn the parameters of the inference mechanism about the occurrence or nonoccurrence of a type of arrhythmia. In this mode, the results cannot advise the doctor; rather, the label supplied with each record (i.e., the type of arrhythmia or its absence) is used for training (see Fig. 1).

FIG. 1. ML Based Classification of ECG Data: Training Mode.

Once the training is complete, the classifier goes into operational mode, meaning it begins advising the doctor on new, unseen cases that are similar to those seen during training. The doctor arrives at an inference about the presence or absence of arrhythmia taking the output of the classifier into consideration. If arrhythmia is present, the classifier can also suggest which type it is. The various arrhythmia classes (labels) are listed in a subsequent section. This mode of operation is depicted in Fig. 2.

FIG. 2. ML Based Classification of ECG Data: Operational Mode.

The input to the machine learning algorithm is a set of historic patient records. Clinical measurements recorded in the past from ECG signals, namely QRS duration and the RR, P-R, and Q-T intervals, constitute such records, along with information such as gender, age, and weight. This data is padded with the categorical label a cardiologist had assigned to each record, such as “normal” or one of the 15 types of pathology categories. These make up a total of 279 features as enumerated by Guvenir et al.³

Class names and distribution (database: Arrhythmia)

Class code  Class                                         Number of instances
01          Normal                                        245
02          Ischemic changes (Coronary Artery Disease)    44
03          Old Anterior Myocardial Infarction            15
04          Old Inferior Myocardial Infarction            15
05          Sinus tachycardia                             13
06          Sinus bradycardia                             25
07          Ventricular Premature Contraction (PVC)       3
08          Supraventricular Premature Contraction        2
09          Left bundle branch block                      9
10          Right bundle branch block                     50
11          First-degree atrioventricular (AV) block      0
12          Second-degree AV block                        0
13          Third-degree AV block                         0
14          Left ventricular hypertrophy                  4
15          Atrial Fibrillation or Flutter                5
16          Others                                        22

Description of dataset

We analyzed a dataset containing 452 records belonging to patients of different ages, weights, heights, and genders (see http://archive.ics.uci.edu/ml/datasets/Arrhythmia for more information). There are in all 280 variables in the database downloaded from the source,¹ with the arrhythmia class type as the 280th column. Values for this column range from 1 to 16, each representing one of the codes enumerated above. There are 5 categorical variables and 274 numeric variables. Five variables had missing values in their records, as enumerated below. These variables occurred in columns 11 to 15 in the original dataset.

Vector angles in degrees on front plane of:
  11  T            8 values missing
  12  P            22 values missing
  13  QRST         1 value missing
  14  J            376 values missing
Number of heart beats per minute:
  15  Heart rate   1 value missing

Some of the variables had “0” throughout the column (i.e., across all records). Those variables are enumerated below with their column number followed by the variable name:

  20   DI S-prime Wave
  68   AVL S-prime Wave
  70   AVL Existence of ragged R wave
  84   AVF Existence of ragged P wave
  132  V4 Existence of ragged P wave
  133  V4 Existence of diphasic derivation of P wave
  140  V5 S-prime Wave
  142  V5 Existence of ragged R wave
  144  V5 Existence of ragged P wave
  146  V5 Existence of ragged T wave
  152  V6 S-prime Wave
  157  V6 Existence of diphasic derivation of P wave
  158  V6 Existence of ragged T wave
  165  DI Amplitude S-prime Wave
  205  AVL Amplitude S-prime Wave
  265  V5 Amplitude S-prime Wave
  275  V6 Amplitude R-prime Wave

The first four columns in the original data file held non-ECG variables, as follows:

  1  Age: age in years, linear
  2  Sex: sex (0 = male; 1 = female), nominal
  3  Height: height in centimeters, linear
  4  Weight: weight in kilograms, linear

Classification algorithm

We chose the random forest (RF) classifier² for several reasons: it is fast to train; its OOB (out-of-bag) error is a good estimate of the generalization error; it can handle noisy data; it can suggest “important variables,” from which a parsimonious predictive model can be built; and it has an associated imputation method, which at times is a better choice than any external imputation method. Additionally, two or more separately trained RFs can be combined without incurring much computational expenditure.

RF is an ensemble classifier (i.e., a collection of classifiers), which predicts by counting the votes cast by each constituent classifier for a class on a query record. The predictive performance of an ensemble classifier is typically better than that of any of its constituents. The constituent classifiers for RF are classification trees. The advantage of using such classifiers is that individual trees may be barely accurate (slightly better than random guessing), yet combining them may produce a classifier with much higher accuracy. Also, a great deal of variance may be present as we move from one tree to another, but the overall classifier's variance is reduced because of the averaging that takes place in the course of “ensembling.” RF is trained by bagging (bootstrap aggregation) of the training data.
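Bagging can be illustrated with a short simulation. The sketch below is written in Python rather than the R used in the article and is not the authors' code; the function names are hypothetical. It draws bootstrap samples of 452 record indices (the size of the dataset described above) and reports the average share of distinct records that end up in a sample, which for a large dataset approaches 1 − 1/e.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Draw n record indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def in_bag_fraction(n, n_trees, seed=0):
    """Average share of distinct records that land in a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        total += len(set(bootstrap_indices(n, rng))) / n
    return total / n_trees


if __name__ == "__main__":
    n_records = 452  # number of records in the dataset described above
    n_trees = 500    # one bootstrap sample per tree (illustrative count)
    frac = in_bag_fraction(n_records, n_trees)
    print(f"in-bag share:     {frac:.3f}")
    print(f"out-of-bag share: {1 - frac:.3f}")
    print(f"1 - 1/e:          {1 - 1 / math.e:.3f}")
```

The records missing from a given sample are exactly the ones that tree never saw, which is what makes them usable as a built-in test set.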
Random samples are drawn with replacement from the training data, and a classification tree is built from each sample. For a large training set, each bootstrap sample contains on average a fraction 1 − 1/e (about 63%) of the distinct original records; the remaining 37% or so are left out of that tree's sample and are used to test it, which yields the OOB error. It can be shown that this error is a fair indication of the generalization error of the RF classifier. Generalization error measures the predictive performance of a classifier when tested with unseen data outside the training set but supposedly generated from the same distribution as the training data. These are the kind of data the classifier will encounter in the operational mode. The keys to the predictive performance of the RF classifier are the strength of the individual classifiers and the diversity (degree of uncorrelatedness) of the constituent classification trees in the forest, in terms of raw margin functions.⁴

Imputation of missing values

In the exploratory data analysis (EDA) phase, it was found that important variables such as “heart rate,” measured in number of heart beats per minute, had some missing values. A couple of imputation algorithms were tried out,⁵,⁶ and finally rfImpute from the randomForest package was chosen to impute those missing values. Amelia⁷ was not adopted because it could produce imputed results only with a high value of prior information [with the “empri” parameter as high as 0.9*nrow(data), when usually 0.01*nrow(data) is used]. A prior that large amounts to adding a lot of artificial observations with the same mean and variance as the existing observations but with zero covariance.

Imbalance of data with respect to classes

The gross imbalance in the dataset (Table 1) poses problems for selecting a subset of data to be used for training and testing. If the training and testing sets are partitioned in the typical way (70%-80% for training and 30%-20% for testing), classification performance will be misleading.
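One way to see why a naive split can mislead is to look at the majority-class baseline. The sketch below (Python, not the authors' R code; the helper names are hypothetical) uses the class counts published with the UCI Arrhythmia dataset and computes the accuracy of a "classifier" that always answers Normal, which comes to about 54% while detecting no arrhythmia at all.

```python
# Class counts from the UCI Arrhythmia class distribution (codes 01-16).
CLASS_COUNTS = {
    1: 245, 2: 44, 3: 15, 4: 15, 5: 13, 6: 25, 7: 3, 8: 2,
    9: 9, 10: 50, 11: 0, 12: 0, 13: 0, 14: 4, 15: 5, 16: 22,
}


def class_shares(counts):
    """Share of all records that falls in each class."""
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def majority_baseline_accuracy(counts):
    """Accuracy of always predicting the largest class."""
    return max(counts.values()) / sum(counts.values())


if __name__ == "__main__":
    shares = class_shares(CLASS_COUNTS)
    for code in sorted(CLASS_COUNTS):
        print(f"class {code:02d}: {CLASS_COUNTS[code]:3d} records "
              f"({100 * shares[code]:.1f}%)")
    baseline = majority_baseline_accuracy(CLASS_COUNTS)
    print(f"always-Normal accuracy: {100 * baseline:.1f}%")
```

Overall accuracy therefore says little here; per-class precision and recall are the more informative measures.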
There are several ways to partially address this problem. Generating artificial data for the minor classes (via SMOTE algorithms and associated packages)⁸ is one method. Another is to down-sample data from the major class. We have chosen the latter path [i.e., subsampling the major class (Normal class) in proportion to the minor class (those classes that had at least 10% of the data)]. While subsampling the major class, we made sure that its maximum number did not exceed 100% of that of the minor class. Furthermore, weights were used for the training examples supplied to the RF classifier. Classes that had single-digit representation, namely Left ventricular hypertrophy (0.9%), Atrial Fibrillation or Flutter (1.1%), Ventricular Premature Contraction (PVC) (0.7%), Supraventricular Premature Contraction (0.4%), and Left bundle branch block (1.9%), were not addressed.

Variable selection for model building

Variable selection plays a major role in the development of predictive models. In this study, one of the reasons for selecting the RF classifier over other alternatives was that it has a means of assessing the effectiveness of each variable occurring in the model, which we can use to build a parsimonious model for deployment. The criteria by which RF ranks its “important variables” are “Mean Decrease Accuracy” and “Mean Decrease Gini.” We prefer the latter for selecting the important variables, because in some instances in the literature the other measure has been reported to be unstable.⁹ All variables whose Mean Decrease Gini value is greater than the mean of that measure across variables are retained in the model; in our case this is done by setting the criterion threshold to the mean value (see Fig. 2). The complete variable list with descriptions is provided in the online reference (http://archive.ics.uci.edu/ml/machine-learning-databases/arrhythmia/arrhythmia.names).
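The thresholding rule described above can be sketched in a few lines. This is an illustrative Python version, not the authors' R code: the function name is hypothetical, and the importance scores are made-up numbers standing in for the Mean Decrease Gini values that a trained forest would report.

```python
from statistics import mean


def select_important(importance):
    """Keep variables whose importance score exceeds the mean score."""
    threshold = mean(importance.values())
    kept = [name for name, score in importance.items() if score > threshold]
    return kept, threshold


# Made-up Mean Decrease Gini scores, for illustration only.
example_scores = {
    "heart_rate": 4.2,
    "qrs_duration": 3.1,
    "pr_interval": 0.9,
    "qt_interval": 1.4,
    "t_angle": 0.4,
}

kept, threshold = select_important(example_scores)
print(f"threshold = {threshold:.2f}")
print("retained:", kept)
```

Because the threshold is the mean of the scores themselves, the rule needs no tuning, and a skewed importance distribution (a few strong variables, many weak ones) leaves only a small subset in the model.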
Experimental Results and Discussion

We performed experiments on the classifier we developed to assess its predictive performance. The steps of the algorithm for classification using RF are enumerated below:

1. Read the comma-separated values of the Arrhythmia data from a text file as a table.
2. Identify and create a response variable showing which class each data point belongs to (the 280th column of the original data read as a table).
3. Make sure the data is complete: identify the columns with missing values and replace the missing values (occurring as “?”) with NA (required for imputation).
4. Assign names to the variables (for ease of identification).
5. Get rid of the variables with all-zero entries, of age, sex, height, and weight (i.e., the non-ECG values), and of the one specifying the arrhythmia type. (For imputation, we cannot afford to retain so many variables with so few records. One of the imputation methods used, Amelia, does not permit it.)
6. Perform imputation with rfImpute/Amelia.
7. Sample the imputed data judiciously (as described previously) from the respective classes, up to the maximum number of records each contains, except for the Normal class (code 01). For this class, try out 100, 90, 80, and 70 records. Toss a biased coin to generate indices between 1 and the number of records (rows) in the ratio 70:30. Generate the training and test sets using these indices.
8. Call Random Forest with the imputed data, number of trees = 500, and other parameters.
9. Call the Predict function on the test set.
10. Identify the important variables according to the specified criterion (MeanDecreaseGini) at the specified threshold value (set equal to the mean of MeanDecreaseGini).
11. Call Random Forest with the important variables, the training set, number of trees = 500, and other parameters.
12. Call the Predict function on the test set.
13. Go back to step 7 until the list (100, 90, 80, 70) is exhausted.
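The sampling and splitting in step 7, together with the loop in step 13, can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' R code: the helper names are hypothetical, the labels are toy data with only two classes, and the model fitting of steps 8 to 12 is left as a comment.

```python
import random


def downsample_major(labels, major_class, cap, rng):
    """Keep at most `cap` rows of the major class and all other rows."""
    major = [i for i, y in enumerate(labels) if y == major_class]
    others = [i for i, y in enumerate(labels) if y != major_class]
    return sorted(rng.sample(major, min(cap, len(major))) + others)


def coin_split(indices, train_prob, rng):
    """Toss a biased coin per row: heads -> training set, tails -> test set."""
    train, test = [], []
    for i in indices:
        (train if rng.random() < train_prob else test).append(i)
    return train, test


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy labels: 245 Normal rows (code 1) and 50 rows of class code 10.
    labels = [1] * 245 + [10] * 50
    for cap in (100, 90, 80, 70):  # the list iterated over in step 13
        kept = downsample_major(labels, 1, cap, rng)
        train, test = coin_split(kept, 0.7, rng)
        print(f"cap={cap}: kept={len(kept)} train={len(train)} test={len(test)}")
        # Steps 8-12 (fit the forest, predict, select variables, refit,
        # predict again) would run here on the train and test rows.
```

A per-row coin toss gives a split that is only approximately 70:30, so the exact training and test sizes vary from run to run unless the random seed is fixed.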
Table 1 shows the computation of precision and recall, which are defined as follows.

Recall (sensitivity): the number of correctly classified examples of a particular class divided by the number of examples of that particular class in the data.

\[ \text{recall} = \frac{|\{\text{correct labels}\} \cap \{\text{predicted labels}\}|}{|\{\text{correct labels}\}|} \]

Precision: the number of correctly classified examples of a particular class divided by the number of examples labeled by the system as belonging to that particular class.¹⁰

\[ \text{precision} = \frac{|\{\text{correct labels}\} \cap \{\text{predicted labels}\}|}{|\{\text{predicted labels}\}|} \]

F-score: a combination of the above two measures in the form of their harmonic mean.

\[ \text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

Table 1: Precision/Recall Computation (precision, recall, and F-score as defined above, in %, by number of records in the major class)

Records  Measure    Class 1  Class 2  Class 3  Class 4  Class 5  Class 6  Class 9  Class 10
100      Precision  96.43    66.67    100.0    50.0     33.33    85.71    100.0    62.50
100      Recall     58.69    71.40    75.0     100.0    100.0    75.00    100.0    66.66
100      F-score    72.97    68.97    85.71    66.7     49.96    79.99    100.0    64.53
90       Precision  78.26    72.73    75.0     33.33    33.33    87.5     66.67    80.00
90       Recall     52.94    53.33    50.0     50.00    100.0    87.5     100.0    54.54
90       F-score    63.15    61.54    60.0     39.98    49.96    87.5     79.99    64.86
80       Precision  89.29    71.43    66.67    50.0     50.0     75.00    100.0    55.56
80       Recall     78.12    55.55    100      50.0     100.0    90.00    100.0    83.33
80       F-score    83.33    62.53    80.0     50.0     66.67    81.82    100.0    66.71
70       Precision  89.47    75.00    0.00     83.33    33.33    80.00    100      86.67
70       Recall     65.38    70.59    0.00     83.33    100.0    88.80    100      68.42
70       F-score    75.55    72.73    0.00     83.33    49.96    84.17    100      76.47

100 − OOB error (%), with all variables and with important variables, values in the order given in the source: 67.29, 70.10, 71.01, 72.46, 64.50, 66.50, 61.66, 64.25.

As the system keeps operating in the field, more records for the various cases will be collected, together with the cardiologists' decisions for the respective records. A new RF classifier may be trained with these data, and it can then be combined incrementally with the one currently operating, using the combine() function of randomForest.

Implementation of R-based Classifier for Real-Time Analysis

R code can be executed from within a bash script, which allows us to invoke it from within a Java program (or any programming language or script, for that matter). Storm is an open-source real-time computation framework that allows us to process streams of data in parallel, making it a very good choice for classifying data on a cluster of nodes. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation.

The model file created in the previous step is referenced in another R script, which is used for real-time classification. The data to be classified enters the Storm framework via a spout, which then emits it to the bolts. Each bolt runs the R script in parallel and emits the results of the classification (which can be captured and used as needed), as shown in Fig. 3.

FIG. 3. Running R over Storm.

Note that for each result in Table 2, one node is a Nimbus node and the remaining nodes are supervisors. Each node has 8 quad-core CPUs, 32 GB of RAM, and 32 GB of swap space.

Table 2. ECG Classification Performance Analysis: time taken (in seconds)

Number of predictions (ECG categorizations)      20K     40K     0.1 million
Sequential processing (no Storm used)            3,600   7,200   18,300
Storm cluster with 2 nodes (1 spout, 8 bolts)    900     1,710   4,440
Storm cluster with 3 nodes (1 spout, 16 bolts)   450     900     2,400

Note: We made use of only one spout for this POC. Depending on the mechanisms of data entry into the Storm framework, it is possible to use multiple spouts, which would enhance performance further.

Concluding Remarks

This article has presented a real-time machine-learning platform for the healthcare domain that allows ECG signals to be classified.
It is an additional input for the physician, but a crucial one that facilitates care-for-value. The implication is that this work provides the basis for building a powerful analytical framework that can work in real time. This study could prove extremely useful, not only for ECG classification, but also for enabling physicians to get incremental analytics on the various kinds of patient data increasingly available in EHRs. Our study also enables incremental healthcare, where the focus can shift to analytics and, consequently, to customized real-time healthcare. The upcoming health exchanges may also benefit, as on-the-fly analytics on high-velocity data becomes essential for providers, physicians, and patients equally.

Author Disclosure Statement

All authors are employed by Impetus.

References

1. Gao D, Madden M, Chambers D, Lyons G. Bayesian ANN classifier for ECG arrhythmia diagnostic system: A comparison study. Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05) 2005; 4:2383-2388.
2. Rothman SA, et al. The diagnosis of cardiac arrhythmias: A prospective multi-center randomized study comparing mobile cardiac outpatient telemetry versus standard loop event monitoring. J Cardiovasc Electrophysiol 2007; 8:1-7.
3. Guvenir HA, Acar S, Demiroz G, Cekin A. A supervised machine learning algorithm for arrhythmia analysis. Comput Cardiol 1997; 7:433-436.
4. Breiman L. Random Forests. Mach Learn 2001; 45:5-32.
5. Liaw A. Missing value imputations by randomForest. R documentation. Available online at http://rss.acs.unt.edu/Rdoc/library/randomForest/html/rfImpute.html. (Last accessed on September 6, 2013).
6. Ishioka T. Imputation of missing values for unsupervised data using the proximity in random forests. In: Proceedings of The Fifth International Conference on Mobile, Hybrid, and On-line Learning. Nice, France, February 24-March 1, 2013.
7. Honaker J, King G, Blackwell M. AMELIA II: A program for missing data. J Stat Softw 2011; 45:1-47.
8. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 2013; 14:106. Available online at www.biomedcentral.com/1471-2105/14/106. (Last accessed on September 6, 2013).
9. Calle ML, Urrea V. Letter to the editor: Stability of random forest importance measures. Briefings Bioinf 2011; 12:86-89.
10. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag 2009; 45:427-437.

Address correspondence to:
Vijay Srinivas Agneeswaran, PhD
Innovation Labs
Impetus Infotech India Private Limited
Pritech Park SEZ, Bellandur
Outer Ring Road
Bangalore, Karnataka 560103
India
E-mail: vijay.sa@impetus.co.in