https://iaeme.com/Home/journal/IJARET 776 editor@iaeme.com
International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 10, Issue 2, March - April 2019, pp. 776-782, Article ID: IJARET_10_02_073
Available online at https://iaeme.com/Home/issue/IJARET?Volume=10&Issue=2
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: https://doi.org/10.34218/IJARET.10.2.2019.073
© IAEME Publication Scopus Indexed
A CONCEPTUAL APPROACH TO ENHANCE
PREDICTION OF DIABETES USING
ALTERNATE FEATURE SELECTION AND BIG
DATA BASED ARCHITECTURE
Chellammal Surianarayanan, Sharmila Rengasamy
Bharathidasan University Constituent Arts and Science College,
Affiliated to Bharathidasan University,
Srirangam, Tiruchirappalli, Tamil Nadu, India
ABSTRACT
Machine learning algorithms play a vital role in the prediction of many diseases, such as
heart disease, diabetes, cancer and lung disease. Applying machine learning algorithms to
the healthcare domain relieves the burden on physicians, because it is impractical to
manually scan all the data collected over a period of time in order to arrive at valuable
information. A machine learning algorithm learns from a training dataset; once it
completes its learning, it can automatically predict the target output label of unseen
data. In this work, the prediction of diabetes using machine learning algorithms is taken
up, and a conceptual architecture based on big data architecture is proposed.
Keywords: Diabetes prediction, big data based architecture for disease prediction,
prediction of diabetes using machine learning algorithms
Cite this Article: Chellammal Surianarayanan, Sharmila Rengasamy, A Conceptual
Approach to Enhance Prediction of Diabetes using Alternate Feature Selection and Big
Data Based Architecture, International Journal of Advanced Research in Engineering
and Technology, 10(2), 2019, pp 776-782.
https://iaeme.com/Home/issue/IJARET?Volume=10&Issue=2
1. INTRODUCTION
Diabetes mellitus is a chronic disease characterized by a high sugar level in the
bloodstream. It arises from the improper functioning of the pancreatic beta cells. It affects
different parts of the body and can lead to pancreatic disorders, heart disease, hypertension,
kidney problems, nerve damage, foot problems and vision problems such as glaucoma. Diabetes is
a rapidly growing disease that continues to affect people around the globe. Its diagnosis,
prediction, proper treatment and management are crucial. Data mining based forecasting
techniques for data analysis of diabetes
can help in the early detection and prediction of the disease. The main cause of diabetes remains
unknown, yet scientists believe that both genetic factors and environmental and lifestyle factors
play a major role. Even though diabetes is incurable, it can be managed with treatment and
medication. Individuals with diabetes face a risk of developing many secondary health issues,
so early detection and treatment of diabetes are all the more important.
From a survey of the related literature, the authors found that, although many research works
have addressed the prediction of diabetes, the accuracy of the reported methods varies, and
achieving the desired accuracy of diabetes prediction is still an open research issue. First,
for any prediction problem, identifying relevant features is an important prerequisite, as the
selection of features influences the prediction accuracy. Secondly, preprocessing of data also
plays a major role in the accuracy of prediction. Thirdly, the architecture in which the
prediction is performed also influences the prediction. Keeping these points in mind, a
conceptual architecture is proposed to improve the accuracy of diabetes prediction. More
specifically, the objective of the proposed research is to provide a conceptual model for
analyzing how the accuracy and the performance of prediction can be improved with alternate
feature selection/extraction techniques and classification techniques in a scalable environment
established using big data tools and techniques.
2. RELATED WORK
A significant amount of research is being carried out on the prediction of diabetes using
machine learning techniques [1-8]. Some research works have employed deep learning techniques
for the prediction of diabetes [9-12]. Despite the existence of different machine learning and
deep learning models, other aspects, namely preprocessing of data, feature
selection/extraction and the technologies used for data analysis, also influence the
prediction process.
One important investigation is the identification of a formula for computing the Diabetes
Pedigree Function [13]. That work provides a way to compute the diabetes pedigree function,
which depends on whether the ancestors of a particular person are affected by diabetes. In
this way, it incorporates genetic information about the disease. This work becomes important
when one validates, on real data, a machine learning model that was first constructed using a
benchmark dataset.
3. PROPOSED APPROACH
A preliminary study of the existing literature shows that the accuracy and performance of
diabetes prediction still need to be improved. The present work therefore aims to enhance the
prediction of diabetes with efficient preprocessing techniques, feature selection/extraction
techniques and classification of the disease using machine/deep learning techniques in a
scalable environment established using big data technologies such as Apache Spark. The
higher-level conceptual schematic of the proposed work is given in Fig. 1.
4. DATA COLLECTION
It is proposed to use the dataset from the National Institute of Diabetes and Digestive and
Kidney Diseases, which is publicly available at the UCI ML Repository [14]. The Pima Indian
Diabetes (PID) dataset has 9 attributes (8 features plus 1 class attribute) and 768 records
describing female patients, of which 500 are negative instances (65.1%) and 268 are positive
instances (34.9%). The detailed description of all attributes is given in Table 1.
4.1. Data Preprocessing
In this step, the collected data is preprocessed to handle incomplete or missing values. In
addition, other transformations, such as the conversion of continuous data into categorical
data as required by some classification algorithms, are also performed.
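The two preprocessing steps can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. It assumes that a zero in a physiological attribute such as glucose marks a missing measurement (a common convention for the PID dataset that the paper does not state), fills such gaps with the median of the observed values, and then converts the result to categories. The function names, sample readings and cut points are hypothetical.

```python
from statistics import median


def impute_zeros_with_median(values):
    """Replace zeros (assumed to mark missing readings) with the median of the non-zero values."""
    observed = [v for v in values if v != 0]
    if not observed:  # nothing to estimate the median from
        return list(values)
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


def discretize(values, cut_points, labels):
    """Map each value to labels[k], where k is the number of cut points <= the value."""
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return [labels[sum(v >= c for c in cut_points)] for v in values]


# Hypothetical glucose readings in mg/dl; 0 stands for a missing measurement.
glucose = [148, 85, 0, 89, 137, 0, 78]
glucose_filled = impute_zeros_with_median(glucose)

# Hypothetical cut points chosen only for illustration, not clinical thresholds.
glucose_levels = discretize(glucose_filled, cut_points=[100, 140],
                            labels=["low", "medium", "high"])

print(glucose_filled)
print(glucose_levels)
```

Median imputation is only one option; mean imputation or dropping incomplete records would fit the same step, and the choice is itself one of the alternatives the proposed study could compare.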
4.2. Feature Selection/Extraction
Fundamentally, any classification algorithm assigns an incoming data instance, which consists
of a set of attributes, to one of the predefined class labels. For example, for a given set of
attributes, we need to classify whether the person has diabetes or not. The attributes are the
essential aspects or factors used to classify an instance into one of the two classes.
Figure 1 Higher level schematic of the proposed approach
There are two major feature selection techniques, namely, filter methods and wrapper
methods. It is proposed to employ different feature selection techniques such as correlation
based feature selection, gain ratio method, info gain, PCA, Relief method etc. Correlation is a
measure of the linear relationship between each attribute and the class label. From the value of
the correlation measure, one can identify the relevant, highly correlated attributes. The gain
ratio method helps in reducing the bias towards multivalued attributes in a prediction problem.
Info gain computes the reduction in entropy and so determines the information gain of each
attribute with respect to the target label. The Relief algorithm uses instance based learning
and evaluates the quality of an attribute by analysing how well its values distinguish between
instances that lie near each other.
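Three of the filter measures named above can be sketched in a few lines of plain Python. The code is an illustrative assumption rather than the authors' implementation: the function names and the toy data are hypothetical, information gain and gain ratio are computed for categorical attributes only, and the correlation is the Pearson coefficient against a 0/1 class label.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation between a numeric attribute and a numeric (0/1) class label."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0  # constant column: report no correlation


def entropy(labels):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(attribute, labels):
    """Reduction in class entropy obtained by splitting on a categorical attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(attribute, labels):
    """Information gain divided by the entropy of the attribute's own value distribution."""
    split_info = entropy(attribute)
    return info_gain(attribute, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: eight records, four positive and four negative.
outcome = [1, 1, 1, 1, 0, 0, 0, 0]
glucose = [150, 160, 155, 165, 90, 95, 100, 85]
glucose_level = ["high", "high", "high", "high", "low", "low", "low", "low"]
record_id = ["a", "b", "c", "d", "e", "f", "g", "h"]  # unique per record

print(pearson(glucose, outcome))
print(info_gain(glucose_level, outcome), gain_ratio(glucose_level, outcome))
print(info_gain(record_id, outcome), gain_ratio(record_id, outcome))
```

In this toy data the record identifier receives the same information gain as the glucose level even though it cannot generalise, while its gain ratio is three times smaller. This is the bias towards multivalued attributes that the gain ratio method is meant to reduce.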
Table 1 Dataset description
S.No | Attribute Name | Attribute Description | Mean ± S.D.
1 | Pregnancies | Number of times a woman got pregnant | 3.8 ± 3.3
2 | Glucose (mg/dl) | Glucose concentration in oral glucose tolerance test for 120 min | 120.8 ± 31.9
3 | Blood Pressure (mmHg) | Diastolic blood pressure | 69.1 ± 19.3
4 | Skin Thickness (mm) | Fold thickness of skin | 20.5 ± 15.9
5 | Insulin (mu U/mL) | Serum insulin for 2 h | 79.7 ± 115.2
6 | BMI (kg/m^2) | Body Mass Index (weight/(height)^2) | 31.9 ± 7.8
7 | Diabetes Pedigree Function | Diabetes Pedigree Function | 0.4 ± 0.3
8 | Age | Age (years) | 33.2 ± 11.7
9 | Outcome | Class variable (1 for diabetes positive, 0 for diabetes negative) | -
4.3. Classification Techniques
After feature selection, it is proposed to use different machine learning algorithms such as
Naïve Bayes, Random Forest and Support Vector Machine. Naïve Bayes is a simple, probability
based multi-class classifier. A Random Forest classifier is a combination of many decision tree
classifiers and employs a voting mechanism to finalize the class label of a given data
instance. It is a fast algorithm and is capable of handling large data sizes. Support Vector
Machine handles non-linearity in data by using kernel functions to map the data into a higher
dimensional space.
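To make the probability based classification concrete, the following is a minimal sketch of a Naïve Bayes classifier in plain Python. The paper does not say which variant would be used; the Gaussian variant for continuous attributes is assumed here, and the function names and the six toy records (glucose, BMI) are hypothetical.

```python
from collections import defaultdict
from math import log, pi


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and the per-attribute mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior, treating attributes as independent."""
    best_label, best_score = None, float("-inf")
    for label, (prior, means, variances) in model.items():
        score = log(prior)
        for value, mean, var in zip(row, means, variances):
            # Log of the Gaussian density for this attribute value.
            score += -0.5 * log(2 * pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training records: [glucose (mg/dl), BMI (kg/m^2)].
rows = [[160, 35], [150, 33], [170, 38], [90, 24], [100, 26], [95, 22]]
labels = [1, 1, 1, 0, 0, 0]

model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [155, 34]))
print(predict_gaussian_nb(model, [92, 25]))
```

Working in log space avoids numerical underflow when many attribute likelihoods are multiplied together.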
4.4. Big Data based Architecture
It is proposed to carry out the analysis in an architecture that is capable of providing
scalability. So, it is proposed to perform the prediction in a big data architecture
established using Apache Spark. Apache Spark [15-16] is a platform for processing big data in a
distributed fashion. One of the important features of the Spark platform is its capability of
keeping data in memory rather than reading it from hard drives at each step, as Hadoop
MapReduce does. This feature provides fast performance. In addition, the Spark architecture is
fundamentally a distributed architecture and provides the required scalability. When deep
learning algorithms are employed, the volume of data involved in the analysis is typically
larger, and the big data based architecture facilitates efficient prediction.
The Spark platform consists of various components: Spark Core with the Resilient Distributed
Dataset (RDD), Spark SQL for structured data, the machine learning library (MLlib), the
streaming API and the graph processing API (GraphX), as shown in Fig. 2.
Figure 2 Components of Spark
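The scalability argument rests on splitting the data into partitions that are processed independently and then combining the partial results. The sketch below imitates that pattern in plain Python for computing the mean of one attribute. It is not Spark code and uses no Spark API; the function names, the partition count and the sample readings are illustrative assumptions. On a cluster, each partition would be held in the memory of a different worker.

```python
from functools import reduce


def split_into_partitions(values, num_partitions):
    """Distribute the values round-robin over a fixed number of partitions."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def partial_sum_and_count(partition):
    """Work done independently on one partition (the map side)."""
    return (sum(partition), len(partition))


def combine(left, right):
    """Merge two partial results (the reduce side)."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(values, num_partitions=3):
    """Mean of a non-empty list computed from per-partition partial results."""
    partitions = split_into_partitions(values, num_partitions)
    partials = [partial_sum_and_count(p) for p in partitions]  # these could run in parallel
    total, count = reduce(combine, partials, (0, 0))
    return total / count


# Hypothetical BMI readings.
bmi = [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0]
print(distributed_mean(bmi))
```

Because only a (sum, count) pair leaves each partition, the amount of data exchanged between workers stays small however large the partitions grow.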
5. EVALUATION
The performance of the algorithms will be evaluated using different measures such as accuracy,
computation time and scalability. Further, the classification performance will be calculated
using the confusion matrix. The confusion matrix is built from four counts, namely, True
Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN), which are
defined as follows.
• True Positive (TP) denotes that the actual value is true and the predicted value is true.
• False Positive (FP) denotes that the actual value is false and the predicted value is true.
• False Negative (FN) denotes that the actual value is true and the predicted value is false.
• True Negative (TN) denotes that the actual value is false and the predicted value is false.
A sample confusion matrix is given in Fig. 3.
Figure 3 Confusion matrix
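The four counts defined above can be obtained by comparing actual and predicted labels pair by pair. The following minimal sketch assumes a binary problem in which the label 1 marks a diabetes positive record; the function name and the sample labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels for eight test records.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)
```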
Further, the mathematical formulae for different evaluation measures, namely, accuracy,
precision, recall and F-score are given through the following equations
(i) Accuracy
Accuracy is calculated as the number of correct predictions divided by the total number of
samples:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$
(ii) Precision
Precision is calculated as the number of correct positive predictions divided by the total
number of positive predictions:
$$\text{Precision} = \frac{TP}{TP + FP}$$
(iii) Recall
Recall is calculated as the number of correct positive predictions divided by the total number
of actual positives:
$$\text{Recall} = \frac{TP}{TP + FN}$$
(iv) F – score
F-score is the harmonic mean of recall and precision, which summarizes both in a single
measure:
$$\text{F-score} = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
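The four measures can be computed directly from the confusion matrix counts, as in the sketch below. The counts are hypothetical and are chosen only so that they add up to the 268 positive and 500 negative records of the PID dataset; they are not results. Returning None when a denominator is zero is a design choice of this sketch, not something the paper prescribes.

```python
def evaluation_measures(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    precision = tp / (tp + fp) if (tp + fp) else None  # undefined without positive predictions
    recall = tp / (tp + fn) if (tp + fn) else None      # undefined without actual positives
    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        f_score = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Hypothetical classifier output on all 768 records.
hypothetical = evaluation_measures(tp=200, fp=60, fn=68, tn=440)

# A trivial classifier that labels every record as negative.
always_negative = evaluation_measures(tp=0, fp=0, fn=268, tn=500)

print(hypothetical)
print(always_negative)
```

The second case shows why accuracy is reported together with the other measures: labelling every record as negative already reaches an accuracy of 500/768 (about 65.1%) on this dataset, while its recall is zero.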
6. CONCLUSION
A conceptual model has been proposed for the prediction of diabetes using machine learning
algorithms, with the intention of improving the prediction accuracy in a big data based
architecture. In addition to the scalable architecture, it is proposed to use different feature
selection techniques for finding relevant attributes. The proposed architecture will be
implemented and analysed with the benchmark dataset mentioned above. After that, the
architecture will be validated with real data collected from individuals. Elaborate
experimentation will be performed with many datasets and a framework will be developed.
REFERENCES
[1] Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia
Comput. Sci. 2018, 132, 1578–1585.
[2] Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and
Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–
2528.
[3] Negi, A.; Jaiswal, V. A first attempt to develop a diabetes prediction method based on different
global datasets. In Proceedings of the 2016 Fourth International Conference on Parallel,
Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 237–
241.
[4] Soltani, Z.; Jafarian, A. A New Artificial Neural Networks Approach for Diagnosing Diabetes
Disease Type II. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 89–94.
[5] Somnath, R.; Suvojit, M.; Sanket, B.; Riyanka, K.; Priti, G.; Sayantan, M.; Subhas, B. Prediction
of Diabetes Type-II Using a Two-Class Neural Network. In Proceedings of the 2017
International Conference on Computational Intelligence, Communications, and Business
Analytics, Kolkata, India, 24–25 March 2017; pp. 65–71.
[6] Mamuda, M.; Sathasivam, S. Predicting the survival of diabetes using neural network. In
Proceedings of the AIP Conference Proceedings, Bydgoszcz, Poland, 9–11 May 2017; Volume
1870, pp. 40–46.
[7] Malik, S.; Khadgawat, R.; Anand, S.; Gupta, S. Non-invasive detection of fasting blood glucose
level via electrochemical measurement of saliva. SpringerPlus 2016, 5, 701.
[8] Perveen, S.; Shahbaz, M.; Guergachi, A.; Keshavjee, K. Performance Analysis of Data Mining
Classification Techniques to Predict Diabetes. Procedia Comput. Sci. 2016, 82, 115–121.
[9] Mohebbi, A.; Aradóttir, T.B.; Johansen, A.R.; Bengtsson, H.; Fraccaro, M.; Mørup, M. A deep
learning approach to adherence detection for type 2 diabetics. In Proceedings of the 2017 39th
Annual International Conference of the IEEE Engineering in Medicine and Biology Society
(EMBC), Jeju, Korea, 11–15 July 2017; pp. 2896–2899.
[10] Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to
Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094.
[11] Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting healthcare trajectories from medical
records: A deep learning approach. J. Biomed. Inform. 2017, 69, 218–229.
[12] Lekha, S.; Suchetha, M. Real-Time Non-Invasive Detection and Classification of Diabetes
Using Modified Convolution Neural Network. IEEE J. Biomed. Health Inform. 2018, 22, 1630–
1636.
[13] Smith, Jack, Everhart, J, Dickson, W, Johannes, Richard, “Using the ADAP Learning
Algorithm to Forecast the Onset of Diabetes Mellitus”, Proceedings of Annual Symposium on
Computer Applications in Medical Care, November 1988, from PubMed Central
[14] M. Lichman, "Pima Indians diabetes database," ed. Center for machine learning and intelligent
systems.: UCI Machine Learning repository.
[15] Abdul Ghaffar Shoro and Tariq Rahim Soomro, “Big Data Analysis: Apache Spark
Perspective”, Global Journal of Computer Science and Technology: Computer Software & Data
Engineering, Publisher: Global Journal Inc., USA, e-ISSN: 0975-4172, p-ISSN: 0975-4350,
Vol. 15, Issue 1, pp. 1-10, January 2015.
[16] Soumya Ounacer, Mohamed Amine Talhaoui, Soufiane Ardchir, Abderrahmane Daif and
Mohamed Azouazi, “A New Architecture for Real Time Data Stream Processing”, International
Journal of Advanced Computer Science and Applications (IJACSA), Vol. 8, Issue 11, pp. 44-
51, 2017.

Call Girl Service In Dubai #$# O56521286O #$# Dubai Call Girls
 
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
 
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...Patrakarpuram ) Cheap Call Girls In Lucknow  (Adult Only) 🧈 8923113531 𓀓 Esco...
Patrakarpuram ) Cheap Call Girls In Lucknow (Adult Only) 🧈 8923113531 𓀓 Esco...
 
FULL ENJOY - 9953040155 Call Girls in Paschim Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Paschim Vihar | DelhiFULL ENJOY - 9953040155 Call Girls in Paschim Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Paschim Vihar | Delhi
 
Downtown Call Girls O5O91O128O Pakistani Call Girls in Downtown
Downtown Call Girls O5O91O128O Pakistani Call Girls in DowntownDowntown Call Girls O5O91O128O Pakistani Call Girls in Downtown
Downtown Call Girls O5O91O128O Pakistani Call Girls in Downtown
 
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
 
FULL ENJOY - 9953040155 Call Girls in Old Rajendra Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Old Rajendra Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Old Rajendra Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Old Rajendra Nagar | Delhi
 
exhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptxexhuma plot and synopsis from the exhuma movie.pptx
exhuma plot and synopsis from the exhuma movie.pptx
 
Akola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service AkolaAkola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service Akola
 
Jeremy Casson - An Architectural and Historical Journey Around Europe
Jeremy Casson - An Architectural and Historical Journey Around EuropeJeremy Casson - An Architectural and Historical Journey Around Europe
Jeremy Casson - An Architectural and Historical Journey Around Europe
 
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad EscortsIslamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
 
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...
Charbagh / best call girls in Lucknow - Book 🥤 8923113531 🪗 Call Girls Availa...
 
Alex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson StoryboardAlex and Chloe by Daniel Johnson Storyboard
Alex and Chloe by Daniel Johnson Storyboard
 
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
 
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
 
FULL ENJOY - 9953040155 Call Girls in Indirapuram | Delhi
FULL ENJOY - 9953040155 Call Girls in Indirapuram | DelhiFULL ENJOY - 9953040155 Call Girls in Indirapuram | Delhi
FULL ENJOY - 9953040155 Call Girls in Indirapuram | Delhi
 
Bur Dubai Call Girls O58993O4O2 Call Girls in Bur Dubai
Bur Dubai Call Girls O58993O4O2 Call Girls in Bur DubaiBur Dubai Call Girls O58993O4O2 Call Girls in Bur Dubai
Bur Dubai Call Girls O58993O4O2 Call Girls in Bur Dubai
 
FULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | DelhiFULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
 

International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 10, Issue 2, March - April 2019, pp. 776-782, Article ID: IJARET_10_02_073
Available online at https://iaeme.com/Home/issue/IJARET?Volume=10&Issue=2
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: https://doi.org/10.34218/IJARET.10.2.2019.073
© IAEME Publication Scopus Indexed
https://iaeme.com/Home/journal/IJARET editor@iaeme.com

A CONCEPTUAL APPROACH TO ENHANCE PREDICTION OF DIABETES USING ALTERNATE FEATURE SELECTION AND BIG DATA BASED ARCHITECTURE

Chellammal Surianarayanan, Sharmila Rengasamy
Bharathidasan University Constituent Arts and Science College, Affiliated to Bharathidasan University, Srirangam, Tiruchirappalli, Tamil Nadu, India

ABSTRACT

Machine learning algorithms play a vital role in prediction of many diseases such as heart disease, diabetes, cancer, lung disease etc. The applicability of machine learning algorithms to healthcare domain relieves the burden of physicians as it is impractical to scan manually all the data collected over a period of time in order to arrive at some valuable information. Machine learning algorithms learn from the training dataset and they become capable of thinking like a human. Once the algorithm completes its learning with training dataset, it can automatically predict the target output label of any unseen data. In this work, predicting diabetes using machine learning algorithms has been taken up. A conceptual architecture has been proposed based on big data architecture.

Keywords: Diabetes prediction, big data based architecture for disease prediction, prediction of diabetes using machine learning algorithms

Cite this Article: Chellammal Surianarayanan, Sharmila Rengasamy, A Conceptual Approach to Enhance Prediction of Diabetes using Alternate Feature Selection and Big Data Based Architecture, International Journal of Advanced Research in Engineering and Technology, 10(2), 2019, pp 776-782.

1. INTRODUCTION

Diabetes mellitus is a chronic disease caused by a high level of sugar in the bloodstream. It results from the improper functioning of the pancreatic beta cells. It affects many parts of the body and is associated with pancreatic disorders, heart disease, hypertension, kidney problems, nerve damage, foot problems, vision problems, and glaucoma. Diabetes is a rapidly growing disease that continues to affect people around the globe. Its diagnosis, prediction, proper treatment, and management are crucial. Data mining based forecasting techniques for the analysis of diabetes data can help in the early detection and prediction of the disease. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and environmental and lifestyle factors play a major role. Even though diabetes is incurable, it can be managed by treatment and medication. Individuals with diabetes face a risk of developing many secondary health issues, so early detection and treatment are important.

From the investigation of related literature, the authors found that, although many research works address the prediction of diabetes, the accuracy of the reported methods varies. In addition, achieving the desired accuracy of diabetes prediction is still an open research issue. Three factors stand out. First, for any prediction problem, identifying relevant features is an important prerequisite, because the selection of features influences the prediction accuracy. Second, the preprocessing of data also plays a major role in the accuracy of prediction. Third, the architecture in which the prediction is performed also influences the prediction. Keeping these points in mind, a conceptual architecture is proposed to improve the prediction accuracy of diabetes. More specifically, the objective of the proposed research is to provide a conceptual model for analyzing how the accuracy and performance of prediction can be improved with alternate feature selection/extraction techniques and classification techniques in a scalable environment established using big data tools and techniques.

2. RELATED WORK

A significant amount of research is being carried out in the area of prediction of diabetes using machine learning techniques [1-8].
Some research works have employed deep learning techniques for the prediction of diabetes [9-12]. Beyond the choice of machine learning or deep learning model, other aspects, namely the preprocessing of data, feature selection/extraction, and the technologies used for data analysis, also influence the prediction process. One important investigation is the identification of a formula for computing the Diabetes Pedigree Function [13]. That work provides a way to compute the diabetes pedigree function, which depends on whether the ancestors of a particular person were affected by diabetes. In this way, it incorporates genetic information about the disease. This work becomes important when one validates a machine learning model that was first constructed using a benchmark dataset.

3. PROPOSED APPROACH

A preliminary study of the existing literature shows that the accuracy and performance of diabetes prediction still need to be improved. The present work therefore aims to enhance the prediction of diabetes with efficient preprocessing techniques, feature selection/extraction techniques, and classification of the disease using machine/deep learning techniques in a scalable environment established using big data technologies such as Apache Spark. The higher-level conceptual schematic of the proposed work is given in Fig. 1.

4. DATA COLLECTION

It is proposed to use the dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, which is publicly available from the UCI ML Repository [14]. The Pima Indian Diabetes (PID) dataset has 9 attributes (8 predictor attributes plus 1 class attribute) and 768 records describing female patients, of which 500 are negative instances (65.1%) and 268 are positive instances (34.9%). The detailed description of all attributes is given in Table 1.
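
As a minimal sketch of how the dataset described above might be represented in code, the following Python fragment defines one record with the nine attributes and checks the stated class distribution. The class name `PimaRecord`, the helper names, the comma-separated column order, and the sample row are assumptions made for illustration; they are not taken from the paper or from the repository files.

```python
from dataclasses import dataclass


@dataclass
class PimaRecord:
    """One patient record: 8 predictor attributes plus the class attribute."""
    pregnancies: int
    glucose: float            # mg/dl
    blood_pressure: float     # mmHg (diastolic)
    skin_thickness: float     # mm
    insulin: float            # mu U/mL
    bmi: float                # kg/m^2
    diabetes_pedigree: float
    age: int                  # years
    outcome: int              # 1 = positive, 0 = negative


def parse_row(line):
    """Parse one comma-separated row (assumed column order as in the dataclass)."""
    p, g, bp, st, ins, bmi, dpf, age, out = line.strip().split(",")
    return PimaRecord(int(p), float(g), float(bp), float(st), float(ins),
                      float(bmi), float(dpf), int(age), int(out))


def class_distribution(n_negative, n_positive):
    """Return the total record count and the percentage of each class."""
    total = n_negative + n_positive
    pct_negative = round(100 * n_negative / total, 1)
    pct_positive = round(100 * n_positive / total, 1)
    return total, pct_negative, pct_positive


# Hypothetical row, for illustration only (not an actual record).
record = parse_row("2,120,70,20,80,31.9,0.4,33,0")
print(record.outcome)                 # 0

# Counts stated in the text: 500 negative and 268 positive instances.
print(class_distribution(500, 268))   # (768, 65.1, 34.9)
```

The second call reproduces the figures quoted in the text: 500 + 268 = 768 records, with 65.1% negative and 34.9% positive instances.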

4.1. Data Preprocessing

In this step, the collected data is preprocessed to handle incomplete or missing values. In addition, other transformations, such as the conversion of continuous data into categorical data as required by some classification algorithms, are performed.

4.2. Feature Selection/Extraction

Fundamentally, a classification algorithm assigns an incoming instance, which consists of a set of attributes, to one of the predefined class labels. For example, given a set of attributes, we need to classify whether a person has diabetes or not. The attributes are the essential aspects or factors used to classify an instance into one of the two classes.

Figure 1 Higher level schematic of the proposed approach

There are two major feature selection techniques, namely, filter methods and wrapper methods. It is proposed to employ different feature selection techniques such as correlation based feature selection, the gain ratio method, info gain, PCA, and the Relief method. Correlation is a measure of the linear relationship between each attribute and the class label. From the value of the correlation measure, one can identify the relevant, highly correlated attributes. The gain ratio method helps to reduce the bias that arises when multivalued attributes are involved in a prediction problem. Info gain computes the reduction in entropy and determines the information gain of each attribute in the context of the target label. The Relief algorithm uses instance based learning and evaluates the quality of an attribute by analysing how well the attribute separates instances that are near to each other.

Table 1 Dataset description

| S.No | Attribute Name | Attribute Description | Mean ± S.D |
|------|----------------|-----------------------|------------|
| 1 | Pregnancies | Number of times a woman got pregnant | 3.8 ± 3.3 |
| 2 | Glucose (mg/dl) | Glucose concentration at 120 min in an oral glucose tolerance test | 120.8 ± 31.9 |
| 3 | Blood Pressure (mmHg) | Diastolic blood pressure | 69.1 ± 19.3 |
| 4 | Skin Thickness (mm) | Skin fold thickness | 20.5 ± 15.9 |
| 5 | Insulin (mu U/mL) | 2-hour serum insulin | 79.7 ± 115.2 |
| 6 | BMI (kg/m^2) | Body Mass Index (weight/(height)^2) | 31.9 ± 7.8 |
| 7 | Diabetes Pedigree Function | Diabetes Pedigree Function | 0.4 ± 0.3 |
| 8 | Age | Age (years) | 33.2 ± 11.7 |
| 9 | Outcome | Class variable (1 for positive, 0 for negative diabetes) | |

4.3. Classification Techniques

After feature selection, it is proposed to use different machine learning algorithms such as Naïve Bayes, Random Forest, and Support Vector Machine. Naïve Bayes is a simple probability based classifier that supports multiple classes. The Random Forest classifier is a combination of many decision tree classifiers, and it employs a voting mechanism to finalize the class label of a given instance.
Random Forest is a fast algorithm and is capable of handling large data sizes. Support Vector Machine handles non-linearity in data by using kernel functions to transform the data into a higher-dimensional space.

4.4. Big Data based Architecture

It is proposed to carry out the analysis in an architecture that is capable of providing scalability. The prediction will therefore be performed in a big data architecture established using Apache Spark. Apache Spark [15-16] is a platform used to process big data in a distributed fashion. One of the important features of the Spark platform is its capability of keeping data in memory rather than bringing the data from hard drives as in Hadoop. This feature provides fast performance. In addition, the Spark architecture is fundamentally a distributed architecture and provides the required scalability. When deep learning algorithms are employed, the volume of data involved in the analysis is typically larger, and the big data based architecture facilitates efficient prediction. The Spark platform consists of various components: Spark Core with the Resilient Distributed Dataset (RDD), Spark SQL for structured data, the machine learning library (MLlib), the streaming data API, and the graph processing API (GraphX), as shown in Fig. 2.
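
To illustrate how the info gain measure of Section 4.2 fits a distributed setting such as the one described above, the following sketch computes information gain from per-partition counts. It is plain Python, not Spark code: each partition is counted independently (a map step) and the counts are summed (a reduce step), which works because counts are additive. All function names and the toy data are hypothetical, and the attribute is assumed to be already discretised as described in Section 4.1.

```python
from collections import Counter
from math import log2


def partition_counts(rows):
    """Map step: count (attribute value, class label) pairs in one partition."""
    return Counter((value, label) for value, label in rows)


def merge_counts(partials):
    """Reduce step: partial counts are additive, so they are simply summed."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts)


def info_gain(joint):
    """Information gain H(class) - H(class | attribute) from joint counts."""
    label_totals = Counter()
    value_totals = Counter()
    for (value, label), c in joint.items():
        label_totals[label] += c
        value_totals[value] += c
    n = sum(label_totals.values())
    conditional = 0.0
    for value, n_value in value_totals.items():
        per_label = [c for (v, _), c in joint.items() if v == value]
        conditional += (n_value / n) * entropy(per_label)
    return entropy(label_totals.values()) - conditional


# Toy data split into two partitions: (discretised glucose level, outcome).
partitions = [
    [("high", 1), ("high", 1)],
    [("low", 0), ("low", 0)],
]
joint = merge_counts(partition_counts(p) for p in partitions)
print(round(info_gain(joint), 3))  # 1.0
```

In this toy example the attribute separates the two classes perfectly, so the gain equals the full class entropy of 1 bit. An attribute whose values are spread evenly over both classes would yield a gain of 0.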

Figure 2 Components of Spark

5. EVALUATION

The performance of the algorithms will be evaluated using different measures such as accuracy, computation time, and scalability. Further, the classification performance will be calculated using a confusion matrix. A confusion matrix is built from four counts, namely, True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN), which are defined as follows.

• True Positive (TP): the actual value is true and the predicted value is true.
• False Positive (FP): the actual value is false and the predicted value is true.
• False Negative (FN): the actual value is true and the predicted value is false.
• True Negative (TN): the actual value is false and the predicted value is false.

A sample confusion matrix is given in Fig. 3.

Figure 3 Confusion matrix

The mathematical formulae for the evaluation measures, namely, accuracy, precision, recall, and F-score, are given below.

(i) Accuracy

Accuracy is calculated as the number of all correct predictions divided by the total number of samples:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$$

(ii) Precision

Precision is calculated as the number of correct positive predictions divided by the total number of positive predictions:

$$\text{Precision} = \frac{TP}{TP + FP}$$

(iii) Recall

Recall is calculated as the number of correct positive predictions divided by the total number of actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

(iv) F-score

F-score is the harmonic mean of recall and precision, which measures both at the same time:

$$\text{F-score} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

6. CONCLUSION

A conceptual model has been proposed for the prediction of diabetes using machine learning algorithms, with the intention of improving the prediction accuracy in a big data based architecture. In addition to the scalable architecture, it is proposed to use different feature selection techniques for finding relevant attributes. The proposed architecture will be implemented and analysed with the benchmark dataset mentioned above. After that, the architecture will be validated with real data collected from individuals. Elaborate experimentation will be performed with many datasets, and a framework will be developed.

REFERENCES

[1] Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia Comput. Sci. 2018, 132, 1578–1585.
[2] Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–2528.
[3] Negi, A.; Jaiswal, V. A first attempt to develop a diabetes prediction method based on different global datasets. In Proceedings of the 2016 Fourth International Conference on Parallel, Distributed and Grid Computing (PDGC), Waknaghat, India, 22–24 December 2016; pp. 237–241.
[4] Soltani, Z.; Jafarian, A. A New Artificial Neural Networks Approach for Diagnosing Diabetes Disease Type II. Int. J. Adv. Comput. Sci. Appl. 2016, 7, 89–94.
[5] Somnath, R.; Suvojit, M.; Sanket, B.; Riyanka, K.; Priti, G.; Sayantan, M.; Subhas, B. Prediction of Diabetes Type-II Using a Two-Class Neural Network. In Proceedings of the 2017 International Conference on Computational Intelligence, Communications, and Business Analytics, Kolkata, India, 24–25 March 2017; pp. 65–71.
[6] Mamuda, M.; Sathasivam, S. Predicting the survival of diabetes using neural network. In Proceedings of the AIP Conference Proceedings, Bydgoszcz, Poland, 9–11 May 2017; Volume 1870, pp. 40–46.
[7] Malik, S.; Khadgawat, R.; Anand, S.; Gupta, S. Non-invasive detection of fasting blood glucose level via electrochemical measurement of saliva. SpringerPlus 2016, 5, 701.
[8] Perveen, S.; Shahbaz, M.; Guergachi, A.; Keshavjee, K. Performance Analysis of Data Mining Classification Techniques to Predict Diabetes. Procedia Comput. Sci. 2016, 82, 115–121.
[9] Mohebbi, A.; Aradóttir, T.B.; Johansen, A.R.; Bengtsson, H.; Fraccaro, M.; Mørup, M. A deep learning approach to adherence detection for type 2 diabetics. In Proceedings of the 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Jeju, Korea, 11–15 July 2017; pp. 2896–2899.
[10] Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Sci. Rep. 2016, 6, 26094.
[11] Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting healthcare trajectories from medical records: A deep learning approach. J. Biomed. Inform. 2017, 69, 218–229.
[12] Lekha, S.; Suchetha, M. Real-Time Non-Invasive Detection and Classification of Diabetes Using Modified Convolution Neural Network. IEEE J. Biomed. Health Inform. 2018, 22, 1630–1636.
[13] Smith, J.; Everhart, J.; Dickson, W.; Johannes, R. Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus. In Proceedings of the Annual Symposium on Computer Applications in Medical Care, November 1988. Available from PubMed Central.
[14] Lichman, M. Pima Indians Diabetes Database. Center for Machine Learning and Intelligent Systems, UCI Machine Learning Repository.
[15] Shoro, A.G.; Soomro, T.R. Big Data Analysis: Apache Spark Perspective. Global Journal of Computer Science and Technology: Computer Software & Data Engineering; Global Journal Inc., USA; e-ISSN: 0975-4172, p-ISSN: 0975-4350; 2015, 15(1), 1–10.
[16] Ounacer, S.; Talhaoui, M.A.; Ardchir, S.; Daif, A.; Azouazi, M. A New Architecture for Real Time Data Stream Processing. Int. J. Adv. Comput. Sci. Appl. 2017, 8(11), 44–51.