2. Heart Disease Prediction System using Exploratory Data Analysis
Mentored By-
Mr.Nagarjuna
Submitted By-
P.Pravallika(CSE/576)
N.Divya(CSE/570)
Y.Joshua(CSE/5A6)
N.MukeshKrishna(CSE/569)
3. TABLE OF CONTENTS:
1. Abstract
2. Introduction
3. Existing System
4. Proposed System
5. System Specifications
6. System Architecture
7. Problem Statement
8. Data Collection
9. Technology Used
10. Data Cleaning
11. Libraries Used
12. Tableau Dashboard
13. Conclusion
4. Abstract
● With an abundance of data available, healthcare is being transformed by
the application of machine learning.
● Cardiovascular disease is one of the most fatal conditions in the
world today. Correct diagnosis at an early stage is important because
time is a critical factor.
● We are building a Heart Disease Prediction system that estimates the
chance of heart disease. It applies algorithms such as Logistic
Regression and Random Forest to inputs such as age, gender, and
blood pressure, and outputs the chance of having heart disease.
5. Introduction
Heart disease predictor is an offline platform designed and developed to
explore machine learning. The goal is to predict a patient's health from
collected data, so as to detect configurations that put the patient at risk
and, in cases requiring emergency medical assistance, alert the appropriate
medical staff to the patient's situation.
We start from a dataset containing information on many patients, which we
consolidate into a complete form so that predictions can be made precisely.
The results of the predictions, derived from the predictive models
generated by machine learning, will be presented through several distinct
graphical interfaces according to the datasets considered. We then
critically assess the scope of our results.
6. Existing System
● Diagnosis of the disease depends solely on the doctor's intuition and the patient's
records.
● Researchers have used several data mining techniques to help specialists or
physicians identify heart disease. One of them is the Naïve Bayes algorithm.
● The disadvantages of this kind of prediction are that cardiovascular disease results
are not accurate and that it cannot handle enormous datasets of patient records.
Disadvantages:
● This practice leads to unwanted biases, errors, and excessive medical costs, which
affect the quality of service provided to patients.
● There are many ways in which a medical misdiagnosis can present itself.
7. Proposed System
● Machine learning techniques are used to increase the accuracy rate.
● The following machine learning algorithms can be applied to large
datasets to predict heart disease:
  1. Logistic Regression
  2. Random Forest
● The Logistic Regression algorithm is used to improve the accuracy of the
system.
● With these, the proposed system acts as a decision support system
and predicts the chance of heart disease.
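The two algorithms named above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn (which the slides do not mention), uses synthetic stand-in data rather than the Kaggle dataset, and the feature names, label rule, and example patient are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT real patient records): age, sex, resting blood pressure.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(30, 80, size=n)
sex = rng.integers(0, 2, size=n)
resting_bp = rng.normal(130, 15, size=n)
X = np.column_stack([age, sex, resting_bp]).astype(float)

# Hypothetical label rule, used only so the models have something to learn.
logit = 0.06 * (age - 55) + 0.04 * (resting_bp - 130) + 0.5 * sex
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Hold out a test set so train and test accuracy can be compared.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    # Scaling helps logistic regression converge; the forest does not need it.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
    }

# Predicted chance of heart disease for one hypothetical patient (age 62, male, BP 150).
patient = np.array([[62.0, 1.0, 150.0]])
risk = {name: float(model.predict_proba(patient)[0, 1]) for name, model in models.items()}

for name in models:
    print(name, results[name], "predicted risk:", round(risk[name], 3))
```

Comparing train and test accuracy side by side is one simple way to notice overfitting, which random forests are prone to on small datasets.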
8. System Specifications
Software requirements:
● OS : Windows
● Python : 2.7.x and above (Anaconda)
● setuptools and pip installed (for Python 3.6.x and above)
Hardware requirements:
● RAM : 4GB and Higher
● Processor : Intel i3 and above
● Hard Disk : 500GB minimum
10. Problem Statement
Machine learning allows building models that quickly analyze data and
deliver results, leveraging both historical and real-time data. This helps
healthcare service providers make better decisions on a patient's
disease diagnosis.
By analyzing the data, our project predicts the occurrence of the disease.
Such an intelligent system for disease prediction plays a major role in
controlling the disease and maintaining people's good health by
predicting disease risk accurately.
Machine learning algorithms can also help provide doctors with vital
statistics, real-time data, and advanced analytics on the patient's
disease, lab test results, blood pressure, family history, clinical trial
data, etc.
11. Data Collection
Data has been collected from Kaggle.
Data collection is the process of gathering and measuring information from many different
sources. To use the data we collect to develop practical Artificial Intelligence (AI) and
Machine Learning solutions, it must be collected and stored in a way that makes sense for the
business problem at hand.
What is Kaggle?
Kaggle is an online community of data scientists and machine learners, owned by Google
LLC. Kaggle allows users to find and publish data sets, explore and build models in a
web-based data science environment, work with other data scientists and machine learning
engineers, and enter competitions to solve data science challenges.
13. Testing Technologies
Anaconda(Python) - Anaconda is a free and open-source distribution of the
Python and R programming languages for scientific computing, that aims to
simplify package management and deployment.
Jupyter Notebook - The Jupyter Notebook is an open-source web application that
allows you to create and share documents that contain live code, equations,
visualizations and narrative text. Uses include: data cleaning and transformation,
numerical simulation, statistical modeling, data visualization, machine learning,
and much more.
14. Data Cleaning
Data cleaning is essentially the task of removing errors and anomalies from
data, or replacing observed values with the true values, to get more value
from analytics.
METHODS
● Get Rid of Extra Spaces.
● Select and Treat All Blank Cells.
● Convert Numbers Stored as Text into Numbers.
● Remove Duplicates.
● Highlight Errors.
● Change Text to Lower/Upper/Proper Case.
● Spell Check.
● Delete all Formatting.
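Several of the methods listed above can be sketched with pandas. This is an illustration under stated assumptions, not the project's cleaning script: the rows are made up, the column names are hypothetical, and filling blank ages with the median is just one possible treatment.

```python
import pandas as pd

# Made-up rows for illustration only; column names are hypothetical.
raw = pd.DataFrame(
    {
        "Age": ["63", " 45", "45", None],
        "Sex": [" Male", "female ", "female", "MALE"],
        "RestingBP": ["145", "130", "130", "120"],
    }
)

clean = raw.copy()

# Get rid of extra spaces and change text to lower case.
clean["Sex"] = clean["Sex"].str.strip().str.lower()

# Convert numbers stored as text into numbers (unparseable values become NaN).
for col in ["Age", "RestingBP"]:
    clean[col] = pd.to_numeric(clean[col].str.strip(), errors="coerce")

# Select and treat blank cells: here, fill missing ages with the median age.
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Remove duplicates (rows that became identical after the steps above).
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

The order matters here: stripping spaces and converting types first lets `drop_duplicates` catch rows that only differed in formatting.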
15. Libraries Used
1. pandas - pandas is a software library written for the Python programming
language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time series.
pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labeled” data
both easy and intuitive. It aims to be the fundamental high-level building
block for doing practical, real-world data analysis in Python.
2. NumPy - NumPy is a library for the Python programming language,
adding support for large, multi-dimensional arrays and matrices, along with
a large collection of high-level mathematical functions to operate on these
arrays.
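As a rough illustration of how the two libraries divide the work, the sketch below uses pandas for labeled, table-style operations and NumPy for array math. The values and column names are made up for the example and do not come from the project dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; column names are hypothetical.
df = pd.DataFrame(
    {
        "age": [52, 61, 47, 68],
        "resting_bp": [120, 150, 130, 160],
        "target": [0, 1, 0, 1],
    }
)

# pandas: labeled operations such as group-wise averages per target class.
mean_by_target = df.groupby("target")[["age", "resting_bp"]].mean()

# NumPy: vectorised math on the underlying array (z-score standardisation per column).
values = df[["age", "resting_bp"]].to_numpy(dtype=float)
standardised = (values - values.mean(axis=0)) / values.std(axis=0)

print(mean_by_target)
print(standardised)
```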
16. Tableau Dashboard
● Tableau is a business intelligence tool used to analyse data and
visualize insights in the form of graphs and charts.
● Users can develop and share interactive dashboards that show hidden
patterns, trends, density, and variation in the data.
● Tableau uses a centroid-based k-means clustering algorithm that divides
the data into K clusters.
● Dashboards are created from the dataset after applying the K-means
algorithm.
● It provides visually appealing clusters that help predict the occurrence
of heart disease from the given dataset.
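The general idea of centroid-based k-means can be sketched outside Tableau with NumPy. This is a plain version of Lloyd's algorithm written for illustration, not Tableau's implementation; the `kmeans` helper, the two features, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Synthetic two-feature data (e.g. age and resting blood pressure), not real patients.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[45.0, 120.0], scale=3.0, size=(50, 2))
group_b = rng.normal(loc=[65.0, 150.0], scale=3.0, size=(50, 2))
points = np.vstack([group_a, group_b])

centroids, labels = kmeans(points, k=2)
print(centroids)
```

Because k-means is unsupervised, the cluster labels carry no meaning by themselves; they only become useful once compared against a known outcome such as chest pain type.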
18. Conclusion
● The models we used to predict the probability of having heart disease are
Logistic Regression and Random Forest, as they are more accurate on numerical
variables. The model accuracy is 85% on the test and train data sets. This model
could be used in the medical field as it can predict heart disease.
● Heart stroke and vascular disease are major causes of disability and
premature death. Chest pain is the key to recognizing heart disease. In this
work, heart disease is predicted by considering major factors together with four
types of chest pain. K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms. Here the dataset is clustered, and
the occurrence of chest pain is predicted based on the clusters. Exploratory
data analysis using Tableau provided a visually appealing and accurate
clustering experience.