Incremental Data Clustering
DEPARTMENT OF CS-IT | SYMBIOSIS INSTITUTE OF TECHNOLOGY, PUNE
Project Semester – 8th
January 10 to April 20, 2019
Department of Computer Science Engineering & Information Technology Symbiosis Institute of
Technology, Pune
Final Year Project Report on
Analysis of Diabetes Dataset Using Distributed Incremental Clustering
Algorithm and AWS
Submitted by:
Aditya Maheshwari 15070121503
Avinash Barfa 15070121507
Kunal Gulati 15070121525
Palash Verma 15070121140
Under the Guidance of:
Dr. Preeti Mulay, Professor, CS and IT Department, SIT, Pune.
Mr. Rahul Joshi, Assistant Professor, CS and IT Department, SIT, Pune.
DECLARATION
We hereby declare that the project work entitled "Analysis of Diabetes Dataset Using
Distributed Incremental Clustering Algorithm and AWS" is an authentic record of our work
carried out as a requirement for the final year project for the award of the B.Tech degree in
Computer Science and Information Technology Engineering at Symbiosis Institute of Technology,
Pune, affiliated to Symbiosis International Deemed University, Pune, under the guidance of
Dr. Preeti Mulay and Mr. Rahul Joshi, during January 2019 to April 2019.
Aditya Maheshwari 15070121503
Avinash Barfa 15070121507
Kunal Gulati 15070121525
Palash Verma 15070121140
Date: ___________________
Certified that the above statement made by the student is correct to the best of our knowledge
and belief.
Dr. Preeti Mulay, Professor, CS and IT Department, SIT, Pune.
Mr. Rahul Joshi, Assistant Professor, CS and IT Department, SIT, Pune.
ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and assistance from
many people, and we are extremely fortunate to have received it throughout the completion of
our project work. Whatever we have achieved is only due to such guidance and assistance, and
we will not forget to thank them.
First and foremost, we express our thankfulness and praise to Symbiosis Institute of
Technology, Pune and the Department of Computer Science and Information Technology for giving
us this wonderful opportunity to undergo B.Tech project work, helping us learn and gain
great experience.
We would like to thank our HOD, Dr. Shraddha Phansalkar, for her valuable guidance,
supervision of this work and helpful suggestions.
We owe our profound gratitude and special thanks to Dr. Preeti Mulay for cooperating with us
and giving us her valuable time and information, and to Prof. Rahul Joshi who, in spite of being
extraordinarily busy with his duties, took time out to hear us, guide us and keep us on the
correct path through his untiring assistance, direction, encouragement, continuous support,
valuable ideas and constructive criticism throughout this project work.
At last, we are grateful to our respected teachers of the Department of Computer Science and
Information Technology, SIT, Pune, and to our family and friends for their help, encouragement
and co-operation during the project work.
Aditya Maheshwari 15070121503
Avinash Barfa 15070121507
Kunal Gulati 15070121525
Palash Verma 15070121140
List of Figures
Fig 1.1 Clusters
Fig 1.2 Data Flow
Fig 5.1 Entity Relation Diagram
Fig 5.2 Use Case Diagram
Fig 5.3 State Chart Diagram
Fig 5.4 Activity Diagram
Fig 6.1 Project Flow
Fig 6.2 Basic Clustering Algorithm Dev Plan
Fig 6.3 Incremental Clustering Algorithm Dev Plan
Fig 6.4 Project Timeline
Fig 7.1 System Model
Fig 7.2 Process Flow
Fig 7.3 Basic Clustering
Fig 7.4 Incremental Clustering
Fig 7.5 AWS Resource Status
Fig 7.6 AWS Service Help
Fig 7.7 Instance Accessing using Putty
Fig 7.8 WinSCP Instance File Structure
Fig 7.9 Putty Instance File Structure
Fig 7.10 Commands to Run python file
Fig 7.11 Python code Execution 1
Fig 7.12 Python code Execution 2
Fig 7.13 AWS Monitoring Console 1
Fig 7.14 AWS Monitoring Console 2
Fig 7.15 CPU Utilization at starting stage
Fig 7.16 CPU Utilization at intermediate stage
Fig 7.17 Network in Status
Fig 7.18 Network Hour Status
Fig 8.1 Required Parameters
Fig 8.2 Closeness Factor
Fig 8.3 Cluster 1
Fig 8.4 Cluster 2
Fig 8.5 Cluster 3
Fig 8.6 Cluster 4
TABLE OF CONTENTS
1. Introduction…………………………………………………………………………………. 7
1.1. Introduction to Clustering
1.2. Introduction to Cloud Environment
1.3. Overview on Diabetes
1.4. Data flow diagram showing CFBA’s statistical details
2. Literature Review………………………………………………………………………….. 13
3. Technical Requirements…………………………………………………………………… 14
3.1. Tools and software used
3.2. Technologies and languages used
4. Software Requirement Analysis……………………………………………………………. 18
4.1. Introduction
4.2. External Interface Requirements
4.2.1. User Interface
4.2.2. Hardware Interface
4.2.3. Software Interface
5. High Level Design……………………………………………………………………….… 20
5.1. Entity Relationship Diagrams
5.2. UML Diagrams
5.2.1. Use Case Diagram
5.2.2. State Chart Diagram
5.2.3. Activity Diagram
6. Project Plan………………………………………………………………………………… 24
6.1. Project Flow
6.2. Development Plan
6.3. Project Timeline
6.4. RMMM Plan
7. Implementation……………………………………………………………………………… 28
7.1. System Model
7.2. Process Flow
7.3. Selection of Attributes
7.4. Algorithm
7.5. Code
7.6. Deployment on AWS EC2
7.7. Monitoring Analysis
8. Result and Analysis…………………………………………………………………………. 44
8.1 Results
8.2 Analysis
8.3 Distributed View
8.4 Time Estimation
9. Conclusion and Future Scope………………………………………………………………. 50
10. References…………………………………………………………………………….…… 51
1. Introduction
1.1 Introduction to Clustering
Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense) to each other
than to those in other groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields, including machine
learning, pattern recognition, image analysis, information retrieval, bioinformatics, data
compression, and computer graphics.
Clustering can be considered the most important unsupervised learning problem; as with
every other problem of this kind, it deals with finding a structure in a collection of
unlabeled data.
A cluster is therefore a collection of objects which are "similar" to one another and
"dissimilar" to the objects belonging to other clusters.
We can show this with a simple graphical example:
Fig1.1 Clusters
In this case we easily identify the 4 clusters into which the data can be divided; the
similarity criterion is distance: two or more objects belong to the same cluster if they are
“close” according to a given distance (in this case geometrical distance). This is called
distance-based clustering.
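The distance criterion can be made concrete with a short sketch in Python (the 2-D points and cluster centres below are made up for illustration): a point joins the cluster whose centre is nearest under Euclidean distance.

```python
import math

def euclidean(p, q):
    # Geometrical (Euclidean) distance between two 2-D points.
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def assign(point, centres):
    # A point belongs to the cluster whose centre is "closest".
    return min(range(len(centres)), key=lambda i: euclidean(point, centres[i]))

centres = [(0, 0), (10, 0), (0, 10), (10, 10)]  # four well-separated clusters
print(assign((1, 2), centres))   # → 0 (closest to (0, 0))
print(assign((9, 9), centres))   # → 3 (closest to (10, 10))
```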
Another kind of clustering is conceptual clustering: two or more objects belong to the
same cluster if they define a concept common to all of them. In other words, objects are
grouped according to their fit to descriptive concepts, not according to simple
similarity measures.
Clustering Methods:
1. Density-Based Methods:
These methods treat clusters as dense regions of the space that differ from the
surrounding lower-density regions. They offer good accuracy and the ability to merge
clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with
Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
2. Hierarchical-Based Methods:
The clusters formed by these methods form a tree-like structure based on the hierarchy;
new clusters are formed using the previously formed ones.
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative
Reducing and Clustering using Hierarchies), etc.
3. Partitioning Methods:
These methods partition the objects into k groups, each partition forming one cluster.
An objective similarity criterion is optimized, most commonly distance. Examples:
K-means, CLARANS (Clustering Large Applications based upon RANdomized Search), etc.
4. Grid-Based Methods:
In these methods the data space is divided into a finite number of cells that form a
grid-like structure. All clustering operations performed on these grids are fast and
independent of the number of data objects. Examples: STING (Statistical Information
Grid), WaveCluster, CLIQUE (Clustering In QUEst), etc.
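The partitioning approach (method 3 above) can be sketched with a minimal pure-Python K-means implementation; the toy 2-D points below form two well-separated blobs, so the two centres settle on one blob each.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm: alternate the assignment and centroid-update steps.
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centres[c][0]) ** 2
                                + (p[1] - centres[c][1]) ** 2)
            groups[i].append(p)
        # Update step: move each centre to the mean of its group.
        for i, g in enumerate(groups):
            if g:
                centres[i] = (sum(p[0] for p in g) / len(g),
                              sum(p[1] for p in g) / len(g))
    return centres, groups

pts = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
centres, groups = kmeans(pts, 2)
print(sorted(len(g) for g in groups))  # two groups of three points each
```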
1.2 Introduction to Cloud Environment
Cloud computing refers to manipulating, configuring, and accessing hardware and software
resources remotely. It offers online data storage, infrastructure, and applications.
Cloud computing offers platform independence, as the software is not required to be
installed locally on the PC. Hence, cloud computing makes our business applications mobile
and collaborative.
There are certain services and models working behind the scenes that make cloud
computing feasible and accessible to end users. The following are the working models for
cloud computing:
Deployment Models
Deployment models define the type of access to the cloud, i.e., how the cloud is located.
A cloud can have any of four types of access: Public, Private, Hybrid, and Community.
PUBLIC CLOUD
The public cloud allows systems and services to be easily accessible to the general
public. Public cloud may be less secure because of its openness.
PRIVATE CLOUD
The private cloud allows systems and services to be accessible within an organization. It
is more secured because of its private nature.
COMMUNITY CLOUD
The community cloud allows systems and services to be accessible by a group of
organizations.
HYBRID CLOUD
The hybrid cloud is a mixture of public and private cloud, in which the critical activities
are performed using private cloud while the non-critical activities are performed using
public cloud.
1.3 Overview on Diabetes
Diabetes is a disease that occurs when insulin production in the body is inadequate or
the body is unable to use the produced insulin properly; as a result, blood glucose
becomes high. The body cells break down food into glucose, and this glucose needs to be
transported to all the cells of the body. Insulin is the hormone that directs the glucose
produced by breaking down food into the body's cells.
Any change in the production of insulin leads to an increase in blood sugar levels, and
this can lead to tissue damage and organ failure. Generally, a person is considered to be
suffering from diabetes when blood sugar levels are above the normal range (4.4 to 6.1
mmol/L).
Types of Diabetes:
The three main types of diabetes are described below:
1. Type 1
Though only about 10% of diabetes patients have this form of diabetes, there has
recently been a rise in the number of cases of this type in the United States. The
disease manifests as an autoimmune disease occurring at a very young age (below 20
years), hence it is also called juvenile-onset diabetes.
2. Type 2
This type accounts for almost 90% of diabetes cases and is commonly called adult-onset
diabetes or non-insulin-dependent diabetes. In this case the various organs of the body
become insulin resistant, and this increases the demand for insulin. At this point, the
pancreas doesn't make the required amount of insulin. To
keep this type of diabetes at bay, the patients have to follow a strict diet, exercise
routine and keep track of the blood glucose.
3. Gestational diabetes
A type of diabetes that tends to occur in pregnant women due to high sugar levels, as
the pancreas doesn't produce a sufficient amount of insulin. Left untreated, it can lead
to complications during childbirth. Controlling the diet and taking insulin can control
this form of diabetes.
Symptoms, Diagnosis and Treatment:
The common symptoms of a person suffering from diabetes are:
· Polyuria (frequent urination)
· Polyphagia (excessive hunger)
· Polydipsia (excessive thirst)
· Unexplained weight gain or weight loss
· Slow healing of wounds, blurred vision, fatigue, itchy skin, etc.
Urine and blood tests are conducted to detect diabetes by checking for excess blood
glucose. The commonly conducted tests for determining whether a person has diabetes
are:
· A1C Test
· Fasting Plasma Glucose (FPG) Test
· Oral Glucose Tolerance Test (OGTT).
1.4 Data flow diagram showing CFBA’s statistical details
Fig 1.2 Data Flow
2. Literature Review
Designing predictive models for diabetes diagnosis has been an active research area for
the past decade. Most of the models found in the literature are based on different
clustering algorithms and different modeling techniques.
[1] Preeti Mulay (2016) 'Threshold Computation to Discover Cluster Structure, a New
Approach', International Journal of Electrical and Computer Engineering (IJECE),
Vol. 6, No. 1, pp. 275–282.
With the spurt of data in (almost) all domains, it is essential to have modernized data
exploratory methods such as incremental clustering, cluster analysis and incremental
learning, to name a few. These methods are useful in varied applications which require
consistently handling an influx of new data and performing forecasting, decision making
and prediction. The purpose of this research paper is to broaden the abilities of the
"Incremental Clustering using Naïve Bayes and Closeness-Factor" (ICNBCF) [6] algorithm,
and to introduce a set of activities at the post-clustering phase. These activities
include validating cluster structures. These modifications proved to enhance the
resulting cluster structures.
[2] Preeti Mulay and Kulkarni, P.A. (2013) 'Knowledge augmentation via incremental
clustering: new technology for effective knowledge management', Int. J. Business
Information Systems, Vol. 12, No. 1, pp. 68–87.
This research paper uses a new statistical, error-based incremental clustering algorithm,
CFBA. It discusses knowledge augmentation and incremental learning using various
datasets, and finally gives a computational learning-theoretic perspective on incremental
learning. This research attempts to look at incremental clustering in a different and new
way. Software project development is carried out in various phases, each with planned and
actual data details. Alphanumeric attributes can be converted into a completely numeric
dataset by applying a weight-assignment algorithm. This converted numeric dataset is given
as input to the proposed, new, statistical, error-based, almost parameter-free algorithm
named 'closeness factor-based algorithm' (CFBA).
3. Technical Requirements
3.1 Tools and Software Used
I. SPYDER
• Spyder is a powerful scientific environment written in
Python, for Python, and designed by and for scientists,
engineers and data analysts.
• It offers a unique combination of the advanced editing,
analysis, debugging, and profiling functionality of a
comprehensive development tool with the data
exploration, interactive execution, deep inspection, and beautiful
visualization capabilities of a scientific package.
• Beyond its many built-in features, its abilities can be extended even
further via its plugin system and API. Furthermore, Spyder can also be
used as a PyQt5 extension library, allowing developers to build upon its
functionality and embed its components, such as the interactive console, in
their own PyQt software.
II. PUTTY
• PuTTY is a client program for the SSH, Telnet and
Rlogin network protocols.
• These protocols are all used to run a remote session on
a computer, over a network. PuTTY implements the
client end of that session: the end at which the session is displayed, rather
than the end at which it runs.
• PuTTY has been ported to various other operating systems. Official ports
are available for some Unix-like platforms, with work-in-progress ports to
Classic Mac OS and macOS, and unofficial ports have been contributed to
platforms such as Symbian, Windows Mobile and Windows Phone.
III. WinSCP
• WinSCP is a free file transfer tool for Windows
that supports FTP, SFTP and SCP. It provides a
Windows Explorer style interface that lets you
drag and drop files or folders between local and
remote locations.
• Its main function is secure file transfer between a local and a remote
computer. Beyond this, WinSCP offers basic file manager and file
synchronization functionality.
• For secure transfers, it uses Secure Shell (SSH) and supports the SCP
protocol in addition to SFTP.
IV. Microsoft Excel
• Microsoft Excel provides a grid interface to organize
nearly any type of information.
• The learning curve for Excel is very short, so it's easy to use Excel and be
productive right away. Rare are the situations where IT staff must create
spreadsheets; information workers can do it for themselves.
• Excel makes it easy to store data, perform numerical calculations, format
cells, and adjust layouts to generate the output and reports to share with
others. Advanced features such as: subtotals, power pivot tables and pivot
charts, analysis toolkit, and many templates make it easy to accomplish a
wide range of tasks.
• It can even integrate with the Analytic Services (Business Intelligence)
from SQL Server. Tweaking the results is also very easy to get the exact
layout, fonts, colors etc. that you want.
3.2 Technologies and Languages Used
I. PYTHON
• Python is a full-fledged all-round language. It's an
interpreted, interactive, object-oriented, extensible
programming language.
• It has efficient high-level data structures and a
simple but effective approach to object-oriented
programming. Python’s elegant syntax and dynamic
typing, together with its interpreted nature, make it
an ideal language for scripting and rapid application development in many
areas on most platforms.
• Python supports multiple programming paradigms, including object-
oriented, imperative and functional programming or procedural styles. It
features a dynamic type system and automatic memory management and
has a large and comprehensive standard library.
• Python is widely used in Artificial Intelligence, Natural Language
Generation, Neural Networks and other advanced fields of Computer
Science, and it has a deep focus on code readability.
II. AMAZON WEB SERVICES
Amazon Web Services (AWS) is a
subsidiary of Amazon that provides
on-demand cloud computing
platforms to individuals, companies
and governments, on a metered pay-
as-you-go basis. In aggregate, these
cloud computing web services provide
a set of primitive, abstract technical infrastructure and distributed
computing building blocks and tools. One of these services is Amazon
Elastic Compute Cloud, which allows users to have at their disposal a
virtual cluster of computers, available all the time, through the Internet.
AWS's version of virtual computers emulate most of the attributes of a
real computer including hardware (CPU(s) & GPU(s) for processing,
local/RAM memory, hard-disk/SSD storage); a choice of operating
systems; networking; and pre-loaded application software such as web
servers, databases, CRM, etc.
III. AMAZON ELASTIC COMPUTE CLOUD
Amazon Elastic Compute Cloud
(Amazon EC2) is a web service that
provides secure, resizable compute
capacity in the cloud. It is designed to
make web-scale cloud computing easier
for developers.
Amazon EC2’s simple web service
interface allows you to obtain and
configure capacity with minimal friction. It provides you with complete
control of your computing resources and lets you run on Amazon’s proven
computing environment. Amazon EC2 reduces the time required to obtain
and boot new server instances to minutes, allowing you to quickly scale
capacity, both up and down, as your computing requirements change.
Amazon EC2 changes the economics of computing by allowing you to
pay only for capacity that you actually use. Amazon EC2 provides
developers the tools to build failure resilient applications and isolate them
from common failure scenarios.
IV. CLOUD WATCH
• AWS CloudWatch is a monitoring and management service built for
developers, system operators, and IT managers.
• It monitors your Amazon Web Services (AWS) resources and the
applications you run on AWS in real time.
• It collects and tracks metrics, collects and monitors log files, sets alarms, and
automatically reacts to changes in your AWS resources.
• It lets you set high-resolution alarms, take automated actions, troubleshoot
issues, and discover insights to optimize your applications and ensure they
are running smoothly.
4. Software Requirement Analysis
4.1 Introduction
This part of the document is a comprehensive description of the intended purpose of the
research project. It fully describes what the Closeness Factor Based Algorithm (CFBA)
will do when applied to a large consistent dataset and how it helps generate clusters
(using the basic and incremental clustering approaches) which, upon further analysis,
help find hidden relationships/patterns in the data. This document also lists the
necessary requirements for the research project.
It describes how large data can be mined using data mining techniques like clustering
and includes a set of use cases that describe how raw data is transformed into useful
information.
4.2 External Interface Requirements
4.2.1 User Interfaces
This is a research project with the objective of analyzing the generated datasets,
using a clustering data mining algorithm to create clusters and to find hidden
relations/patterns between various parameters of a diabetic patient. Therefore, at
this stage no particular user interface has been defined.
4.2.2 Hardware Interfaces
4.2.3 Software Interfaces
Closeness Factor Based Algorithm (CFBA) –
• Initial phase – A completely new dataset forms the basic clusters.
• Incremental phase – It compares the closeness value with the already formed
clusters (cluster centres) and decides whether to append to matching
clusters.
• Final phase – The updated set of clusters is made available to the analyst for
taking decisions based on augmented knowledge.
For implementing the Closeness Factor Based Algorithm (CFBA), the PuTTY software
is required with –
• Version 0.71 or later
• Build platform: 64-bit x86 Windows
5. High Level Design
5.1 Entity Relationship Diagram
Fig 5.1 Entity Relationship Diagram
5.2 UML Diagrams
5.2.1 Use Case Diagram
Fig 5.2 Use Case Diagram
5.2.2 State Chart Diagram
Fig 5.3 State Chart Diagram
5.2.3 Activity Diagram
Fig 5.4 Activity Diagram
6. Project Plan
6.1 Project Flow
Fig 6.1 Project Flow
6.2 Development Plan
Fig 6.2 Basic Clustering Algorithm Development Plan
Fig 6.3 Incremental Clustering Algorithm Development Plan
6.3 Project Timeline
Fig 6.4 Project Timeline
6.4 RMMM Plan
The goal of the risk mitigation, monitoring and management plan is to identify
as many potential risks as possible.
Risk Description
• Project Risks: Identify potential schedule, personnel, resource, and
requirements problems and their impact on the project. They threaten the project
plan: if project risks become real, it is likely that the project schedule will
slip and that costs will increase.
• Technical Risks: Identify potential design, implementation, interface,
verification, and maintenance problems. Technical risks threaten the quality
and timeliness of the software to be produced. If a technical risk becomes a
reality, implementation may become difficult or impossible.
• Network Risks: These include network failure and other network-related
risks.
• Support Risk: The degree of uncertainty that the resulting software will be
easy to correct, adapt, and enhance.
7. Implementation
7.1 System Model
Fig 7.1 System Model
7.2 Process Flow
The model shown below is a modification of the general waterfall process model and
presents the basic architectural flow of our research project. It contains five
stages, namely: topic exploration, data collection and exploration, data cleansing,
exploratory data mining and, lastly, integration and verification. Each of these stages
is detailed below –
Fig 7.2 Process Flow
Step-1: Topic exploration
· Subject and topic selection: Selection of subject for research and providing a brief
detail behind its selection.
· Survey of available literature: Check the availability and legitimacy of necessary data
and facts needed to be mined for information generation.
· Review background information: Use of research guides and websites with respect to
the subject of research to gather necessary information in order to move ahead in the right
direction.
· Refine research matter and develop initial hypothesis: Creating a detailed question
from the information gathered on the selected topic in order to develop an initial
hypothesis which will be answered by this research paper.
Step-2: Data collection and explore focused information
· To collect information in a more focused manner based on the research problem. This
involves understanding and uncovering hidden relations and important aspects in order to
determine the various data parameters needed to create data sets which will be mined for
information in the later stages. Collection of data sets: Collecting data from professional
clinics and hospitals as well as from online available data warehouses to generate data sets
for the objective stated.
Step-3: Extraction, cleansing of data for analysis
• This stage involves checking consistency and validity of data gathered by
analyzing and removing data sets which depict inconsistencies such as: missing
data parameter, invalid data parameters.
Step-4: Exploratory data mining
• This step involves searching or developing data mining algorithms which will help
confirm the hypothesis stated for the research project. Here, use of data mining
tools take place in order to use or build algorithms to be used on the data sets
collected.
Step-5: Integration and verification
• This step is where the data mining algorithm is applied to the collected and
pre-processed datasets to generate beneficial information which helps confirm the
initial hypothesis stated for the research project and provides a basis for
uncovering essential hidden relationships among the parameters used.
7.3 Selection of Attributes
CFBA and Incremental Algorithm was applied to the diabetes patient’s datasets taking
into account 8 attributes. These attributes were chosen based on common occurrence in
every diabetic patient’s report and their relative importance with respect to factors
triggering the disease.
The input attributes being utilized as parameters in the study are:
Sr. No Attribute Description
1. Pregnancies Number of times pregnant
2. Glucose Plasma glucose concentration at 2 hours in an oral glucose
tolerance test
3. BloodPressure Diastolic blood pressure (mm Hg)
4. SkinThickness Triceps skin fold thickness (mm)
5. Insulin 2-Hour serum insulin (mu U/ml)
6. BMI Body mass index (weight in kg/ (height in m) ^2)
7. DiabetesPedigreeFunction Diabetes pedigree function
8. Age Age (years)
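For reference, these attribute names correspond to the widely used public Pima Indians Diabetes dataset. A minimal sketch of reading such a dataset.csv (two illustrative rows are inlined here instead of the real file) with Python's standard csv module:

```python
import csv
import io

# Column names as listed in the table above.
COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Two illustrative rows; in the project the data comes from dataset.csv.
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
6,148,72,35,0,33.6,0.627,50
1,85,66,29,0,26.6,0.351,31
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
assert list(rows[0].keys()) == COLUMNS
print(len(rows))  # → 2
```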
7.4 Algorithm
The present work intends to implement an incremental clustering algorithm in order to
analyze its results, this algorithm is a modified version of CFBA (Closeness Factor based
algorithm) which gives new findings in terms of grouping patients with similar diabetic
parameters thus make doctor’s task easy in providing treatment.
Algorithm is implemented using python programming on the Anaconda IDE (Spyder
version 3.2.8) and putty(Version 0.71) after AWS deployment.
7.4.1 Basic clustering Algorithm:
1. Enter the name of dataset.csv file to be imported.
2. For all the values in dataset
a) Calculate the row total
b) Calculate the column total
3. Calculate the summation of row totals and store it in a variable, e.g. sum
4. For all the tuples / rows
a) Calculate Probability using the formula
Probability = row_total / (row_total + next_row_total)
b) Calculate Error using the probability (calculated in 4(a)) and the summation of
row totals (calculated in 3)
error = (probability * sum) / sqrt(sum * probability * (1 - probability))
c) Calculate Weight using the row total (calculated in 2(a))
weight = sqrt(row_total)
5. For calculating the Closeness Factor, each tuple is paired with the rest of the tuples
through looping, and each pair is then processed to calculate the closeness factor for that
particular pair using the operations below:
a) Calculate the summation of both tuples for each parameter
b) Calculate the Error for each parameter
ex1 = probability of primary tuple * parameter summation
(calculated in 5(a))
ex2 = primary tuple parametric value
ex3 = probability of primary tuple * parameter summation
(calculated in 5(a))
ex4 = 1 - probability of primary tuple
Error = (ex1 - ex2) / sqrt(ex3 * ex4)
c) Calculate Error square using the Error (calculated in 5(b))
d) Calculate Weight using the parameter summation (calculated in 5(a))
Weight = sqrt(parameter_summation)
e) Calculate the multiplication of Error square (calculated in 5(c)) and
Weight (calculated in 5(d))
Mul = Error_square * Weight
f) Calculate the Closeness Factor
Closeness Factor = (summation of Mul (calculated in 5(e)) over
all parameters) / (summation of Weight
(calculated in 5(d)) over all parameters)
Repeat all the above operations in step 5 for all the tuples in the dataset.
6. Attach an index to each Closeness Factor value in order to identify the pair of
tuples for which the Closeness Factor was calculated.
7. After indexing, sort the Closeness Factor values in ascending order.
8. Determine the Number of clusters and the range of clusters explicitly.
9. Compare the sorted Closeness Factor values with the ranges and store the values,
with their indices, into the clusters for the matching range.
10. Write all the above data into an Excel file or CSV file, or into the same
dataset.csv file in different sheets.
11. End
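The steps above can be condensed into a small Python sketch (the toy data and cluster ranges are illustrative only; the full implementation lives in the repository cited in Section 7.5):

```python
import math
from itertools import combinations

def closeness_factor(a, b):
    # Closeness factor for one pair of tuples (step 5 of the algorithm).
    p = sum(a) / (sum(a) + sum(b))           # probability from the two row totals
    mul_sum = weight_sum = 0.0
    for x, y in zip(a, b):
        s = x + y                                         # per-parameter summation (5a)
        err = (p * s - x) / math.sqrt(p * s * (1 - p))    # error (5b)
        w = math.sqrt(s)                                  # weight (5d)
        mul_sum += (err ** 2) * w                         # error-square * weight (5e)
        weight_sum += w
    return mul_sum / weight_sum                           # closeness factor (5f)

def basic_clusters(data, ranges):
    # Pair every tuple with every other, compute CF, sort, bucket by range (steps 5-9).
    cfs = sorted((closeness_factor(a, b), (i, j))
                 for (i, a), (j, b) in combinations(enumerate(data), 2))
    clusters = [[] for _ in ranges]
    for cf, pair in cfs:
        for k, (lo, hi) in enumerate(ranges):
            if lo <= cf < hi:
                clusters[k].append((pair, cf))
                break
    return clusters

# Toy 3-attribute rows; real input has the 8 attributes of Section 7.3.
data = [[6, 148, 72], [1, 85, 66], [8, 183, 64]]
clusters = basic_clusters(data, [(0.0, 1.0), (1.0, 10.0)])
print([len(c) for c in clusters])
```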
In order to process additional datasets, the incremental algorithm is used, which is an
extended version of the basic clustering algorithm.
7.4.2 Incremental Clustering Algorithm:
1. Enter the name of the Additional_dataset.csv file to be imported.
2. Perform steps 1 to 7 of the basic clustering algorithm above.
3. After performing the above operations, instead of explicitly defining new cluster
ranges, compare the new Closeness Factor values for the additional dataset with the
predefined ranges and append the values into the clusters accordingly; if a value does
not fit in any of the ranges, a new cluster is created and the value is stored and
appended into that cluster.
4. Write the above data into the same Excel or CSV file.
5. End
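Step 3 above can be sketched as follows; `cluster_ranges` (the (low, high) bounds fixed by the basic run) and `clusters` (a dict mapping cluster number to a list of (index, CF) pairs) are hypothetical names used for illustration:

```python
def assign_incremental(cf_values, cluster_ranges, clusters):
    # Number reserved for the new cluster holding values that fit no range
    overflow = len(cluster_ranges) + 1
    for idx, cf in cf_values:
        for k, (low, high) in enumerate(cluster_ranges, start=1):
            if low <= cf < high:
                clusters.setdefault(k, []).append((idx, cf))  # append to existing cluster
                break
        else:
            # CF value fits no predefined range: store it in a new cluster
            clusters.setdefault(overflow, []).append((idx, cf))
    return clusters
```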
7.5 Code
Following the algorithm described above, we have implemented the code in Python; it is
available in the Git repository:
https://github.com/avinashbarfa/Increamental-Clustering
Fig 7.3 Basic Clustering
Fig 7.4 Incremental Clustering
• Input File Format – CSV (Comma Separated Values)
• Output File Format – XLSX (Excel Workbook)
o ClusterData.xlsx – Contains Clusters
o Basic_Diabetes_data.xlsx – Contains parameters calculated for clustering.
o Incremenatl_Diabetes_data.xlsx – Contains parameters calculated for incremental
clustering.
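The multi-sheet Excel output (one sheet per cluster, as in ClusterData.xlsx) can be sketched with pandas' ExcelWriter; `write_clusters` is a hypothetical helper name, and the sketch assumes openpyxl is installed:

```python
import pandas as pd

def write_clusters(clusters, path="ClusterData.xlsx"):
    # One sheet per cluster; each row is a (tuple index, Closeness Factor) pair
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        for k, pairs in sorted(clusters.items()):
            df = pd.DataFrame(pairs, columns=["Tuple Index", "Closeness Factor"])
            df.to_excel(writer, sheet_name=f"Cluster {k}", index=False)
```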
7.6 Deployment on AWS EC2
At this stage we finally achieve the project goal. After deploying and testing the code on
the local machine, we deployed it on an Amazon EC2 instance.
We selected the following Amazon Machine Image (AMI) and EC2 configuration:
Operating System – Ubuntu 18.04 LTS (64-bit)
Volume Type – SSD
The following is the status of the AWS resources used in the project.
Fig 7.5 AWS Resource Status
After launching the instance, we use PuTTY to access it through the floating/public IP
address and the security key provided by Amazon when creating the instance.
Fig 7.6 AWS Service Health
Fig 7.7 Accessing the Instance Using PuTTY
Now we have to configure the instance to run our algorithm, i.e. we have to install Python 3 and
its libraries NumPy, pandas and openpyxl (the Excel writer). The following commands are to be
run on the instance in sequence:
sudo apt-get update
sudo apt install python3-pip
sudo apt install python3-numpy
sudo apt install python3-pandas
sudo apt install python3-openpyxl
Now our instance is configured as per our requirements. Next we upload our Python
algorithm to the EC2 machine using WinSCP.
We upload the Python file and the data sheet (in CSV format) into a newly created
directory, ProjectWork.
Fig 7.8 WinSCP Instance File Structure
Fig 7.9 PuTTY Instance File Structure
After entering the ProjectWork directory in PuTTY, we type the command
python3 pythonfile to execute the algorithm.
Fig 7.10 Command to Run Python File
After hitting Enter, the code begins to execute on the Amazon EC2 instance; the following are
screenshots taken while the code executes.
Fig 7.11 Python Code Execution -1
Fig 7.12 Python Code Execution -2
7.7 Monitoring Analysis
Amazon provides its monitoring service, CloudWatch, which enables the user to monitor an
instance while it is in the active state.
CloudWatch can be accessed through the Amazon AWS console page.
Fig 7.13 AWS Monitoring Console -1
Fig 7.14 AWS Monitoring Console -2
Below are screenshots of the CPU utilization and network traffic, taken using CloudWatch
while our code executes.
Fig 7.15 CPU Utilization at Starting Stage
Fig 7.16 CPU Utilization at Intermediate Stage
Fig 7.17 Network In Status
Fig 7.18 Network Out Status
8. Result & Analysis
8.1 Results
This section shows the results obtained after executing the CFBA algorithm on
the diabetes dataset used in the project, taken from Kaggle (PIMA dataset).
The dataset consists of 8 attributes and 768 instances. All patients are of
Pima Indian heritage and belong to the age group of 20-70 years.
The study found relations between various parameters and carried out cluster
analysis, identifying which parameters are the underlying causes of a data
series being an outlier.
The observations are as follows –
1) Finding the parameters required for CFBA analysis:
After reading the CSV file, we calculated the row total and column total and,
from these, the values of the parameters Probability, Error and Weight.
Fig 8.1 Required Parameters
2) Finding the CF values using CFBA:
For the CFBA algorithm, we further calculated the parameters Error and Weight
and their product, which are used to calculate the Closeness Factors for the
data series.
Fig 8.2 Closeness Factor
8.2 Analysis
The aim of our project can be broken down into two primary objectives:
1. Organizing the closeness values according to specific ranges.
2. Generating clusters according to the above ranges.
We have analyzed two datasets, PIMA Diabetes and WINE, with this distributed
incremental clustering algorithm; additional datasets can also be processed and
analyzed. The primary dataset taken into consideration was the PIMA Diabetes
dataset.
1. Organizing the closeness values according to specific ranges:
Here, we have defined custom ranges for the closeness values –
1) Cluster 1 – From 0.0001 to 0.066151499
2) Cluster 2 – From 0.066151499 to 0.106095987
3) Cluster 3 – From 0.106095987 to 0.156869097
4) Cluster 4 – Above 0.156869097
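Given these ranges, assigning values to clusters reduces to a simple binning, sketched below with Cluster 4 treated as open-ended; `bin_closeness` is a hypothetical helper name:

```python
# The four ranges defined above, as (cluster number, low, high)
RANGES = [
    (1, 0.0001,      0.066151499),
    (2, 0.066151499, 0.106095987),
    (3, 0.106095987, 0.156869097),
    (4, 0.156869097, float("inf")),   # "Above 0.156869097"
]

def bin_closeness(cf_values):
    # cf_values: iterable of (index, Closeness Factor) pairs
    clusters = {k: [] for k, _, _ in RANGES}
    for idx, cf in cf_values:
        for k, low, high in RANGES:
            if low <= cf < high:
                clusters[k].append((idx, cf))
                break
    return clusters
```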
2. Generating clusters, according to the above ranges:
Fig 8.3 Cluster 1
Fig 8.4 Cluster 2
Fig 8.5 Cluster 3
Fig 8.6 Cluster 4
8.3 Distributed View
This incremental clustering project runs on multiple instances, which helps achieve
elastic web-scale computing with complete control and flexibility in cloud hosting.
In the example below we show one master instance with two slave instances.
8.4 Time Estimation
Tuples (No. of Rows) In Local Machine (Approx.) In AWS(Approx.)
100 5 Minutes 49 Seconds 5 Minutes 43 Seconds
200 44 Minutes 31 Seconds 38 Minutes 19 Seconds
300 4 hours 47 Minutes 21 Seconds 4 hours 16 Minutes 36 Seconds
400 11 hours 36 Minutes 45 Seconds 11 hours 03 Minutes 17 Seconds
500 1 day 02 hours 32 Minutes 1 day 0 hours 11 Minutes
650 1 day 16 hours 1 day 13 hours
767 2 days 6 hours 2 days 1 hour
9. Conclusion and Future Scope
Our algorithm's ability to detect high-impact parameters in the early stages can
prove to be the key for treatment and prevention. The algorithm is very helpful for
doctors and endocrinologists; an endocrinologist is a doctor who specializes in
treating diabetes.
This shows how incremental clustering can be used to model the actual diagnosis of
diabetes for local and systematic treatment. The algorithm collects and analyses the
medical records of diabetic patients with knowledge discovery techniques to extract
information. We used the CFBA algorithm to find parameters such as the probability,
error and closeness. Using the parameters of the CFBA algorithm we create clusters
according to the closeness values, and to incorporate new data we have implemented
incremental clustering.
After this we created a free-tier instance and deployed the code on AWS (Amazon Web
Services) using PuTTY and WinSCP.
10. References
• Dataset: https://www.kaggle.com/akashkr/pima-indian-diabetes
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
• https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/concepts.html
• http://blog.adeel.io/2016/11/19/installing-pandas-scipy-numpy-and-scikit-learn-on-aws-ec2/
• https://panda.readthedocs.io/en/latest/amazon.html
• https://www.ssh.com/ssh/putty/putty-manuals/0.68/index.html
• https://winscp.net/eng/docs/start