DEPARTMENT OF CS-IT | SYMBIOSIS INSTITUTE OF TECHNOLOGY, PUNE
Project Semester – 8th
January 10 to April 20, 2019
Department of Computer Science Engineering & Information Technology Symbiosis Institute of
Technology, Pune
Final Year Project Report on
Analysis of Diabetes Dataset Using Distributed Incremental Clustering
Algorithm and AWS
Submitted by:
Aditya Maheshwari 15070121503
Avinash Barfa 15070121507
Kunal Gulati 15070121525
Palash Verma 15070121140
Under the Guidance of:
Dr. Preeti Mulay, Professor, CS and IT Department, SIT, Pune.
Mr. Rahul Joshi, Assistant Professor, CS and IT Department, SIT, Pune.
DECLARATION
We hereby declare that the project work entitled “Analysis of Diabetes Dataset Using
Distributed Incremental Clustering Algorithm and AWS” is an authentic record of our work,
carried out as a requirement of the final year project for the award of the B.Tech degree in Computer
Science and Information Technology Engineering at Symbiosis Institute of Technology, Pune,
affiliated to Symbiosis International Deemed University, Pune, under the guidance of Dr. Preeti
Mulay and Mr. Rahul Joshi, from January 2019 to April 2019.
Aditya Maheshwari (15070121503)
Avinash Barfa (15070121507)
Kunal Gulati (15070121525)
Palash Verma (15070121140)
Date: ___________________
Certified that the above statement made by the students is correct to the best of our knowledge
and belief.
Dr. Preeti Mulay, Professor, CS and IT Department, SIT, Pune.
Mr. Rahul Joshi, Assistant Professor, CS and IT Department, SIT, Pune.
ACKNOWLEDGEMENT
The success of this project required a great deal of guidance and assistance from
many people, and we are extremely fortunate to have received it throughout our
project work. Whatever we have achieved is due to that guidance and assistance, and we
sincerely thank everyone who provided it.
First and foremost, we express our gratitude to Symbiosis Institute of
Technology, Pune and the Department of Computer Science and Information Technology for giving
us the opportunity to undertake the B.Tech project work, which helped us learn and gain
valuable experience.
We would like to thank our HOD, Dr. Shraddha Phansalkar, for her valuable guidance,
helpful suggestions and supervision of this work.
We owe our profound gratitude and special thanks to Dr. Preeti Mulay for cooperating with us
and giving us her valuable time and information, and to Prof. Rahul Joshi who, in spite of being
extraordinarily busy with his duties, took time to hear us, guide us and keep us on the correct path
with his untiring assistance, direction, encouragement, continuous support, valuable ideas and
constructive criticism throughout this project work.
Finally, we are grateful to our respected teachers in the Department of Computer Science and
Information Technology, SIT, Pune, and to our family and friends for their help, encouragement and
cooperation during the project work.
Aditya Maheshwari (15070121503)
Avinash Barfa (15070121507)
Kunal Gulati (15070121525)
Palash Verma (15070121140)
List of Figures
Fig 1.1 Clusters
Fig 1.2 Data Flow
Fig 5.1 Entity Relation Diagram
Fig 5.2 Use Case Diagram
Fig 5.3 State Chart Diagram
Fig 5.4 Activity Diagram
Fig 6.1 Project Flow
Fig 6.2 Basic Clustering Algorithm Dev Plan
Fig 6.3 Incremental Clustering Algorithm Dev Plan
Fig 6.4 Project Timeline
Fig 7.1 System Model
Fig 7.2 Process Flow
Fig 7.3 Basic Clustering
Fig 7.4 Incremental Clustering
Fig 7.5 AWS Resource Status
Fig 7.6 AWS Service Health
Fig 7.7 Instance Accessing using Putty
Fig 7.8 WinSCP Instance File Structure
Fig 7.9 Putty Instance File Structure
Fig 7.10 Commands to Run python file
Fig 7.11 Python code Execution 1
Fig 7.12 Python code Execution 2
Fig 7.13 AWS Monitoring Console 1
Fig 7.14 AWS Monitoring Console 2
Fig 7.15 CPU Utilization at starting stage
Fig 7.16 CPU Utilization at intermediate stage
Fig 7.17 Network in Status
Fig 7.18 Network Hour Status
Fig 8.1 Required Parameters
Fig 8.2 Closeness Factor
Fig 8.3 Cluster 1
Fig 8.4 Cluster 2
Fig 8.5 Cluster 3
Fig 8.6 Cluster 4
TABLE OF CONTENTS
1. Introduction
1.1. Introduction to Clustering
1.2. Introduction to Cloud Environment
1.3. Overview on Diabetes
1.4. Data flow diagram showing CFBA’s statistical details
2. Literature Review
3. Technical Requirements
3.1. Tools and software used
3.2. Technologies and languages used
4. Software Requirement Analysis
4.1. Introduction
4.2. External Interface Requirements
4.2.1. User Interface
4.2.2. Hardware Interface
4.2.3. Software Interface
5. High Level Design
5.1. Entity Relationship Diagrams
5.2. UML Diagrams
5.2.1. Use Case Diagram
5.2.2. State Chart Diagram
5.2.3. Activity Diagram
6. Project Plan
6.1. Project Flow
6.2. Development Plan
6.3. Project Timeline
6.4. RMMM Plan
7. Implementation
7.1. System Model
7.2. Process Flow
7.3. Selection of Attributes
7.4. Algorithm
7.5. Code
7.6. Deployment on AWS EC2
7.7. Monitoring Analysis
8. Result and Analysis
8.1 Results
8.2 Analysis
8.3 Distributed View
8.4 Time Estimation
9. Conclusion and Future Scope
10. References
1. Introduction
1.1 Introduction to Clustering
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense) to each other
than to those in other groups (clusters). It is a main task of exploratory data mining and a
common technique for statistical data analysis, used in many fields, including machine
learning, pattern recognition, image analysis, information retrieval, bioinformatics, data
compression, and computer graphics.
Clustering is one of the most widely studied unsupervised learning problems; like
every other problem of this kind, it deals with finding structure in a collection of
unlabeled data.
A cluster is therefore a collection of objects that are “similar” to one another and
“dissimilar” to the objects belonging to other clusters.
Fig 1.1 shows a simple graphical example:
Fig 1.1 Clusters
In this case we can easily identify the 4 clusters into which the data can be divided; the
similarity criterion is distance: two or more objects belong to the same cluster if they are
“close” according to a given distance (in this case geometrical distance). This is called
distance-based clustering.
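As a minimal sketch of distance-based grouping (it is not the algorithm used later in this report), the following Python code assigns 2-D points to the nearest of a set of cluster centres using Euclidean distance. The points, the centres and the function names are hypothetical and chosen only for illustration.

```python
import math

def nearest_centre(point, centres):
    """Return the index of the centre closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centres]
    return distances.index(min(distances))

def assign_clusters(points, centres):
    """Group points by the index of their nearest centre."""
    clusters = {i: [] for i in range(len(centres))}
    for p in points:
        clusters[nearest_centre(p, centres)].append(p)
    return clusters

# Hypothetical data: four well-separated groups, similar in spirit to Fig 1.1.
centres = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(1, 1), (9, 1), (0.5, 9), (11, 10), (-1, 0)]
clusters = assign_clusters(points, centres)
```

Each point ends up in the group whose centre is geometrically closest, which is the "closeness" criterion described above.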
Another kind of clustering is conceptual clustering: two or more objects belong to the
same cluster if the cluster defines a concept common to all of those objects. In other words,
objects are grouped according to their fit to descriptive concepts, not according to simple
similarity measures.
Clustering Methods:
1. Density-Based Methods:
These methods treat clusters as dense regions of the data space, separated from one another by
regions of lower density. They offer good accuracy and the ability to merge two clusters.
Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS
(Ordering Points To Identify the Clustering Structure), etc.
2. Hierarchical Methods:
The clusters formed by these methods make up a tree-type structure based on a hierarchy.
New clusters are formed using the previously formed ones.
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative
Reducing and Clustering using Hierarchies), etc.
3. Partitioning Methods:
These methods partition the objects into k clusters, and each partition forms one
cluster. The partitioning optimizes an objective criterion (a similarity function), for
example one in which distance is the major parameter.
Examples: K-means, CLARANS (Clustering Large Applications based upon RANdomized Search), etc.
4. Grid-Based Methods:
In these methods the data space is divided into a finite number of cells that form a
grid-like structure. Clustering operations performed on these grids are fast and
independent of the number of data objects.
Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
1.2 Introduction to Cloud Environment
Cloud computing refers to manipulating, configuring, and accessing hardware and software
resources remotely. It offers online data storage, infrastructure, and applications.
Cloud computing offers platform independence, as the software does not need to be installed
locally on the PC. Hence, cloud computing makes business applications mobile and
collaborative.
Certain services and models work behind the scenes to make cloud
computing feasible and accessible to end users. The working models for
cloud computing are as follows:
Deployment Models
Deployment models define the type of access to the cloud, i.e., how the cloud is
located. A cloud can have any of four types of access: Public, Private, Hybrid,
and Community.
PUBLIC CLOUD
The public cloud allows systems and services to be easily accessible to the general
public. Public cloud may be less secure because of its openness.
PRIVATE CLOUD
The private cloud allows systems and services to be accessible within an organization. It
is more secure because of its private nature.
COMMUNITY CLOUD
The community cloud allows systems and services to be accessible by a group of
organizations.
HYBRID CLOUD
The hybrid cloud is a mixture of public and private cloud, in which the critical activities
are performed using private cloud while the non-critical activities are performed using
public cloud.
1.3 Overview on Diabetes
Diabetes is a disease that occurs when insulin production in the body is inadequate or
the body is unable to use the insulin it produces properly; as a result, blood glucose
rises. The body breaks food down into glucose, and this glucose
needs to be transported to all the cells of the body. Insulin is the hormone that directs
this glucose into the body's cells.
Any disruption in the production of insulin leads to an increase in blood sugar levels, and
this can lead to tissue damage and organ failure. Generally, a person is
considered to be suffering from diabetes when blood sugar levels are above the normal range (4.4
to 6.1 mmol/L).
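Glucose readings are commonly reported either in mmol/L, as above, or in mg/dL. The short sketch below converts between the two using the molar mass of glucose (about 180.156 g/mol, so 1 mmol/L is roughly 18 mg/dL). The function names are hypothetical, and the snippet is only a unit-conversion aid, not part of the project code.

```python
GLUCOSE_MOLAR_MASS = 180.156  # g/mol

def mmol_per_l_to_mg_per_dl(mmol_l):
    """Convert a glucose concentration from mmol/L to mg/dL."""
    return mmol_l * GLUCOSE_MOLAR_MASS / 10.0

def mg_per_dl_to_mmol_per_l(mg_dl):
    """Convert a glucose concentration from mg/dL to mmol/L."""
    return mg_dl * 10.0 / GLUCOSE_MOLAR_MASS

# The normal range quoted above, expressed in mg/dL.
low = mmol_per_l_to_mg_per_dl(4.4)
high = mmol_per_l_to_mg_per_dl(6.1)
print(f"Normal range: {low:.0f} to {high:.0f} mg/dL")
```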
Types of Diabetes:
The three main types of diabetes are described below:
1. Type 1
Although only about 10% of diabetes patients have this form of diabetes,
there has recently been a rise in the number of cases of this type in the United
States. The disease manifests as an autoimmune disease, typically at a young
age (below 20 years), and is hence also called juvenile-onset diabetes.
2. Type 2
This type accounts for almost 90% of diabetes cases and is commonly called
adult-onset diabetes or non-insulin-dependent diabetes. In this case the various
organs of the body become insulin resistant, which increases the demand for
insulin. At this point, the pancreas does not make the required amount of insulin. To
keep this type of diabetes at bay, patients have to follow a strict diet and exercise
routine and keep track of their blood glucose.
3. Gestational diabetes
A type of diabetes that tends to occur in pregnant women, in whom blood sugar
levels rise because the pancreas does not produce a sufficient amount of insulin. Left
untreated, it can lead to complications during childbirth. Controlling the diet and
taking insulin can control this form of diabetes.
Symptoms, Diagnosis and Treatment:
The common symptoms of a person suffering from diabetes are:
· Polyuria (frequent urination)
· Polyphagia (excessive hunger)
· Polydipsia (excessive thirst)
· Unexplained weight gain or weight loss
· Slow healing of wounds, blurred vision, fatigue, itchy skin, etc.
Urine and blood tests are conducted to detect diabetes by checking for excess
glucose. The tests commonly used to determine whether a person has diabetes
are:
· A1C Test
· Fasting Plasma Glucose (FPG) Test
· Oral Glucose Tolerance Test (OGTT).
1.4 Data flow diagram showing CFBA’s statistical details
Fig 1.2 Data Flow
2. Literature Review
The design of predictive models for diabetes diagnosis has been an active research area
for the past decade. Most of the models found in the literature are based on different
clustering algorithms and different modeling techniques.
[1] Preeti Mulay (2016) ‘Threshold Computation to Discover Cluster Structure, a New
Approach’, International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 1, pp. 275~282.
With the surge of data in almost all domains, it is essential to have modern data
exploration methods such as incremental clustering, cluster analysis and incremental learning,
to name a few. These methods are useful in varied applications that must
handle a consistent influx of new data and perform forecasting, decision making and
prediction. The purpose of this research paper is to broaden the abilities of the “Incremental
clustering using Naïve Bayes and Closeness-Factor” (ICNBCF) [6] algorithm and to
introduce a set of activities in the post-clustering phase. These activities include validating
cluster structures. The modifications improved the resulting cluster
structures.
[2] Preeti Mulay and Kulkarni, P.A. (2013) ‘Knowledge augmentation via
incremental clustering: new technology for effective knowledge management’, Int. J.
Business Information Systems, Vol. 12, No. 1, pp.68–87.
This research paper uses a new statistical, error-based incremental clustering algorithm,
CFBA. It discusses knowledge augmentation and incremental learning using
various datasets, and gives a computational learning theoretic perspective on
incremental learning. The aim of the research was to look at incremental
clustering in a new and different way. Software project development is carried out in
various phases, and each phase has planned and actual data details. The alphanumeric
attributes can be converted into a completely numeric dataset by applying a weight-assignment
algorithm. This converted numeric dataset is given as input to the proposed new,
statistical, error-based, almost parameter-free algorithm named ‘closeness factor-based
algorithm’ (CFBA).
3. Technical Requirements
3.1 Tools and Software Used
I. SPYDER
• Spyder is a powerful scientific environment written in
Python, for Python, and designed by and for scientists,
engineers and data analysts.
• It offers a unique combination of the advanced editing,
analysis, debugging, and profiling functionality of a
comprehensive development tool with the data
exploration, interactive execution, deep inspection, and beautiful
visualization capabilities of a scientific package.
• Beyond its many built-in features, its abilities can be extended even
further via its plugin system and API. Furthermore, Spyder can also be
used as a PyQt5 extension library, allowing developers to build upon its
functionality and embed its components, such as the interactive console, in
their own PyQt software.
II. PUTTY
• PuTTY is a client program for the SSH, Telnet and
Rlogin network protocols.
• These protocols are all used to run a remote session on
a computer, over a network. PuTTY implements the
client end of that session: the end at which the session is displayed, rather
than the end at which it runs.
• PuTTY has been ported to various other operating systems. Official ports
are available for some Unix-like platforms, with work-in-progress ports to
Classic Mac OS and macOS, and unofficial ports have been contributed to
platforms such as Symbian, Windows Mobile and Windows Phone.
III. WinSCP
• WinSCP is a free file transfer tool for Windows
that supports FTP, SFTP and SCP. It provides a
Windows Explorer style interface that lets you
drag and drop files or folders between local and
remote locations.
• Its main function is secure file transfer between a local and a remote
computer. Beyond this, WinSCP offers basic file manager and file
synchronization functionality.
• For secure transfers, it uses Secure Shell (SSH) and supports the SCP
protocol in addition to SFTP.
IV. Microsoft Excel
• Microsoft Excel provides a grid interface to organize
nearly any type of information.
• The learning curve for Excel is short, so it is easy to use Excel and be
productive right away. IT staff rarely need to create spreadsheets, because
information workers can do so for themselves.
• Excel makes it easy to store data, perform numerical calculations, format
cells, and adjust layouts to generate the output and reports to share with
others. Advanced features such as: subtotals, power pivot tables and pivot
charts, analysis toolkit, and many templates make it easy to accomplish a
wide range of tasks.
• It can even integrate with the Analytic Services (Business Intelligence)
from SQL Server. Tweaking the results is also very easy to get the exact
layout, fonts, colors etc. that you want.
3.2 Technologies and Languages Used
I. PYTHON
• Python is a full-fledged all-round language. It's an
interpreted, interactive, object-oriented, extensible
programming language.
• It has efficient high-level data structures and a
simple but effective approach to object-oriented
programming. Python’s elegant syntax and dynamic
typing, together with its interpreted nature, make it
an ideal language for scripting and rapid application development in many
areas on most platforms.
• Python supports multiple programming paradigms, including object-
oriented, imperative and functional programming or procedural styles. It
features a dynamic type system and automatic memory management and
has a large and comprehensive standard library.
• Python is widely used in Artificial Intelligence, Natural Language
Generation, Neural Networks and other advanced fields of Computer
Science. Python places a strong focus on code readability.
II. AMAZON WEB SERVICES
Amazon Web Services (AWS) is a subsidiary of Amazon that provides on-demand cloud computing
platforms to individuals, companies and governments on a metered pay-as-you-go basis. In
aggregate, these cloud computing web services provide a set of primitive, abstract technical
infrastructure and distributed computing building blocks and tools. One of these services is
Amazon Elastic Compute Cloud, which gives users a virtual cluster of computers, available
all the time, through the Internet. AWS's virtual computers emulate most of the attributes of a
real computer, including hardware (CPU(s) and GPU(s) for processing,
local/RAM memory, hard-disk/SSD storage); a choice of operating
systems; networking; and pre-loaded application software such as web
servers, databases, CRM, etc.
III. AMAZON ELASTIC COMPUTE CLOUD
Amazon Elastic Compute Cloud
(Amazon EC2) is a web service that
provides secure, resizable compute
capacity in the cloud. It is designed to
make web-scale cloud computing easier
for developers.
Amazon EC2’s simple web service
interface allows you to obtain and
configure capacity with minimal friction. It provides you with complete
control of your computing resources and lets you run on Amazon’s proven
computing environment. Amazon EC2 reduces the time required to obtain
and boot new server instances to minutes, allowing you to quickly scale
capacity, both up and down, as your computing requirements change.
Amazon EC2 changes the economics of computing by allowing you to
pay only for capacity that you actually use. Amazon EC2 provides
developers the tools to build failure resilient applications and isolate them
from common failure scenarios.
IV. CLOUDWATCH
• Amazon CloudWatch is a monitoring and management service built for
developers, system operators, and IT managers.
• It monitors your Amazon Web Services (AWS) resources and the
applications you run on AWS in real time.
• It can be used to collect and track metrics, collect and monitor log files, set alarms, and
automatically react to changes in your AWS resources.
• It can also be used to set high-resolution alarms, take automated actions, troubleshoot
issues, and discover insights to optimize your applications and ensure they
are running smoothly.
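As an illustration of collecting a metric programmatically, the sketch below builds a query for the average and maximum CPU utilization of one EC2 instance and passes it to the boto3 CloudWatch client's `get_metric_statistics` call. This is an assumption-laden example rather than the procedure followed in this project: the instance ID and region are placeholders, the helper names are hypothetical, and running the fetch requires boto3 and configured AWS credentials.

```python
from datetime import datetime, timedelta, timezone

def build_cpu_query(instance_id, hours=1, period_seconds=300):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,          # one datapoint per 5 minutes
        "Statistics": ["Average", "Maximum"],
    }

def fetch_cpu_utilization(instance_id, region_name="us-east-1"):
    """Query CloudWatch; needs boto3 and configured AWS credentials."""
    import boto3  # imported lazily so build_cpu_query works without boto3
    client = boto3.client("cloudwatch", region_name=region_name)
    response = client.get_metric_statistics(**build_cpu_query(instance_id))
    return sorted(response["Datapoints"], key=lambda d: d["Timestamp"])

# Placeholder instance ID; replace with a real one before fetching.
query = build_cpu_query("i-0123456789abcdef0")
```

Keeping the query construction separate from the network call makes it possible to inspect the request without touching an AWS account.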
4. Software Requirement and Analysis
4.1 Introduction
This part of the document is a comprehensive description of the intended purpose of the
research project. It describes what the “Closeness Factor Based Algorithm”
(CFBA) does when applied to a large, consistent set of data and how it helps generate
clusters (using the basic clustering and incremental clustering approaches), which upon further
analysis help in finding hidden relationships and patterns in the data. This document also
lists the requirements that are necessary for the research project.
It also describes how large data can be mined using data mining techniques such as
clustering, and includes a set of use cases that describe how raw data is transformed into
useful information.
4.2 External Interface Requirements
4.2.1 User Interfaces
This is a research project with the objective of analyzing the generated datasets
by using a clustering-based data mining algorithm to create clusters and to find hidden
relations and patterns among the various parameters of a diabetic patient. Therefore, at
this stage no particular user interface has been defined.
4.2.2 Hardware Interfaces
4.2.3 Software Interfaces
Closeness Factor Based Algorithm (CFBA):
• Initial phase - A completely new dataset forms the basic clusters.
• Incremental phase - The closeness value of new data is compared with the already formed
clusters (centre of each cluster), and the data is appended to the matching
cluster.
• Final phase - The updated set of clusters is made available to the analyst for
taking decisions based on the augmented knowledge.
For implementing the Closeness Factor Based Algorithm (CFBA), the PuTTY software
is required with:
• Version 0.71 or later
• Build platform: 64-bit x86 Windows
5. High Level Design
5.1 Entity Relationship Diagram
Fig 5.1 Entity Relationship Diagram
5.2 UML Diagrams
5.2.1 Use Case Diagram
Fig 5.2 Use Case Diagram
5.2.2 State Chart Diagram
Fig 5.3 State Chart Diagram
5.2.3 Activity Diagram
Fig 5.4 Activity Diagram
6. Project Plan
6.1 Project Flow
Fig 6.1 Project Flow
6.2 Development Plan
Fig 6.2 Basic Clustering Algorithm Development Plan
Fig 6.3 Incremental Clustering Algorithm Development Plan
6.3 Project Timeline
Fig 6.4 Project Timeline
6.4 RMMM Plan
The goal of the risk mitigation, monitoring and management plan is to identify
as many potential risks as possible.
Risk Description
• Project Risks: Potential schedule, personnel, resource, and
requirements problems and their impact on the project. These risks threaten the project
plan: if project risks become real, it is likely that the project schedule will
slip and that costs will increase.
• Technical Risks: Potential design, implementation, interface,
verification, and maintenance problems. Technical risks threaten the quality
and timeliness of the software to be produced. If a technical risk becomes a
reality, implementation may become difficult or impossible.
• Network Risks: Network failure and other network-related
risks.
• Support Risks: The degree of uncertainty about whether the resultant software will be
easy to correct, adapt, and enhance.
7. Implementation
7.1 System Model
Fig 7.1 System Model
7.2 Process Flow
The model shown below is a modification of the general waterfall process model and
presents the basic architectural flow of our research project. The model contains five
stages: topic exploration; data collection and exploration; data cleansing;
exploratory data mining; and, lastly, integration and verification. Each of these stages is
detailed below.
Fig 7.2 Process Flow
Step-1: Topic exploration
· Subject and topic selection: Selecting the subject for research and briefly
justifying its selection.
· Survey of available literature: Checking the availability and legitimacy of the data
and facts that need to be mined for information generation.
· Review of background information: Using research guides and websites on
the subject of research to gather the information needed to move ahead in the right
direction.
· Refinement of the research matter and development of an initial hypothesis: Creating a detailed question
from the information gathered on the selected topic in order to develop an initial
hypothesis, which will be answered by this research project.
Step-2: Data collection and exploration of focused information
· Focused information gathering: Collecting information in a more focused manner based on the research problem. This
involves understanding and uncovering hidden relations and important aspects in order to
determine the data parameters needed to create the data sets that will be mined for
information in the later stages.
· Collection of data sets: Collecting data from professional
clinics and hospitals, as well as from data warehouses available online, to generate data sets
for the stated objective.
Step-3: Extraction and cleansing of data for analysis
• This stage involves checking the consistency and validity of the gathered data by
analyzing it and removing records that show inconsistencies such as
missing or invalid data parameters.
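The cleansing stage can be sketched as follows. This is a minimal illustration, assuming the input is a CSV file whose columns are the eight attributes listed in section 7.3, and treating a value as invalid when it is empty, non-numeric, NaN or negative. These validity rules, the function names and the sample rows are assumptions made for the example, not the exact checks used in the project.

```python
import csv
import io

# Column names taken from the attribute table in section 7.3.
ATTRIBUTES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
              "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

def parse_row(row):
    """Return the row as a list of floats, or None if any value is missing or invalid."""
    values = []
    for name in ATTRIBUTES:
        raw = (row.get(name) or "").strip()
        try:
            value = float(raw)
        except ValueError:
            return None                      # missing or non-numeric value
        if value != value or value < 0:      # NaN or negative value
            return None
        values.append(value)
    return values

def clean(csv_text):
    """Split CSV text into valid numeric rows and a count of dropped rows."""
    kept, dropped = [], 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        parsed = parse_row(row)
        if parsed is None:
            dropped += 1
        else:
            kept.append(parsed)
    return kept, dropped

# Illustrative rows only: the second has a missing value, the third an invalid one.
sample = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
6,148,72,35,0,33.6,0.627,50
1,,66,29,0,26.6,0.351,31
8,183,abc,0,0,23.3,0.672,32
1,89,66,23,94,28.1,0.167,21
"""
kept, dropped = clean(sample)
```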
Step-4: Exploratory data mining
• This step involves finding or developing data mining algorithms that will help
confirm the hypothesis stated for the research project. Data mining
tools are used here to apply or build the algorithms to be run on the collected data
sets.
Step-5: Integration and verification
• In this step the data mining algorithm is applied to the collected and
pre-processed data sets to generate useful information, which helps confirm the
initial hypothesis stated for the research project and provides a basis for discovering
hidden essential relationships among the parameters used.
7.3 Selection of Attributes
CFBA and the incremental algorithm were applied to the diabetes patient datasets, taking
into account 8 attributes. These attributes were chosen because they commonly occur in
every diabetic patient’s report and because of their relative importance with respect to the factors
triggering the disease.
The input attributes used as parameters in the study are:
Sr. No  Attribute                 Description
1.      Pregnancies               Number of times pregnant
2.      Glucose                   Plasma glucose concentration 2 hours in an oral glucose tolerance test
3.      BloodPressure             Diastolic blood pressure (mm Hg)
4.      SkinThickness             Triceps skin fold thickness (mm)
5.      Insulin                   2-Hour serum insulin (mu U/ml)
6.      BMI                       Body mass index (weight in kg / (height in m)^2)
7.      DiabetesPedigreeFunction  Diabetes pedigree function
8.      Age                       Age (years)
7.4 Algorithm
The present work implements an incremental clustering algorithm in order to
analyze its results. This algorithm is a modified version of CFBA (Closeness Factor Based
Algorithm), which gives new findings by grouping patients with similar diabetic
parameters, thus making the doctor’s task of providing treatment easier.
The algorithm is implemented in Python using the Spyder IDE (version 3.2.8) from the Anaconda
distribution, and is run through PuTTY (version 0.71) after AWS deployment.
7.4.1 Basic clustering Algorithm:
1. Enter the name of dataset.csv file to be imported.
2. For all the values in dataset
a) Calculate the row total
b) Calculate the column total
3. Calculate the summation of row totals and store it in a variable, for example sum
4. For all the tuples / rows
a) Calculate Probability using the formula
Probability = (row_total) / (row_total + next_row_total)
b) Calculate Error using the probability (calculated in 4(a)) and the summation of
row totals (calculated in 3)
error = (probability * sum) / sqrt(sum * probability * (1 - probability))
c) Calculate Weight using the row total (calculated in 2(a))
weight = sqrt(row_total)
5. To calculate the Closeness Factor, each tuple is paired with each of the remaining tuples through
looping, and the Closeness Factor for each pair is calculated using the operations below
a) Calculate the summation of both tuples for each parameter
b) Calculate the Error for each parameter
ex1 = probability of primary tuple * parameter summation (calculated in 5(a))
ex2 = primary tuple parametric value
ex3 = probability of primary tuple * parameter summation (calculated in 5(a))
ex4 = 1 - probability of primary tuple
Error = (ex1 - ex2) / sqrt(ex3 * ex4)
c) Calculate Error square using Error (calculated in 5(b))
d) Calculate Weight using the parameter summation (calculated in 5(a))
Weight = sqrt(parameter_summation)
e) Calculate the product of Error square (calculated in 5(c)) and
Weight (calculated in 5(d))
Mul = Error_square * Weight
f) Calculate the Closeness Factor
Closeness Factor = (summation of Mul (calculated in 5(e)) over
all parameters) / (summation of Weight
(calculated in 5(d)) over all parameters)
Repeat all the operations in step 5 for all the tuples in the data set.
6. Attach an index to each Closeness Factor value in order to identify the pair of
tuples for which the Closeness Factor was calculated.
7. After indexing, sort the Closeness Factor values in ascending order.
8. Determine the number of clusters and the range of each cluster explicitly.
9. Compare the sorted Closeness Factor values with the ranges and store the values, with
their indices, in the cluster for the matching range.
10. Write all the above data to an Excel file or a CSV file, or to the same dataset.csv file
in different sheets.
11. End
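Steps 5 to 9 above can be sketched in Python as follows. This is an illustrative reading of the steps, not the code from the project repository. It assumes that the probability for a pair is computed from the row totals of the two tuples being compared, that parameters whose error denominator is zero are skipped, and that cluster ranges are given as a list of upper bounds. The function names, sample tuples and ranges are all hypothetical.

```python
import math
from itertools import combinations

def closeness_factor(primary, other):
    """Closeness factor of two numeric tuples, following step 5."""
    total_p, total_o = sum(primary), sum(other)
    if total_p + total_o == 0:
        return 0.0
    p = total_p / (total_p + total_o)        # assumption: pairwise probability
    mul_sum, weight_sum = 0.0, 0.0
    for x_p, x_o in zip(primary, other):
        s = x_p + x_o                        # 5(a) parameter summation
        denom_sq = (p * s) * (1 - p)         # ex3 * ex4
        if denom_sq <= 0:
            continue                         # assumption: skip undefined terms
        error = (p * s - x_p) / math.sqrt(denom_sq)   # 5(b)
        weight = math.sqrt(s)                # 5(d)
        mul_sum += error ** 2 * weight       # 5(c) and 5(e)
        weight_sum += weight
    return mul_sum / weight_sum if weight_sum else 0.0   # 5(f)

def pairwise_closeness(rows):
    """Steps 5 to 7: (closeness factor, i, j) for every pair, sorted ascending."""
    results = [(closeness_factor(rows[i], rows[j]), i, j)
               for i, j in combinations(range(len(rows)), 2)]
    return sorted(results)

def assign_to_ranges(results, upper_bounds):
    """Steps 8 and 9: put each pair in the first range whose upper bound it fits."""
    clusters = [[] for _ in upper_bounds]
    for cf, i, j in results:
        for k, bound in enumerate(upper_bounds):
            if cf <= bound:
                clusters[k].append((i, j, cf))
                break
    return clusters

rows = [(1, 2, 3), (2, 4, 6), (10, 1, 1)]                 # hypothetical tuples
results = pairwise_closeness(rows)
clusters = assign_to_ranges(results, [0.05, math.inf])    # hypothetical ranges
```

With these formulas the squared error is the same whichever tuple of a pair is treated as primary, so each unordered pair is evaluated only once. Tuples whose parameters are in the same proportion, such as the first two sample rows, yield a closeness factor of approximately zero.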
To process an additional dataset, the incremental algorithm is used, which is an
extended version of the basic clustering algorithm.
7.4.2 Incremental clustering Algorithm:
1. Enter the name of Additional_dataset.csv file to be imported.
2. Perform steps 1 to 7 of the basic clustering algorithm described above.
3. After performing the above operations, instead of explicitly defining new cluster
ranges, compare the new Closeness Factor values for the additional dataset with the
predefined ranges and append the values to the matching clusters. If a value
does not fit any of the ranges, a new cluster is created and the value is stored
in that cluster.
4. Display the above data in the same Excel file or CSV file.
5. End
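Step 3 of the incremental algorithm can be illustrated with the sketch below. It is only an assumption of how the append-or-create logic might look: the function name, the range values and the sample data are hypothetical, and since the text does not say how a new cluster's range is defined, all unmatched values are simply collected in one new cluster.

```python
def assign_incrementally(clusters, ranges, new_values):
    """Append each (index, closeness) pair to the first predefined range it fits.

    clusters: dict mapping cluster name -> list of (index, value) pairs
    ranges:   dict mapping cluster name -> (low, high), matching low <= value < high
    """
    for index, value in new_values:
        for name, (low, high) in ranges.items():
            if low <= value < high:
                clusters.setdefault(name, []).append((index, value))
                break
        else:
            # No predefined range fits: create a new cluster.
            # Assumption: every unmatched value goes into the same new cluster.
            new_name = f"Cluster {len(ranges) + 1}"
            clusters.setdefault(new_name, []).append((index, value))
    return clusters


# Hypothetical ranges and data, for illustration only.
ranges = {"Cluster 1": (0.0, 0.05), "Cluster 2": (0.05, 0.10)}
clusters = {"Cluster 1": [((0, 1), 0.01)]}
additional = [((5, 6), 0.07), ((5, 7), 0.42)]
clusters = assign_incrementally(clusters, ranges, additional)
print(clusters)
```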
7.5 Code
In accordance with the algorithm described above, we have implemented the code in
Python. It is available in the Git repository:
https://github.com/avinashbarfa/Increamental-Clustering
Fig 7.3 Basic Clustering
Fig 7.4 Incremental Clustering
• Input File Format – CSV (Comma Separated Values)
• Output File Format – XLSX (Excel Workbook)
o ClusterData.xlsx – Contains Clusters
o Basic_Diabetes_data.xlsx – Contains parameters calculated for clustering.
o Incremenatl_Diabetes_data.xlsx – Contains parameters calculated for incremental
clustering.
7.6 Deployment on AWS EC2
At this stage we achieve the project goal. After deploying and testing the code on the
local machine, we deployed it on an Amazon EC2 instance.
We selected the following Amazon Machine Image (AMI) and EC2 configuration:
Operating System – Ubuntu 18.04 LTS (64-bit)
Volume Type – SSD
The status of the AWS resources used in the project is shown below.
Fig 7.5 AWS Resource Status
After launching the instance, we use PuTTY to access it through its floating/public IP
address and the security key provided by Amazon when the instance was created.
Fig 7.6 AWS Service Health
Fig 7.7 Accessing the Instance Using PuTTY
Now we have to configure the instance to run our algorithm, i.e. we have to install python3 and
its libraries NumPy, pandas and an Excel writer (openpyxl). The following commands are to be
run on the instance in sequence.
sudo apt-get update
sudo apt install python3-pip
sudo apt install python3-numpy
sudo apt install python3-pandas
sudo apt install python3-openpyxl
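One way to confirm that the installed packages are visible to python3 is a short check such as the one below. This is an optional sketch rather than part of the project code; the function name is hypothetical and the module list simply mirrors the packages installed above.

```python
import importlib.util

# Module names corresponding to the apt packages installed above.
REQUIRED = ("numpy", "pandas", "openpyxl")


def missing_modules(names=REQUIRED):
    """Return the module names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates the modules without importing them, so it runs quickly even on a small instance.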
Our instance is now configured as required. Next we upload our Python algorithm
to the EC2 machine using WinSCP.
We upload the Python file and the data sheet (in CSV format) to a newly created
directory, ProjectWork, on the machine.
Fig 7.8 WinSCP Instance File Structure
Fig 7.9 PuTTY Instance File Structure
After entering the ProjectWork directory in PuTTY, we type the command
python3 pythonfile to execute the algorithm.
Fig 7.10 Command to Run Python File
After pressing Enter, the code begins to execute on the Amazon EC2 instance. The following
screenshots were taken while the code executed.
Fig 7.11 Python Code Execution -1
Fig 7.12 Python Code Execution -2
7.7 Monitoring Analysis
Amazon provides its monitoring service, CloudWatch, which enables the user to monitor the
instance while it is in the active state.
CloudWatch can be accessed through the AWS console page.
Fig 7.13 AWS Monitoring Console -1
Fig 7.14 AWS Monitoring Console -2
Below are the screenshots of CPU utilization and network traffic taken using CloudWatch
while our code executed.
Fig 7.15 CPU Utilization at Starting Stage
Fig 7.16 CPU Utilization at Intermediate Stage
Fig 7.17 Network In Status
Fig 7.18 Network Out Status
8. Result & Analysis
8.1 Results
This section shows the results obtained after executing the CFBA algorithm on
the PIMA diabetes dataset, which was obtained from Kaggle and used in the project.
The dataset consists of 8 attributes and 768 instances. All patients belong to the
age group of 20-70 years and are of Pima Indian heritage.
The study found relations between various parameters and carried out cluster
analysis, in which we identify which parameters are the underlying causes for a
data series to be an outlier.
The observations are as follows –
1) Finding different parameters, for CFBA Analysis:
After reading the CSV file, we have calculated the row total, column total and
further, the values of parameters – Probability, Error and Weight.
Fig 8.1 Required Parameters
2) Using CFBA, finding the CF values:
Further, for the CFBA algorithm, we have calculated the parameters – Error,
Weight and their product. This is used to calculate the Closeness Factors for the
data series.
Fig 8.2 Closeness Factor
8.2 Analysis
The aim of our project can be simply broken down into two primary objectives:
1. Organizing the closeness values according to specific ranges.
2. Generating clusters, according to the above ranges.
We have analyzed two datasets, PIMA Diabetes and WINE, with this distributed
incremental clustering algorithm; additional datasets can also be processed and
analyzed. The primary dataset taken into consideration was the PIMA Diabetes
dataset.
1. Organizing the closeness values according to specific ranges:
Here, we have defined custom ranges for the closeness values –
1) Cluster 1 – From 0.0001 to 0.066151499
2) Cluster 2 – From 0.066151499 to 0.106095987
3) Cluster 3 – From 0.106095987 to 0.156869097
4) Cluster 4 – Above 0.156869097
2. Generating clusters, according to the above ranges:
Fig 8.3 Cluster 1
Fig 8.4 Cluster 2
Fig 8.5 Cluster 3
Fig 8.6 Cluster 4
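The mapping from a closeness value to one of the four ranges listed above can be sketched with a binary search over the range boundaries. This is an illustrative sketch only: the function names are hypothetical, a value exactly equal to a boundary is assumed to fall into the higher cluster, and values below 0.0001 are assumed to fall into Cluster 1.

```python
from bisect import bisect_right

# Upper bounds of Clusters 1 to 3, taken from the ranges defined in this section.
BOUNDS = [0.066151499, 0.106095987, 0.156869097]


def cluster_of(closeness):
    """Return the cluster number (1 to 4) for a closeness factor value."""
    # Assumption: a value equal to a boundary belongs to the higher cluster.
    return bisect_right(BOUNDS, closeness) + 1


def build_clusters(indexed_values):
    """Group (index, closeness) pairs into the four clusters."""
    clusters = {1: [], 2: [], 3: [], 4: []}
    for index, value in indexed_values:
        clusters[cluster_of(value)].append((index, value))
    return clusters


# Hypothetical closeness values, for illustration only.
example = [((0, 1), 0.05), ((0, 2), 0.07), ((1, 2), 0.12), ((1, 3), 0.20)]
print(build_clusters(example))
```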
8.3 Distributed View
This incremental clustering project runs on multiple instances, which helps
in achieving elastic web-scale computing, complete control and flexibility in cloud
hosting. In the example below we show one master instance with two slave
instances.
8.4 Time Estimation
Tuples (No. of Rows)    Local Machine (Approx.)          AWS (Approx.)
100                     5 minutes 49 seconds             5 minutes 43 seconds
200                     44 minutes 31 seconds            38 minutes 19 seconds
300                     4 hours 47 minutes 21 seconds    4 hours 16 minutes 36 seconds
400                     11 hours 36 minutes 45 seconds   11 hours 03 minutes 17 seconds
500                     1 day 02 hours 32 minutes        1 day 0 hours 11 minutes
650                     1 day 16 hours                   1 day 13 hours
767                     2 days 6 hours                   2 days 1 hour
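To put the table in context, the short sketch below converts the first four rows to seconds and prints, for each dataset size, the number of tuple pairs and the share of local run time saved on AWS. It assumes that every pair of tuples is compared, as the pairwise looping in the algorithm suggests; the helper name and the choice of rows are illustrative.

```python
def pair_count(n):
    """Number of unordered tuple pairs among n rows."""
    return n * (n - 1) // 2


# (rows, local seconds, AWS seconds), converted from the first four table rows.
TIMINGS = [
    (100, 5 * 60 + 49, 5 * 60 + 43),
    (200, 44 * 60 + 31, 38 * 60 + 19),
    (300, 4 * 3600 + 47 * 60 + 21, 4 * 3600 + 16 * 60 + 36),
    (400, 11 * 3600 + 36 * 60 + 45, 11 * 3600 + 3 * 60 + 17),
]

for rows, local_s, aws_s in TIMINGS:
    saving = 100 * (local_s - aws_s) / local_s
    print(f"{rows} rows: {pair_count(rows)} pairs, "
          f"local {local_s} s, AWS {aws_s} s, saving {saving:.1f}%")
```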
9. Conclusion and Future Scope
Our algorithm's ability to detect high-impact parameters at an early stage can
prove to be key for treatment and prevention. The algorithm can be helpful for
doctors and endocrinologists (doctors who specialize in hormone-related conditions
such as diabetes).
This work shows how incremental clustering can be used to model the actual diagnosis of
diabetes for local and systematic treatment. The algorithm collects and analyses the medical
records of diabetic patients with knowledge discovery techniques to extract information. We
used the CFBA algorithm to find various parameters such as the probability, error and closeness.
Using the parameters of the CFBA algorithm, we create clusters according to the
closeness value, and to incorporate new data we have implemented incremental clustering.
After this we created a free-tier instance and, using PuTTY and WinSCP, deployed the code
on AWS (Amazon Web Services).
10. References
• Dataset: https://www.kaggle.com/akashkr/pima-indian-diabetes
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
• https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/concepts.html
• http://blog.adeel.io/2016/11/19/installing-pandas-scipy-numpy-and-scikit-learn-on-aws-ec2/
• https://panda.readthedocs.io/en/latest/amazon.html
• https://www.ssh.com/ssh/putty/putty-manuals/0.68/index.html
• https://winscp.net/eng/docs/start

 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 

Incremental Data Clustering

  • 1. DEPARTMENT OF CS-IT | SYMBIOSIS INSTITUTE OF TECHNOLOGY, PUNE Project Semester – 8th, January 10 to April 20, 2019. Department of Computer Science Engineering & Information Technology, Symbiosis Institute of Technology, Pune. Final Year Project Report on Analysis of Diabetes Dataset Using Distributed Incremental Clustering Algorithm and AWS. Submitted by: Aditya Maheshwari (15070121503), Avinash Barfa (15070121507), Kunal Gulati (15070121525), Palash Verma (15070121140). Under the Guidance of: Dr. Preeti Mulay (Professor, CS and IT Department, SIT, Pune) and Mr. Rahul Joshi (Assistant Professor, CS and IT Department, SIT, Pune).
  • 2. DECLARATION We hereby declare that the project work entitled “Analysis of Diabetes Dataset Using Distributed Incremental Clustering Algorithm and AWS” is an authentic record of our work, carried out as a requirement of the final year project for the award of the B.Tech degree in Computer Science and Information Technology Engineering at Symbiosis Institute of Technology, Pune, affiliated to Symbiosis International Deemed University, Pune, under the guidance of Dr. Preeti Mulay and Mr. Rahul Joshi, during January 2019 to April 2019. Aditya Maheshwari (15070121503), Avinash Barfa (15070121507), Kunal Gulati (15070121525), Palash Verma (15070121140). Date: ___________________ Certified that the above statement made by the students is correct to the best of our knowledge and belief. Dr. Preeti Mulay (Professor, CS and IT Department, SIT, Pune) and Mr. Rahul Joshi (Assistant Professor, CS and IT Department, SIT, Pune).
  • 3. ACKNOWLEDGEMENT The success and final outcome of this project required a lot of guidance and assistance from many people, and we are extremely fortunate to have received this throughout our project work. Whatever we have done is only due to such guidance and assistance, and we would not forget to thank them. First and foremost, we express our thankfulness and praise to Symbiosis Institute of Technology, Pune and the Department of Computer Science and Information Technology for giving us this wonderful opportunity to undertake the B.Tech Project Work, helping us to learn and gain valuable experience. We would like to thank our HOD, Dr. Shraddha Phansalkar, for her valuable guidance, helpful suggestions and supervision of this work. We owe our profound gratitude and special thanks to Dr. Preeti Mulay for cooperating with us and giving us her valuable time and information, and to Prof. Rahul Joshi who, in spite of being extraordinarily busy with his duties, took time out to hear, guide and keep us on the correct path through his untiring assistance, direction, encouragement, continuous support, valuable ideas and constructive criticism throughout this project work. Lastly, we are grateful to our respected teachers of the Department of Computer Science and Information Technology, SIT, Pune, and to our family and friends for their help, encouragement and cooperation during the project work. Aditya Maheshwari (15070121503), Avinash Barfa (15070121507), Kunal Gulati (15070121525), Palash Verma (15070121140).
  • 4. List of Figures: Fig 1.1 Clusters; Fig 1.2 Data Flow; Fig 5.1 Entity Relation Diagram; Fig 5.2 Use Case Diagram; Fig 5.3 State Chart Diagram; Fig 5.4 Activity Diagram; Fig 6.1 Project Flow; Fig 6.2 Basic Clustering Algorithm Dev Plan; Fig 6.3 Incremental Clustering Algorithm Dev Plan; Fig 6.4 Project Timeline; Fig 7.1 System Model; Fig 7.2 Process Flow; Fig 7.3 Basic Clustering; Fig 7.4 Incremental Clustering; Fig 7.5 AMS Resource Status; Fig 7.6 AMS Service Help; Fig 7.7 Instance Accessing using Putty; Fig 7.8 WinSCP Instance File Structure; Fig 7.9 Putty Instance File Structure; Fig 7.10 Commands to Run python file; Fig 7.11 Python code Execution 1; Fig 7.12 Python code Execution 2; Fig 7.13 AWS Monitoring Console 1; Fig 7.14 AWS Monitoring Console 2; Fig 7.15 CPU Utilization at starting stage; Fig 7.16 CPU Utilization at intermediate stage; Fig 7.17 Network in Status; Fig 7.18 Network Hour Status; Fig 8.1 Required Parameters; Fig 8.2 Closeness Factor; Fig 8.3 Cluster 1; Fig 8.4 Cluster 2; Fig 8.5 Cluster 3; Fig 8.6 Cluster 4
  • 5. TABLE OF CONTENTS
    1. Introduction (p. 7): 1.1. Introduction to Clustering; 1.2. Introduction to Cloud Environment; 1.3. Overview on Diabetes; 1.4. Data flow diagram showing CFBA’s statistical details
    2. Literature Review (p. 13)
    3. Technical Requirements (p. 14): 3.1. Tools and software used; 3.2. Technologies and languages used
    4. Software Requirement Analysis (p. 18): 4.1. Introduction; 4.2. External Interface Requirements (4.2.1. User Interface; 4.2.2. Hardware Interface; 4.2.3. Software Interface)
    5. High Level Design (p. 20): 5.1. Entity Relationship Diagrams; 5.2. UML Diagrams (5.2.1. Use Case Diagram; 5.2.2. State Chart Diagram; 5.2.3. Activity Diagram)
    6. Project Plan (p. 24): 6.1. Project Flow; 6.2. Development Plan; 6.3. Project Timeline; 6.4. RMMM Plan
  • 6. 7. Implementation (p. 28): 7.1. System Model; 7.2. Process Flow; 7.3. Selection of Attributes; 7.4. Algorithm; 7.5. Code; 7.6. Deployment on AWS EC2; 7.7. Monitoring Analysis
    8. Result and Analysis (p. 44): 8.1. Results; 8.2. Analysis; 8.3. Distributed View; 8.4. Time Estimation
    9. Conclusion and Future Scope (p. 50)
    10. References (p. 51)
  • 7. 1. Introduction 1.1 Introduction to Clustering Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters. We can show this with a simple graphical example: Fig 1.1 Clusters. In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering.
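The distance criterion described above can be illustrated with a minimal Python sketch. It is not part of the project code: the function name `assign_to_nearest`, the sample points and the two fixed centers are all assumptions made for illustration, and Euclidean distance stands in for the "geometrical distance" mentioned in the text.

```python
import math

def assign_to_nearest(points, centers):
    """Group each point under the index of its closest center (Euclidean distance)."""
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        # Pick the center with the smallest geometrical distance to p.
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

# Hypothetical 2-D data: two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.3)]
centers = [(1.0, 1.0), (8.0, 8.0)]
clusters = assign_to_nearest(points, centers)
```

Here points that are "close" to the same center end up in the same group, which is the essence of distance-based clustering; real algorithms such as K-means additionally re-estimate the centers.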
  • 8. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures. Clustering Methods:
    1. Density-Based Methods: These methods treat clusters as dense regions of the data space, separated from one another by regions of lower density. These methods have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
    2. Hierarchical Methods: The clusters formed by these methods make up a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
    3. Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. They optimize an objective criterion (a similarity function), for example one in which distance is the major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
    4. Grid-Based Methods: In these methods the data space is divided into a finite number of cells that form a grid-like structure. All clustering operations performed on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (Clustering In QUEst), etc.
  • 9. 1.2 Introduction to Cloud Environment Cloud computing refers to manipulating, configuring, and accessing hardware and software resources remotely. It offers online data storage, infrastructure, and applications. Cloud computing offers platform independence, as the software is not required to be installed locally on the PC. Hence, cloud computing makes business applications mobile and collaborative. Certain services and models work behind the scenes to make cloud computing feasible and accessible to end users. The working models for cloud computing are as follows. Deployment Models: Deployment models define the type of access to the cloud, i.e., how the cloud is located. A cloud can have any of four types of access: Public, Private, Hybrid, and Community. PUBLIC CLOUD The public cloud allows systems and services to be easily accessible to the general public. A public cloud may be less secure because of its openness. PRIVATE CLOUD The private cloud allows systems and services to be accessible within an organization. It is more secure because of its private nature.
  • 10. COMMUNITY CLOUD The community cloud allows systems and services to be accessible by a group of organizations. HYBRID CLOUD The hybrid cloud is a mixture of public and private cloud, in which the critical activities are performed using the private cloud while the non-critical activities are performed using the public cloud. 1.3 Overview on Diabetes Diabetes is a disease that occurs when insulin production in the body is inadequate or the body is unable to use the produced insulin properly; this leads to high blood glucose. The body breaks down food into glucose, and this glucose needs to be transported to all the cells of the body. Insulin is the hormone that directs this glucose into the body cells. Any change in the production of insulin leads to an increase in blood sugar levels, and this can lead to damage to the tissues and failure of the organs. Generally, a person is considered to be suffering from diabetes when blood sugar levels are above normal (4.4 to 6.1 mmol/L). Types of Diabetes: The three main types of diabetes are described below: 1. Type 1 Though only about 10% of diabetes patients have this form of diabetes, there has recently been a rise in the number of cases of this type in the United States. The disease manifests as an autoimmune disease occurring at a very young age, below 20 years, and is hence also called juvenile-onset diabetes. 2. Type 2 This type accounts for almost 90% of diabetes cases and is commonly called adult-onset diabetes or non-insulin-dependent diabetes. In this case the various organs of the body become insulin resistant, and this increases the demand for insulin. At this point, the pancreas does not make the required amount of insulin. To
  • 11. keep this type of diabetes at bay, patients have to follow a strict diet and exercise routine and keep track of their blood glucose. 3. Gestational diabetes A type of diabetes that tends to occur in pregnant women, in whom sugar levels are high because the pancreas does not produce a sufficient amount of insulin. Leaving it untreated can lead to complications during childbirth. Controlling the diet and taking insulin can control this form of diabetes. Symptoms, Diagnosis and Treatment: The common symptoms of a person suffering from diabetes are: · Polyuria (frequent urination) · Polyphagia (excessive hunger) · Polydipsia (excessive thirst) · Weight gain or unexplained weight loss · Slow healing of wounds, blurred vision, fatigue, itchy skin, etc. Urine and blood tests are conducted to detect diabetes by checking for excess glucose. The tests commonly conducted to determine whether a person has diabetes are: · A1C Test · Fasting Plasma Glucose (FPG) Test · Oral Glucose Tolerance Test (OGTT).
  • 12. 1.4 Data flow diagram showing CFBA’s statistical details Fig 1.2 Data Flow
  • 13. 2. Literature Review Designing predictive models for diabetes diagnosis has been an active research area for the past decade. Most of the models found in the literature are based on different clustering algorithms and different modeling techniques. [1] Preeti Mulay (2016) ‘Threshold Computation to Discover Cluster Structure, a New Approach’, International Journal of Electrical and Computer Engineering (IJECE), Vol. 6, No. 1, pp. 275–282. With the spurt of data in almost all domains, it is essential to have modernized exploratory data methods such as incremental clustering, cluster analysis and incremental learning, to name a few. These methods are useful in varied applications which require handling a consistent influx of new data and performing forecasting, decision making and prediction. The purpose of this research paper is to broaden the abilities of the “Incremental clustering using Naïve Bayes and Closeness-Factor” (ICNBCF) [6] algorithm and to introduce a set of activities at the post-clustering phase. These activities include validating cluster structures. These modifications demonstrated enhancements in the resulting cluster structures. [2] Preeti Mulay and Kulkarni, P.A. (2013) ‘Knowledge augmentation via incremental clustering: new technology for effective knowledge management’, Int. J. Business Information Systems, Vol. 12, No. 1, pp. 68–87. This research paper uses a new statistical, error-based incremental clustering algorithm, CFBA. It discusses knowledge augmentation and incremental learning using various datasets, and gives a computational learning theoretic perspective on incremental learning. The research attempts to look at incremental clustering in a different and new way. Software project development is carried out in various phases. Each phase will have planned and actual data details. 
The alphanumeric attributes can be converted into a completely numeric dataset by applying a weight-assignment algorithm. This converted numeric dataset is given as input to the proposed new, statistical, error-based, almost parameter-free algorithm named ‘closeness factor-based algorithm’ (CFBA).
  • 14. 3. Technical Requirements 3.1 Tools and Software Used I. SPYDER • Spyder is a powerful scientific environment written in Python, for Python, and designed by and for scientists, engineers and data analysts. • It offers a unique combination of the advanced editing, analysis, debugging, and profiling functionality of a comprehensive development tool with the data exploration, interactive execution, deep inspection, and visualization capabilities of a scientific package. • Beyond its many built-in features, its abilities can be extended further via its plugin system and API. Spyder can also be used as a PyQt5 extension library, allowing developers to build upon its functionality and embed its components, such as the interactive console, in their own PyQt software. II. PUTTY • PuTTY is a client program for the SSH, Telnet and Rlogin network protocols. • These protocols are all used to run a remote session on a computer over a network. PuTTY implements the client end of that session: the end at which the session is displayed, rather than the end at which it runs. • PuTTY has been ported to various other operating systems. Official ports are available for some Unix-like platforms, with work-in-progress ports to Classic Mac OS and macOS, and unofficial ports have been contributed to platforms such as Symbian, Windows Mobile and Windows Phone.
  • 15. III. WinSCP • WinSCP is a free file transfer tool for Windows that supports FTP, SFTP and SCP. It provides a Windows Explorer style interface that lets you drag and drop files or folders between local and remote locations. • Its main function is secure file transfer between a local and a remote computer. Beyond this, WinSCP offers basic file manager and file synchronization functionality. • For secure transfers, it uses Secure Shell (SSH) and supports the SCP protocol in addition to SFTP. IV. Microsoft Excel • Microsoft Excel provides a grid interface to organize nearly any type of information. • The learning curve for Excel is short, so it is easy to become productive right away. Information workers can usually build spreadsheets for themselves, and it is rare that IT staff need to create them. • Excel makes it easy to store data, perform numerical calculations, format cells, and adjust layouts to generate output and reports to share with others. Advanced features such as subtotals, Power Pivot tables and pivot charts, the analysis toolkit, and many templates make it easy to accomplish a wide range of tasks. • It can also integrate with Analysis Services (Business Intelligence) from SQL Server. Results are easy to adjust to get the exact layout, fonts, colors, etc. that you want.
  • 16. 3.2 Technologies and Languages Used I. PYTHON • Python is a general-purpose language. It is an interpreted, interactive, object-oriented, extensible programming language. • It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms. • Python supports multiple programming paradigms, including object-oriented, imperative, functional and procedural styles. It features a dynamic type system and automatic memory management and has a large and comprehensive standard library. • Python is widely used in Artificial Intelligence, Natural Language Generation, Neural Networks and other advanced fields of Computer Science, and it places a strong focus on code readability. II. AMAZON WEB SERVICES Amazon Web Services (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms to individuals, companies and governments, on a metered pay-as-you-go basis. In aggregate, these cloud computing web services provide a set of primitive, abstract technical infrastructure and distributed computing building blocks and tools. One of these services is Amazon Elastic Compute Cloud, which allows users to have at their disposal a virtual cluster of computers, available all the time, through the Internet. AWS's virtual computers emulate most of the attributes of a real computer, including hardware (CPU(s) and GPU(s) for processing, local/RAM memory, hard-disk/SSD storage); a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, CRM, etc.
  • 17. III. AMAZON ELASTIC COMPUTE CLOUD Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers. Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal friction. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers with the tools to build failure-resilient applications and isolate them from common failure scenarios. IV. AMAZON CLOUDWATCH • Amazon CloudWatch is a monitoring and management service built for developers, system operators, and IT managers. • It monitors your Amazon Web Services (AWS) resources and the applications you run on AWS in real time. • It can be used to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources. • It can also be used to set high-resolution alarms, take automated actions, troubleshoot issues, and discover insights to optimize your applications and ensure they are running smoothly.
  • 18. 4. Software Requirement Analysis 4.1 Introduction This part of the document is a comprehensive description of the intended purpose of the research project. It describes what the “Closeness Factor Based Algorithm” (CFBA) will do when applied to a large, consistent set of data and how it helps generate clusters (using the basic clustering and incremental clustering approaches) which, upon further analysis, help in finding hidden relationships/patterns in the data. This document also lists the necessary and sufficient requirements for the research project. It describes how large data can be mined using data mining techniques like clustering and includes a set of use cases that describe how raw data is transformed into useful information. 4.2 External Interface Requirements 4.2.1 User Interfaces This is a research project with the objective of analyzing the datasets generated, using a clustering-based data mining algorithm to create clusters and to find hidden relations/patterns between various parameters of a diabetic patient. Therefore, at this stage no particular user interface has been defined. 4.2.2 Hardware Interfaces 4.2.3 Software Interfaces Closeness factor-based Algorithm (CFBA): • Initial phase: A completely new dataset forms the basic clusters.
  • 19. • Incremental phase: It compares the closeness value with the already formed clusters (centre of each cluster) and decides whether to append the new data to a matching cluster. • Final phase: The updated set of clusters is made available to the analyst for taking decisions based on the augmented knowledge. For implementing the Closeness factor-based Algorithm (CFBA), PuTTY is required with: • Version 0.71 or later • Build platform: 64-bit x86 Windows
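The incremental phase listed above (compare a closeness value with the centre of each existing cluster, then append or not) can be pictured with the following Python sketch. It is a generic illustration under stated assumptions, not the CFBA implementation: the one-number-per-row closeness value, the use of the mean as the cluster centre, the `threshold` parameter and the function name `incremental_update` are all hypothetical.

```python
def incremental_update(clusters, new_value, threshold):
    """Append new_value to the closest matching cluster, or start a new one.

    clusters  -- list of lists of closeness values (assumed representation)
    threshold -- assumed maximum gap to a cluster centre that counts as a match
    """
    best_index, best_gap = None, None
    for i, members in enumerate(clusters):
        centre = sum(members) / len(members)   # centre taken as the mean (assumption)
        gap = abs(new_value - centre)
        if best_gap is None or gap < best_gap:
            best_index, best_gap = i, gap
    if best_index is not None and best_gap <= threshold:
        clusters[best_index].append(new_value)  # matching cluster found
    else:
        clusters.append([new_value])            # no match: form a new cluster
    return clusters

# Hypothetical closeness values from an initial phase.
clusters = [[0.10, 0.12], [0.50, 0.54]]
incremental_update(clusters, 0.13, threshold=0.05)  # close to the first centre
incremental_update(clusters, 0.90, threshold=0.05)  # far from every centre
```

The point of the sketch is only that existing clusters are reused rather than rebuilt when new data arrives; how CFBA actually computes and compares closeness is defined in Section 7.4.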
  • 20. 5. High Level Design 5.1 Entity Relationship Diagram Fig 5.1 Entity Relationship Diagram
  • 21. 5.2 UML Diagrams 5.2.1 Use Case Diagram Fig 5.2 Use Case Diagram
  • 22. 5.2.2 State Chart Diagram Fig 5.3 State Chart Diagram
  • 23. 5.2.3 Activity Diagram Fig 5.4 Activity Diagram
  • 24. 6. Project Plan 6.1 Project Flow Fig 6.1 Project Flow 6.2 Development Plan Fig 6.2 Basic Clustering Algorithm Development Plan
  • 25. Fig 6.3 Incremental Clustering Algorithm Development Plan
  • 26. 6.3 Project Timeline Fig 6.4 Project Timeline
  • 27. 6.4 RMMM Plan The goal of the risk mitigation, monitoring and management plan is to identify as many potential risks as possible. Risk Description • Project Risks: Potential schedule, personnel, resource, and requirements problems and their impact on the project. These threaten the project plan: if project risks become real, it is likely that the project schedule will slip and that costs will increase. • Technical Risks: Potential design, implementation, interface, verification, and maintenance problems. Technical risks threaten the quality and timeliness of the software to be produced. If a technical risk becomes a reality, implementation may become difficult or impossible. • Network Risks: Network failure and other network-related risks. • Support Risks: The degree of uncertainty about whether the resultant software will be easy to correct, adapt, and enhance.
  • 28. 7. Implementation 7.1 System Model Fig 7.1 System Model
  • 29. 7.2 Process Flow The model used below is a modification of the general waterfall process model and presents the basic architectural flow of our research project. This model contains five stages, namely: topic exploration, data collection and exploration, data cleansing, exploratory data mining and, lastly, integration and verification. Each of these stages is detailed below: Fig 7.2 Process Flow
  • 30. Step-1: Topic exploration · Subject and topic selection: Selecting the subject for research and briefly justifying its selection. · Survey of available literature: Checking the availability and legitimacy of the data and facts that need to be mined for information generation. · Review of background information: Using research guides and websites related to the subject of research to gather the information needed to move ahead in the right direction. · Refining the research matter and developing an initial hypothesis: Creating a detailed question from the information gathered on the selected topic in order to develop an initial hypothesis which will be answered by this research project. Step-2: Data collection and exploration of focused information · Collecting information in a more focused manner based on the research problem. This involves understanding and uncovering hidden relations and important aspects in order to determine the various data parameters needed to create the data sets which will be mined for information in the later stages. · Collection of data sets: Collecting data from professional clinics and hospitals as well as from data warehouses available online to generate data sets for the stated objective. Step-3: Extraction and cleansing of data for analysis • This stage involves checking the consistency and validity of the gathered data by analyzing and removing records which show inconsistencies such as missing or invalid data parameters. Step-4: Exploratory data mining • This step involves searching for or developing data mining algorithms which will help confirm the hypothesis stated for the research project. Here, data mining tools are used to apply or build the algorithms to be run on the collected data sets. 
Step-5: Integration and verification
• In this step the data mining algorithm is applied to the collected and pre-processed data sets to generate information that helps confirm the initial hypothesis of the research project and provides a basis for uncovering hidden relationships among the parameters used.
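The cleansing rule of Step-3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's own code: the function name clean_rows is hypothetical, the column names are assumed to match the Pima diabetes CSV, and treating negative or non-finite values as invalid is an illustrative choice.

```python
import csv
import io
import math

# Assumed column names of the input CSV (one per input attribute).
COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]


def clean_rows(csv_text):
    """Return only the rows whose eight parameters are present and valid."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            values = [float(row[name]) for name in COLUMNS]
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric parameter: drop the record
        if any(v < 0 or not math.isfinite(v) for v in values):
            continue  # assumption: negative or non-finite values are invalid
        kept.append(values)
    return kept
```

The function takes the CSV content as a string, so it can be tried on a few lines of text before being pointed at a full data file.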
7.3 Selection of Attributes
CFBA and the incremental algorithm were applied to the diabetes patients' datasets, taking 8 attributes into account. These attributes were chosen because they commonly occur in every diabetic patient's report and because of their relative importance among the factors triggering the disease. The input attributes used as parameters in the study are:
Sr. No | Attribute | Description
1 | Pregnancies | Number of times pregnant
2 | Glucose | Plasma glucose concentration 2 hours in an oral glucose tolerance test
3 | BloodPressure | Diastolic blood pressure (mm Hg)
4 | SkinThickness | Triceps skin fold thickness (mm)
5 | Insulin | 2-Hour serum insulin (mu U/ml)
6 | BMI | Body mass index (weight in kg / (height in m)^2)
7 | DiabetesPedigreeFunction | Diabetes pedigree function
8 | Age | Age (years)
7.4 Algorithm
The present work implements an incremental clustering algorithm and analyzes its results. The algorithm is a modified version of CFBA (Closeness Factor Based Algorithm) that groups patients with similar diabetic parameters, which makes the doctor's task of providing treatment easier. The algorithm is implemented in Python using the Spyder IDE (version 3.2.8) from the Anaconda distribution, and is run through PuTTY (version 0.71) after AWS deployment.
7.4.1 Basic clustering Algorithm:
1. Enter the name of the dataset.csv file to be imported.
2. For all the values in the dataset
   a) Calculate the row total
   b) Calculate the column total
3. Calculate the summation of the row totals and store it in a variable, for example sum
4. For all the tuples / rows
   a) Calculate Probability using the formula
      Probability = row_total / (row_total + next_row_total)
   b) Calculate Error using the probability (calculated in 4(a)) and the summation of row totals (calculated in 3)
      error = (probability * sum) / sqrt(sum * probability * (1 - probability))
   c) Calculate Weight using the row total (calculated in 2(a))
      weight = sqrt(row_total)
5. To calculate the Closeness Factor, each tuple is paired with each of the remaining tuples in a loop, and the Closeness Factor of each pair is calculated using the operations below
   a) Calculate the summation of both the tuples for each parameter
   b) Calculate the Error for each parameter
      ex1 = probability of primary tuple * parameter summation (calculated in 5(a))
      ex2 = primary tuple parametric value
      ex3 = probability of primary tuple * parameter summation (calculated in 5(a))
      ex4 = 1 - probability of primary tuple
      Error = (ex1 - ex2) / sqrt(ex3 * ex4)
   c) Calculate Error square using Error (calculated in 5(b))
   d) Calculate Weight using parameter summation (calculated in 5(a))
      Weight = sqrt(parameter_summation)
   e) Calculate the product of Error square (calculated in 5(c)) and Weight (calculated in 5(d))
      Mul = Error_square * Weight
   f) Calculate Closeness Factor
      Closeness Factor = (summation of Mul (calculated in 5(e)) over all parameters) / (summation of Weight (calculated in 5(d)) over all parameters)
   Repeat all the above operations in step 5 for all the tuples in the data set.
6. Attach an index to each Closeness Factor value in order to identify the pair of tuples for which the Closeness Factor was calculated.
7. After the indexing, sort the Closeness Factor values in ascending order.
8. Determine the number of clusters and the range of each cluster explicitly.
9. Compare the sorted Closeness Factor values with the ranges and store each value, with its index, in the cluster for the matching range.
10. Write all the above data to an Excel file, to a CSV file, or to different sheets of the same dataset file.
11. End
To process an additional dataset, the incremental algorithm is used, which is an extended version of the basic clustering algorithm.
7.4.2 Incremental clustering Algorithm:
1. Enter the name of the Additional_dataset.csv file to be imported.
2. Perform steps 1 to 7 of the basic clustering algorithm described above.
3. After performing the above operations, do not define new cluster ranges explicitly. Instead, compare the new Closeness Factor values of the additional dataset with the predefined ranges and append the values to the matching clusters. If a value does not fit any of the ranges, create a new cluster and store the value in it.
4. Write the above data to the same Excel or CSV file.
5. End
7.5 Code
Following the algorithm described above, we implemented the code in Python. It is available in the Git repository https://github.com/avinashbarfa/Increamental-Clustering
Fig 7.3 Basic Clustering
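As an illustration of steps 5 to 7 of the basic clustering algorithm, the pairwise Closeness Factor can be sketched as below. This is a minimal sketch and not the code in the repository: the function names are hypothetical, the probability of the primary tuple is assumed to be its row total divided by the sum of both row totals, parameters whose summation is zero are skipped to avoid a division by zero, and both row totals are assumed to be positive.

```python
import itertools
import math


def closeness_factor(primary, secondary):
    """Closeness Factor of one pair of tuples, following steps 5(a) to 5(f)."""
    total_p, total_q = sum(primary), sum(secondary)
    prob = total_p / (total_p + total_q)  # assumed probability of the primary tuple
    mul_sum = 0.0
    weight_sum = 0.0
    for p_val, q_val in zip(primary, secondary):
        param_sum = p_val + q_val                      # 5(a)
        if param_sum == 0:
            continue  # assumption: skip parameters that are zero in both tuples
        expected = prob * param_sum                    # ex1 and ex3
        error = (expected - p_val) / math.sqrt(expected * (1 - prob))  # 5(b)
        weight = math.sqrt(param_sum)                  # 5(d)
        mul_sum += error ** 2 * weight                 # 5(c) and 5(e)
        weight_sum += weight
    return mul_sum / weight_sum                        # 5(f)


def sorted_closeness(rows):
    """Index every pair of rows (step 6) and sort by Closeness Factor (step 7)."""
    pairs = [((i, j), closeness_factor(rows[i], rows[j]))
             for i, j in itertools.combinations(range(len(rows)), 2)]
    return sorted(pairs, key=lambda item: item[1])


if __name__ == "__main__":
    # Illustrative values only, in the attribute order of Section 7.3.
    sample = [
        [6, 148, 72, 35, 0, 33.6, 0.627, 50],
        [1, 85, 66, 29, 0, 26.6, 0.351, 31],
        [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    ]
    for pair, cf in sorted_closeness(sample):
        print(pair, cf)
```

With these formulas the value is the same whichever tuple of a pair is taken as primary, so the sketch evaluates each unordered pair once. For n tuples that is n(n-1)/2 evaluations.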
Fig 7.4 Incremental Clustering
• Input File Format – CSV (Comma Separated Values)
• Output File Format – XLSX (Excel Workbook)
  o ClusterData.xlsx – Contains the clusters.
  o Basic_Diabetes_data.xlsx – Contains the parameters calculated for basic clustering.
  o Incremenatl_Diabetes_data.xlsx – Contains the parameters calculated for incremental clustering.
7.6 Deployment on AWS EC2
At this stage we achieve the project goal. After running and testing the code on the local machine, we deployed it on an Amazon EC2 instance. We selected the following Amazon Machine Image (AMI) and EC2 configuration:
Operating System – Ubuntu 18.04 LTS (64-bit)
Volume Type – SSD
The status of the AWS resources used in the project is shown below.
Fig 7.5 AWS Resource Status
After launching the instance, we use PuTTY to access it through its floating/public IP address and the security key provided by Amazon when the instance was created.
Fig 7.6 AWS Service Health
Fig 7.7 Instance Accessing Using Putty
Next, we configure the instance to run our algorithm, i.e. we install python3 and its libraries NumPy, pandas and openpyxl (the Excel writer). The following commands are run on the instance in sequence:
sudo apt-get update
sudo apt install python3-pip
sudo apt install python3-numpy
sudo apt install python3-pandas
sudo apt install python3-openpyxl
The instance is now configured as required. We then upload our Python program to the EC2 machine using WinSCP, placing the Python file and the data sheet (in CSV format) in a newly created directory, ProjectWork.
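Before running the uploaded program, it can be worth confirming that the libraries installed on the instance are visible to python3. The short check below is a sketch that is not part of the project; the function name missing_modules is hypothetical, and it uses only the standard library.

```python
import importlib.util


def missing_modules(names=("numpy", "pandas", "openpyxl")):
    """Return the module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all dependencies found")
```

Saved to a file and run with python3 on the instance, it lists any package that still needs to be installed.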
Fig 7.8 WinSCP Instance File Structure
Fig 7.9 Putty Instance File Structure
After entering the ProjectWork directory in PuTTY, we type the command python3 pythonfile to execute the algorithm.
Fig 7.10 Command to Run Python File
After pressing Enter, the code begins to execute on the Amazon EC2 instance. The following screenshots were taken while the code was executing.
Fig 7.11 Python Code Execution -1
Fig 7.12 Python Code Execution -2
7.7 Monitoring Analysis
Amazon provides a monitoring service, CloudWatch, that lets the user monitor an instance while it is in the active state. CloudWatch can be accessed through the AWS console.
Fig 7.13 AWS Monitoring Console -1
Fig 7.14 AWS Monitoring Console -2
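The same CPU metric can also be read programmatically instead of through the console. The sketch below is an assumption-laden illustration, not something the project did: it relies on the third-party boto3 library and configured AWS credentials, the function names are hypothetical, and the instance ID and region in the final comment are placeholders.

```python
from datetime import datetime, timedelta, timezone


def cpu_metric_request(instance_id, minutes=60, period=300):
    """Build the parameters for a CloudWatch CPUUtilization query."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,            # seconds per datapoint
        "Statistics": ["Average"],
    }


def fetch_cpu_utilization(instance_id, region_name):
    """Return the datapoints in time order (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builder works without it
    client = boto3.client("cloudwatch", region_name=region_name)
    response = client.get_metric_statistics(**cpu_metric_request(instance_id))
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Placeholder call: fetch_cpu_utilization("i-0123456789abcdef0", "ap-south-1")
```

Each returned datapoint carries a Timestamp and an Average value, which can be logged alongside the run times of the clustering program.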
Below are the CloudWatch screenshots of CPU utilization and network traffic taken while our code was executing.
Fig 7.15 CPU Utilization at Starting Stage
Fig 7.16 CPU Utilization at Intermediate Stage
Fig 7.17 Network In Status
Fig 7.18 Network Out Status
8. Result & Analysis
8.1 Results
This section shows the results obtained by executing the CFBA algorithm on the diabetes dataset used in the project, the PIMA dataset obtained from Kaggle. The dataset consists of 8 attributes and 768 instances. All patients are women of Pima Indian heritage aged 21 years or older. The study looked for relations between the parameters and carried out a cluster analysis to identify which parameters cause a data series to be an outlier. The observations are as follows –
1) Finding different parameters for CFBA analysis: After reading the CSV file, we calculated the row totals and column totals and, from them, the values of the parameters Probability, Error and Weight.
Fig 8.1 Required Parameters
2) Using CFBA, finding the CF values: For the CFBA algorithm we then calculated the per-parameter Error and Weight and their product. These are used to calculate the Closeness Factors for the data series.
Fig 8.2 Closeness Factor
8.2 Analysis
The aim of our project can be broken down into two primary objectives:
1. Organizing the closeness values according to specific ranges.
2. Generating clusters according to the above ranges.
We analyzed two datasets, PIMA Diabetes and WINE, with this distributed incremental clustering algorithm; additional datasets can also be processed and analyzed. The primary dataset taken into consideration was the PIMA Diabetes dataset.
1. Organizing the closeness values according to specific ranges:
Here, we have defined custom ranges for the closeness values –
1) Cluster 1 – From 0.0001 to 0.066151499
2) Cluster 2 – From 0.066151499 to 0.106095987
3) Cluster 3 – From 0.106095987 to 0.156869097
4) Cluster 4 – Above 0.156869097
2. Generating clusters, according to the above ranges:
Fig 8.3 Cluster 1
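The range-based assignment above, combined with the incremental rule of Section 7.4.2, can be sketched as follows. This is an illustrative sketch rather than the project's code: the function names are hypothetical, each upper bound is assumed to belong to the lower cluster, and the label 5 for a newly created cluster is an assumption.

```python
import bisect

LOWER_LIMIT = 0.0001                                   # start of Cluster 1
BOUNDARIES = [0.066151499, 0.106095987, 0.156869097]   # upper bounds of Clusters 1 to 3
NEW_CLUSTER = 5                                        # hypothetical label for a new cluster


def assign_cluster(cf):
    """Return the cluster number (1 to 4) for a Closeness Factor, or None."""
    if cf < LOWER_LIMIT:
        return None  # fits none of the predefined ranges
    # Assumption: a value equal to a boundary stays in the lower cluster.
    return bisect.bisect_left(BOUNDARIES, cf) + 1


def add_incrementally(clusters, indexed_cfs):
    """Append (pair, cf) items to existing clusters, creating one if needed."""
    for pair, cf in indexed_cfs:
        label = assign_cluster(cf)
        if label is None:
            label = NEW_CLUSTER
        clusters.setdefault(label, []).append((pair, cf))
    return clusters
```

Because Cluster 4 has no upper bound, only values below 0.0001 fall outside the predefined ranges in this sketch.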
Fig 8.4 Cluster 2
Fig 8.5 Cluster 3
Fig 8.6 Cluster 4
8.3 Distributed View
The incremental clustering project runs on multiple instances, which helps in achieving elastic web-scale computing, complete control and flexibility in cloud hosting. The example below shows one master instance with two slave instances.
8.4 Time Estimation
Tuples (No. of Rows) | On Local Machine (Approx.) | On AWS (Approx.)
100 | 5 minutes 49 seconds | 5 minutes 43 seconds
200 | 44 minutes 31 seconds | 38 minutes 19 seconds
300 | 4 hours 47 minutes 21 seconds | 4 hours 16 minutes 36 seconds
400 | 11 hours 36 minutes 45 seconds | 11 hours 03 minutes 17 seconds
500 | 1 day 02 hours 32 minutes | 1 day 0 hours 11 minutes
650 | 1 day 16 hours | 1 day 13 hours
767 | 2 days 6 hours | 2 days 1 hour
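To compare the two columns more directly, the approximate times above can be converted to seconds and set against the number of tuple pairs evaluated in step 5 of Section 7.4.1. The sketch below only restates the reported figures; the helper names are hypothetical, and the pair count n(n-1)/2 assumes each unordered pair is evaluated once.

```python
def seconds(days=0, hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + secs


# (rows, local machine, AWS), taken from the time estimation table.
TIMINGS = [
    (100, seconds(minutes=5, secs=49), seconds(minutes=5, secs=43)),
    (200, seconds(minutes=44, secs=31), seconds(minutes=38, secs=19)),
    (300, seconds(hours=4, minutes=47, secs=21), seconds(hours=4, minutes=16, secs=36)),
    (400, seconds(hours=11, minutes=36, secs=45), seconds(hours=11, minutes=3, secs=17)),
    (500, seconds(days=1, hours=2, minutes=32), seconds(days=1, hours=0, minutes=11)),
    (650, seconds(days=1, hours=16), seconds(days=1, hours=13)),
    (767, seconds(days=2, hours=6), seconds(days=2, hours=1)),
]


def pair_count(n):
    """Number of unordered tuple pairs among n rows."""
    return n * (n - 1) // 2


def summarize():
    """Return (rows, pairs, local-to-AWS time ratio) for each table row."""
    return [(rows, pair_count(rows), local / aws) for rows, local, aws in TIMINGS]


if __name__ == "__main__":
    for rows, pairs, ratio in summarize():
        print(f"{rows} rows, {pairs} pairs, local/AWS time ratio {ratio:.2f}")
```

A ratio above 1 means the AWS run was faster for that row count. Dividing either time column by the pair count gives the time per pair, which is not constant across the reported rows.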
9. Conclusion and Future Scope
The ability of our algorithm to detect high-impact parameters at an early stage can prove to be key for treatment and prevention. The algorithm can assist doctors and endocrinologists (doctors who specialize in treating diabetes). The work shows how incremental clustering can be used to model the actual diagnosis of diabetes for local and systematic treatment: the algorithm collects and analyzes the medical records of diabetic patients with knowledge discovery techniques to extract information. We used the CFBA algorithm to compute parameters such as probability, error and closeness. Using these parameters we create clusters according to the closeness value, and to incorporate new data we implemented incremental clustering. We then created a free-tier instance and, using PuTTY and WinSCP, deployed the code on AWS (Amazon Web Services).
10. References
• Dataset: https://www.kaggle.com/akashkr/pima-indian-diabetes
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
• https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
• https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/concepts.html
• http://blog.adeel.io/2016/11/19/installing-pandas-scipy-numpy-and-scikit-learn-on-aws-ec2/
• https://panda.readthedocs.io/en/latest/amazon.html
• https://www.ssh.com/ssh/putty/putty-manuals/0.68/index.html
• https://winscp.net/eng/docs/start