Mining Stream Data using k-Means clustering Algorithm
Medi Manishankar (1), Dr. K. Venkateswara Rao (2)
(1) PG Scholar in CSE, CVRCE, Hyderabad, India, manishankarmedi@gmail.com
(2) Professor of CSE, CVRCE, Hyderabad, India, kvenkat.cse@gmail.com
Abstract-Stream data is data that arrives continuously as time-stamped sequences. It may come from
various locations with varying update rates and with high dimensionality. Because stream data flows fast
and changes quickly, it may not be possible to process every element, and storage is also a typical
issue. Numerous methods such as random sampling, sliding windows and histograms are available to handle
stream data. Stream data has many applications, such as traffic data analysis, telecommunications and network
data analysis, and stock market data analysis. Data mining techniques such as association analysis, classification and
clustering are generally used for stream data analysis. In this paper, the k-means clustering algorithm is used for
mining urban road traffic stream data of a particular city. The stream data is handled using the sliding window
technique. The clusters are shown graphically using visualization techniques in Python. The clusters are updated
in real time to enable people to understand the behavior of the traffic. The results are described in this paper.
Keywords: Stream data, clustering, sliding windows, k-Means clustering, urban road traffic
1. INTRODUCTION
In this era of big data, data is produced at
extremely high speed and in huge volumes. In many
cases the data arrives as data streams, making
the approach of storing and querying the data
offline infeasible. According to the Digital
Universe study [3], over 2.8 ZB of data were created
and processed in 2012, with a projected increase of
15 times by 2020. This growth in the production of
digital data results from our surrounding
environments being equipped with many sensors. The
sensors sense information and send various
parameters that need to be analyzed online.
In contrast to a stored dataset, a data stream is a
large volume [9] of data arriving continuously with
a time stamp, and it is either unnecessary or
impractical to store the data in some form of memory.
For many recent applications [3], the term data
stream is more suitable than dataset. Examples of
domains producing stream data are
telecommunications and networking, stock
markets, and e-commerce. In telecommunications,
for example, calling records, Weblogs and Web page
clicks need to be analyzed as they arrive
continuously. Stream data is difficult to handle
because it is temporally ordered, fast-changing
and potentially infinite.
So the biggest challenge in stream data is
handling it before it expires [3], [4],
and performing the required analysis. Because
multiple scans are not possible on stream data, we
have to use single-scan algorithms that give the
best output in one pass. Stream data is also likely
to have high dimensionality, that is, a large number
of attributes for each record, in some large,
multipurpose applications such as satellite-mounted
remote sensors and real-time robotic applications.
In addition, each attribute may take a wide range
of possible values.
International Journal of Research
Volume 7, Issue IX, September/2018
ISSN NO: 2236-6124
Page No:390
For effective processing of stream data, new
data structures, techniques, and algorithms are
needed. Because we do not have an infinite amount
of space to store stream data, we often trade off
accuracy against storage. That is, we generally
are willing to settle for approximate rather than
exact solutions. Many data stream based algorithms
compute an approximate answer within a factor ε of
the actual answer, with high probability. Generally,
as the approximation factor goes down, the space
requirements go up. Some common data structures
and techniques used for stream data processing are
sliding window, random sampling, and histograms
[1], [10].
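One common way to state such a guarantee precisely is the formulation sketched below. The symbols are notation introduced here for illustration and are not taken from the paper: A is the exact answer, \hat{A} is the estimate computed from the stream, ε is the approximation factor, and δ is the allowed failure probability.

$$\Pr\left[\, |\hat{A} - A| \le \varepsilon A \,\right] \ge 1 - \delta$$

Under this reading, "with high probability" means that δ is small, and a smaller ε or δ generally calls for a larger summary of the stream.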
Among the stream data processing techniques,
the sliding window is a practical choice
since it makes decisions based on recent
data instead of running computations on all of the
data seen so far, and it runs computations in main
memory. This keeps processing effective and
limits memory and storage consumption.
The main problem in data
stream mining is discovering knowledge or trends
using supervised or unsupervised learning
techniques. In unsupervised learning, clustering
leads to the discovery of hidden information.
Clustering can be defined as the grouping of
similar objects based on their properties (attributes).
Clustering should be done in such a way that
objects are very similar to the other objects within
the same cluster and dissimilar to objects in
other clusters.
This paper describes clustering of urban road
traffic stream data. Our objective is to visualize
and discover the density of the traffic. This will help
users traveling on urban roads to know which
roads are busy, and help traffic authorities
keep track of road traffic in real time and make
decisions quickly to avoid traffic congestion.
The visual graphs are updated in real time to enable
people to easily understand traffic patterns.
2. STREAM DATA PROCESSING
In the following sections, we discuss stream
data processing techniques, applications, challenges,
and issues. Data Stream Management Systems are also
discussed.
2.1 Techniques for Stream Data Processing
Numerous methods exist for processing stream
data: random sampling [1], [10], [11], which takes
a sample of data from the entire data stream
(reservoir sampling is one such technique); the
sliding window, which performs computations on
recent data [2], [9]; and histograms, which
partition the data stream into buckets for
further processing.
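Reservoir sampling, mentioned above, can be sketched as follows. This is a minimal illustration of the classic technique rather than code from the paper; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item     # keep the new item with probability k / (i + 1)
    return reservoir
```

Only k items are held in memory at any time, and the stream is scanned once, which matches the single-pass constraint of stream processing.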
In the sliding window concept, the main idea is
that, instead of running computations on all the
data seen so far, computations are run on the recent
data and decisions are made from it [1]. More
formally, at every time t, a new data element
arrives. This element “expires” at time t + w,
where w is the window “size” or length.
The sliding window model is useful for stocks
or sensor networks, where only recent events are
important. It also reduces memory requirements
because only a small window of data is stored.
There are two types of sliding window:
count-based and time-based. In the count-based
sliding window, we fix the number of records to be
processed in one pass; in the time-based sliding
window, we specify the time interval after which
new records enter the window.
Clustering streaming data is particularly challenging
because it involves dynamically merging and
splitting evolving clusters over which statistical
summaries are maintained in real time as the stream
progresses [2].
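The two window types described in this section can be sketched as below. This is an illustrative sketch, not the paper's implementation: the class names `CountWindow` and `TimeWindow` are hypothetical, and the time-based window assumes that records arrive in non-decreasing timestamp order.

```python
from collections import deque

class CountWindow:
    """Count-based window: keeps only the most recent `size` records."""
    def __init__(self, size):
        self.items = deque(maxlen=size)   # the oldest record is dropped automatically

    def add(self, record):
        self.items.append(record)
        return list(self.items)

class TimeWindow:
    """Time-based window: a record arriving at time t expires at time t + w."""
    def __init__(self, w):
        self.w = w
        self.items = deque()              # (timestamp, record) pairs in arrival order

    def add(self, t, record):
        self.items.append((t, record))
        # Evict every record whose expiry time (arrival time + w) has been reached.
        while self.items and self.items[0][0] + self.w <= t:
            self.items.popleft()
        return [r for _, r in self.items]
```

In both cases memory use is bounded by the window contents rather than by the length of the stream.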
2.2 Data Stream Processing Applications,
Challenges, and Issues
Many recent applications [4], [5]
produce massive amounts of data.
For example, in the telecommunications and network
area, the central database system keeps track of
user calling records, web clicks, network
monitoring, fraud calls, call drops, etc. All of
these data need to be analyzed online as they
arrive, in order to make decisions that rectify
problems in real time.
In the business area, large numbers of buyers
and sellers exchange equities in the financial
or stock market. The data therefore needs to be
analyzed in real time to inform people about
trends in the stock market.
In urban road traffic, large numbers of vehicles
travel on the roads. With the arrival of the IoT,
automobiles are equipped with advanced
sensors that help collect data useful for stream
data analysis. This analysis lets us discover
dense areas in real-time road traffic and supports
decisions on modifying signaling
systems for hassle-free travel.
However, dealing with stream data raises
challenges and issues: the data comes
from complex environments with
continuous flow and high dimensionality, and
some records may have missing data.
Three principal challenges [3] are
involved: Volume, Velocity, and Volatility. Data
preprocessing is important here to remove noisy
data and redundant data, eliminate outliers, and
fill in missing values.
Some other challenges [4] that should be
mentioned are:
• Data Uncertainty
• Data type treatment
• Cluster validity
• High Dimensionality of Data.
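The preprocessing steps mentioned in this section (removing redundant records, filling in missing values, and eliminating outliers) could look roughly like the sketch below. The paper does not specify its preprocessing rules, so everything here is an assumption made for illustration: records are (x, y) tuples, a missing value is represented by None, missing values are replaced by the column mean, and a record is treated as an outlier when any coordinate lies more than `z_max` standard deviations from the mean.

```python
from statistics import mean, pstdev

def preprocess(records, z_max=3.0):
    """Clean one window of (x, y) records; z_max is an assumed outlier threshold."""
    # 1. Drop exact duplicates while preserving arrival order.
    unique = list(dict.fromkeys(records))
    # 2. Fill missing coordinates (None) with the mean of the observed values.
    cols = list(zip(*unique))
    means = [mean(v for v in col if v is not None) for col in cols]
    filled = [tuple(m if v is None else v for v, m in zip(rec, means))
              for rec in unique]
    # 3. Drop records lying more than z_max standard deviations from the mean.
    cols = list(zip(*filled))
    mus = [mean(col) for col in cols]
    sds = [pstdev(col) or 1.0 for col in cols]   # guard against zero spread
    return [rec for rec in filled
            if all(abs(v - mu) / sd <= z_max for v, mu, sd in zip(rec, mus, sds))]
```

A column in which every value is missing would make the mean undefined, so a real pipeline would need an explicit rule for that case.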
2.3 Data Stream Management Systems (DSMS)
Processing stream data requires data stream
management. Data Stream Management
Systems are software systems that manage and
support querying of continuous data streams. A
data stream management system takes the data
as input, processes it with the help of a query
processor, and stores the results in databases.
Data stream management systems [2], [7] have
emerged to support a large class of applications
such as stock trading, network traffic monitoring,
sensor data analysis, real-time data warehousing, etc.
A DSMS processes continuous queries over
high-volume and time-varying data streams. In a
Continuous Query (CQ) system, a user registers
queries specifying their interests over unbounded,
streaming data sources.
A query engine continuously evaluates the
query results as new data arrives from the
sources and delivers unbounded streaming outputs
to the appropriate users. A core operator in a CQ
system is the sliding window join among streams.
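To make the sliding window join concrete, a minimal sketch is given below. It is not taken from any particular DSMS. The input layout is an assumption made for the example: one time-ordered iterable of (t, stream_id, key, value) tuples with stream_id 0 or 1, joined on equal keys whenever both tuples are still inside a window of length w.

```python
from collections import deque

def window_join(events, w):
    """Equi-join two streams over a sliding window of length w.

    Yields (key, value_from_stream_0, value_from_stream_1) for each match.
    """
    windows = (deque(), deque())          # live tuples of stream 0 and stream 1
    for t, sid, key, value in events:
        # Expire tuples that have left the window in both streams.
        for win in windows:
            while win and win[0][0] + w <= t:
                win.popleft()
        # Probe the opposite stream's window for matching keys.
        for _, key2, value2 in windows[1 - sid]:
            if key2 == key:
                yield (key, value, value2) if sid == 0 else (key, value2, value)
        windows[sid].append((t, key, value))
```

Both windows hold every tuple that has not yet expired, so memory grows with the stream rate and with w, which is why large windows on fast streams can become expensive.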
The following figure depicts the DSMS
architecture.
Fig 1 Data stream Management Systems (DSMS)
Various components of the DSMS are as follows.
The meta-database stores the schema used to express queries.
The local database maintains, in main memory, data
representing current data summaries as well as
historical data, using main-memory indexes for
efficient search.
The continuous query processor optimizes and executes
continuous queries over the data streams by
accessing the meta-data and the local database.
Continuous queries can span both streaming and
local data.
However, a DSMS must meet some requirements to be
efficient, and these affect scalability:
1. Arriving elements have to be processed on the fly.
2. Data streams have to be processed in a single pass
or scan over the data.
3. The processing engine needs to have low latency and
high throughput.
4. For high stream rates and large window sizes, a
sliding window join might consume a large amount
of memory to store tuples.
3. DATA STREAM CLUSTERING
Clustering is the process of grouping similar
objects into the same group. A cluster is a collection
of data objects that are similar to one another within
the same cluster and are dissimilar to the objects in
other clusters. Clustering algorithms, in general,
have been categorized into five types: partitioning,
hierarchical, density-based, grid-based, and
model-based. Because the data must be processed
in a single scan, partitioning-based algorithms,
which follow a divide-and-conquer strategy, are
most suitable (one-scan divide-and-conquer
approaches have been widely used to cluster data
streams [6]).
Among partitioning-based techniques, K-Means
and K-Medoids [1], [6] are very popular in the data
mining world. Recently, Ackermann et al. proposed an
online K-Means algorithm by maintaining a small
sketch of the input using the merge-and-reduce
technique. The K-Means algorithm is well suited
to numerical data. When the number of
variables is large, K-Means is usually
computationally faster than hierarchical
clustering, provided k is kept small.
3.1 k-Means Clustering Algorithm for Stream
Data Processing
The main steps of the K-Means clustering
algorithm are given below.
Algorithm: k-Means Clustering Algorithm
Input: D = {r1, r2, r3, …, rn} and K value
Output: K clusters
Method:
  Initialize mean values for the K clusters randomly,
  say m1, m2, m3, …, mk;
  Repeat
    Assign each record in D to the cluster whose mean
    is most similar to it;
    Calculate the new mean for each cluster;
  Until convergence criteria are met;
The flow chart in Fig 3 describes the
algorithm.
Fig 3 k-Means clustering algorithm flow chart
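The steps of the algorithm can be sketched in Python as follows. This is a minimal batch version for illustration, not the authors' code: the function name `kmeans`, the parameters `max_iter` and `seed`, and the choice to keep the old mean when a cluster becomes empty are all assumptions of this sketch. It requires Python 3.8 or later for `math.dist`.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=None):
    """Batch k-means on a list of point tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)             # k records chosen as initial means
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest mean.
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        # Update step: recompute each mean from the records assigned to it.
        new_centroids = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members)
                                           for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep the mean of an empty cluster
        if new_centroids == centroids:            # convergence: the means stopped changing
            break
        centroids = new_centroids
    return centroids, labels
```

The loop stops either when the means no longer change or after `max_iter` iterations, whichever comes first.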
When clustering data with the k-Means
algorithm, the value of k, the number of
clusters, must be set first. After fixing k,
the centroid of each cluster is initialized
with a random value. For mining urban
road traffic, two coordinate values are needed,
since a location has two coordinates in a
two-dimensional view. So each cluster is given a
centroid C(x, y). Here, a centroid is a data point
(imaginary or real) at the center of a cluster.
The algorithm first randomly selects k of
the objects, each of which initially represents a
cluster mean or center. Each of the remaining
objects is assigned to the cluster to which
it is the most similar, based on the distance between
the object and the cluster mean. The distance can be
calculated by using equation (1).
$$d(x, m_i) = \sqrt{\sum_{j=1}^{n} \left(x.a_j - m_i.a_j\right)^2} \qquad (1)$$
Where,
n = number of attributes for an object,
x.a_j is the jth attribute of object x,
m_i.a_j is the mean of the jth attribute in cluster i.
After assigning the objects to each cluster,
update the cluster centroid Ci(X, Y) by
calculating the mean values in the cluster. It can be
calculated by
$$C_x = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$
$$C_y = \frac{y_1 + y_2 + y_3 + \dots + y_n}{n}$$
$$C_i = (C_x, C_y)$$
where n here is the number of objects in the cluster.
Now, repeat the process of assigning objects to
clusters by calculating distances as described above.
This process continues until it meets the
convergence criteria.
Convergence criteria can be described in 3 ways.
• There is no change in the mean values of the
clusters. The clusters are stabilized.
• Stop clustering after a fixed number of iterations.
• The sum of the squared distances of each record to
its "representative mean" in each cluster is less than
some threshold value. This sum can be
calculated by using equation (2).
$$E = \sum_{i=1}^{K} \sum_{x \in C_i} d^2(x, m_i) \qquad (2)$$
Where,
d is the distance,
K = no. of clusters,
x is an object in the ith cluster C_i and m_i is its mean.
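The distance of equation (1), the centroid update, and the squared-error sum of equation (2) can be written in Python as sketched below. This is an illustration under stated assumptions, not the authors' code: the function names `distance`, `centroid` and `sse` are hypothetical, records are assumed to be tuples of numbers, and each cluster is assumed to be a non-empty list of records.

```python
import math

def distance(x, m):
    """Equation (1): Euclidean distance between record x and cluster mean m."""
    return math.sqrt(sum((xj - mj) ** 2 for xj, mj in zip(x, m)))

def centroid(members):
    """Mean of the records in one cluster, computed per coordinate."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))

def sse(clusters):
    """Equation (2): sum of squared distances of records to their cluster mean."""
    return sum(distance(x, centroid(c)) ** 2 for c in clusters for x in c)
```

For the third convergence criterion, the value returned by `sse` would be compared with the chosen threshold after each iteration.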
4. IMPLEMENTATION AND RESULTS
The algorithm is implemented in Python.
The Anaconda distribution is used for coding and
debugging.
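The paper does not reproduce its source code. The sketch below shows one way a count-based sliding window could feed k-means so that the clusters are refreshed as records arrive. All names (`cluster_window`, `stream_clusters`) and design choices are assumptions of this sketch: the first k records of each window seed the means, a fixed number of iterations is run, and plotting is left out.

```python
import math
from collections import deque

def cluster_window(points, k, iters=20):
    """Small k-means pass over one window; the first k points seed the means."""
    means = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            groups[nearest].append(p)
        # Recompute each mean; an empty cluster keeps its previous mean.
        means = [tuple(sum(c) / len(g) for c in zip(*g)) if g else m
                 for g, m in zip(groups, means)]
    # Sizes come from the last assignment step, taken just before the final update.
    return means, [len(g) for g in groups]

def stream_clusters(stream, k, window_size):
    """Yield (means, cluster sizes) each time a new record enters the window."""
    window = deque(maxlen=window_size)    # count-based sliding window
    for record in stream:
        window.append(record)
        if len(window) >= k:              # at least k records are needed to seed k means
            yield cluster_window(list(window), k)
```

Each yielded pair could be handed to a plotting routine to redraw the live graph, with the cluster sizes serving as a simple indication of traffic density.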
For processing stream data, a stream dataset
must be loaded as input to the program. The
sample dataset is shown in Fig 4.
Fig 4 Sample dataset
After applying preprocessing techniques, the
dataset is used as input for running the program.
The following figures show the results obtained.
Fig 5 Showing cluster information
Fig 6 Live graph consists of various clusters by
color indication-1
Fig 5 shows the cluster information
after processing the data stream. Fig 6 and Fig 7
show live traffic updates in graph format. Fig 8
shows the database table that stores the results
after real-time clustering is done.
Fig 7 Live graph consists of various clusters by
color indication-2
Fig 8 Stream database log table
CONCLUSION
In this paper, the working of the k-Means clustering
algorithm for processing stream data is discussed. It
is implemented in Python. With the
help of the matplotlib library, graphs are drawn
for visualization. Cluster information is shown as
textual data. The cluster densities are shown as
visual graphs. The results obtained are encouraging.
Only numerical attributes are used at
present. There is scope for improving the work by
considering more types of data (categorical, ordinal,
hybrid, etc.).
REFERENCES
[1] Jiawei Han and Micheline Kamber, "Data Mining: Concepts
and Techniques", second edition, pp. 383-531, 2006.
[2] Sobhan Badiozamany, "Real-time data stream clustering
over sliding windows", Digital Comprehensive Summaries of
Uppsala Dissertations from the Faculty of Science and
Technology 1431, ISSN 1651-6214, 2016.
[3] Georg Krempel, Ammar Shaker, "Open Challenges for
Data Stream Mining Research", IEEE Transactions on Big
Data, vol. 16, pp. 1-9, 2013.
[4] Madjid Khalilian, Norwati Mustapha, "Data
Stream Clustering: Challenges and Issues", IMECS, vol. 1, 2010.
[5] T. Soni Madulatha, "Overview of Stream Data
Algorithms", ACIJ, vol. 2, pp. 151-160, 2011.
[6] Xiangliang Zhang, Cyril Furtlehner, Cécile Germain-Renaud,
and Michèle Sebag, "Data Stream Clustering
with Affinity Propagation", IEEE Transactions on
Knowledge and Data Engineering, vol. 26, pp. 1644-1656.
[7] Abhirup Chakraborty (School of Informatics and Computer
Science), Ajit Singh (Department of Electrical and Computer
Engineering), "Parallelizing Windowed
Stream Joins in a Shared-Nothing Cluster", IEEE Transactions, 2013.
[8] Springer Link, "Evolution of real-time traffic applications
based on data stream mining".
[9] Fabio Fumarola, Anna Ciampi, Annalisa Appice, Donato
Malerba, "A Sliding Window Algorithm for Relational
Frequent Patterns Mining from Data Streams".
[10] Rayane El Sibai, "Sampling Algorithms in Data Stream
Environments", International Conference on Digital
Economy (ICDEc), 2016.
[11] Srikanta Tirthapura and David P. Woodruff, "Optimal
Random Sampling from Distributed Stream Revisited".
Medi Manishankar is pursuing his
Masters of Technology at CVR
College of Engineering,
Hyderabad, India, with a Computer
Science and Engineering
specialization. He will complete his
PG in 2018.
Dr. K. Venkateswara Rao has been
working as a professor in the
Department of Computer
Science and Engineering at
CVR College of Engineering,
Hyderabad since 2005.
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 

Recently uploaded (20)

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 

Mining Stream Data using k-Means clustering Algorithm

Medi Manishankar¹, Dr. K. Venkateswara Rao²
¹ PG Scholar in CSE, CVRCE, Hyderabad, India, manishankarmedi@gmail.com
² Professor of CSE, CVRCE, Hyderabad, India, kvenkat.cse@gmail.com

Abstract- Stream data is data that arrives continuously as a time-stamped sequence. It may come from various locations, with varying update rates and high dimensionality. Because stream data flows fast and changes quickly, it may not be possible to process every element, and storage is also a typical issue. Numerous methods, such as random sampling, sliding windows, and histograms, are available to handle stream data. Stream data has many applications, such as traffic data analysis, telecommunications and network data analysis, and stock market data analysis. Data mining techniques such as association analysis, classification, and clustering are generally used for stream data analysis. In this paper, the k-means clustering algorithm is used for mining urban road traffic stream data of a particular city. The stream data is handled using the sliding window technique. The clusters are shown graphically using visualization techniques in Python. The clusters are updated in real time to enable people to understand the behavior of the traffic. The results are described in this paper.

Keywords: Stream data, clustering, sliding windows, k-Means clustering, urban road traffic

1. INTRODUCTION

In this era of big data, data is produced at extremely high speed and in huge volumes. In many cases the data takes the form of data streams, which makes storing the data and querying it offline infeasible. According to the Digital Universe study [3], over 2.8 ZB of data were created and processed in 2012, with a projected 15-fold increase by 2020. This growth in the production of digital data results from our surrounding environments being equipped with a lot of sensors.
The sensors sense information and send various parameters, which need to be analyzed online. In contrast to a stored dataset, a data stream is a large volume [9] of data arriving continuously with time stamps, and it is either unnecessary or impractical to store all of it in memory. For many recent applications [3], the term data stream is more suitable than dataset. Examples of domains producing stream data are telecommunications and networking, stock markets, and e-commerce. In telecommunications, for instance, calling records, web logs, and web page clicks need to be analyzed as they arrive.

Stream data is difficult to handle because it arrives continuously and is temporally ordered, fast-changing, and potentially infinite. So the biggest challenge is to handle the data and perform the required analysis before it expires [3], [4]. Because multiple scans are not possible on stream data, algorithms must produce the best possible output in a single scan. Stream data is also likely to have high dimensionality, since each record in large, multipurpose applications such as satellite-mounted remote sensors and real-time robotic applications carries many attributes, and each attribute may take a wide range of possible values.

International Journal of Research Volume 7, Issue IX, September/2018 ISSN NO: 2236-6124 Page No:390
For effective processing of stream data, new data structures, techniques, and algorithms are needed. Because we do not have an infinite amount of space to store stream data, we often trade off accuracy against storage. That is, we are generally willing to settle for approximate rather than exact solutions. Many data stream algorithms compute an approximate answer within a factor ε of the actual answer, with high probability. Generally, as the approximation factor goes down, the space requirements go up. Some common data structures and techniques used for stream data processing are sliding windows, random sampling, and histograms [1], [10]. Among these techniques, the sliding window is a practical choice because it makes decisions based on recent data instead of running computations on all of the data seen so far, and it runs those computations in main memory. Both properties reduce memory usage and limit storage consumption.

The main problem in data stream mining is discovering knowledge or trends using supervised or unsupervised learning techniques. In unsupervised learning, clustering leads to the discovery of hidden information. Clustering can be defined as the grouping of similar objects based on their properties (attributes). Clustering should be done in such a way that objects are very similar to the other objects in the same cluster and dissimilar to the objects in other clusters.

This paper describes the clustering of urban road traffic stream data. Our objective is to visualize and discover the density of the traffic. This helps users traveling on urban roads to identify busy roads, and helps traffic authorities to keep track of road traffic in real time and make decisions quickly to avoid congestion. The visual graphs are updated in real time so that people can easily understand traffic patterns.

2. STREAM DATA PROCESSING

In the following sections, we discuss stream data processing techniques, applications, challenges, and issues. Data Stream Management Systems are also discussed.

2.1 Techniques for Stream Data Processing

Numerous methods exist for processing stream data: random sampling [1], [10], [11], which takes a sample of data from the entire data stream (reservoir sampling is one technique within random sampling); sliding windows, which perform computations on recent data [2], [9]; and histograms, which partition the data stream into buckets for further processing.

In the sliding window concept, the main idea is that instead of running computations on all the data seen so far, computations and decisions are based on recent data only [1]. More formally, at every time t, a new data element arrives. This element "expires" at time t + w, where w is the window "size" or length. The sliding window model is useful for stocks or sensor networks, where only recent events are important. It also reduces memory requirements because only a small window of data is stored. There are two types of sliding window: count-based and time-based. In the count-based sliding window, we fix the number of records to be processed in one pass. In the time-based sliding window, we specify the time interval after which new records enter the window.

Clustering streaming data is particularly challenging because it involves dynamically merging and splitting evolving clusters over which statistical summaries are maintained in real time as the stream progresses [2].

2.2 Data Stream Processing Applications, Challenges, and Issues

In recent days, many applications [4], [5] produce massive amounts of data.
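For data sources of this volume, the sliding-window model of Section 2.1 is what keeps memory bounded. The following is a minimal sketch of the count-based and time-based variants, not the implementation used in this paper. The class names `CountWindow` and `TimeWindow` are hypothetical, and the time-based version assumes that records arrive with non-decreasing timestamps.

```python
from collections import deque


class CountWindow:
    """Count-based window: keeps only the `size` most recent records."""

    def __init__(self, size):
        # deque(maxlen=...) drops the oldest record automatically when full.
        self.items = deque(maxlen=size)

    def add(self, record):
        self.items.append(record)
        return list(self.items)


class TimeWindow:
    """Time-based window: a record arriving at time t expires at time t + w."""

    def __init__(self, w):
        self.w = w
        self.items = deque()  # (timestamp, record) pairs in arrival order

    def add(self, t, record):
        self.items.append((t, record))
        # Evict every record whose expiry time (arrival + w) has been reached.
        while self.items and self.items[0][0] + self.w <= t:
            self.items.popleft()
        return [r for _, r in self.items]


if __name__ == "__main__":
    cw = CountWindow(size=3)
    for r in [10, 20, 30, 40]:
        window = cw.add(r)
    print(window)  # [20, 30, 40]

    tw = TimeWindow(w=5)
    for t, r in [(0, "a"), (2, "b"), (5, "c"), (8, "d")]:
        window = tw.add(t, r)
    print(window)  # ['c', 'd']
```

In both variants, the memory used depends on the window size rather than on the length of the stream, which is the property the sliding-window model relies on.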
In telecommunications and networking, for example, the central database system keeps track of user calling records, web clicks, network monitoring data, fraudulent calls, call drops, and so on. All of this data needs to be analyzed as it arrives so that decisions can be made and problems rectified in real time. In business, it is very common in the financial or stock market for large numbers of buyers and sellers to exchange equities, so the data needs to be analyzed in real time to inform people about trends in the market. In urban road traffic, large numbers of vehicles travel on the roads. With the arrival of the IoT, automobiles are equipped with advanced sensors that collect data useful for stream data analysis. This analysis helps to discover high-density areas in real-time road traffic and supports decisions on modifying signaling systems so that users can travel without hassle.

Dealing with stream data raises some challenges and issues, because the data comes from complex environments with continuous flow and high dimensionality. Sometimes data may be missing from some records. Three principal challenges [3] are involved: Volume, Velocity, and Volatility. Data preprocessing is important here to remove noisy and redundant data, eliminate outliers, and fill in missing values. Some other challenges [4] should also be mentioned:

• Data uncertainty
• Data type treatment
• Cluster validity
• High dimensionality of data

2.3 Data Stream Management Systems (DSMS)

Processing stream data requires data stream management. Data Stream Management Systems are software systems that manage and support querying of continuous data streams. A DSMS takes the stream data as input, processes it with the help of a query processor, and stores the results in databases.
Data stream management systems [2], [7] have emerged to support a large class of applications such as stock trading, network traffic monitoring, sensor data analysis, and real-time data warehousing. A DSMS processes continuous queries over high-volume, time-varying data streams. In a Continuous Query (CQ) system, a user registers queries specifying their interests over unbounded, streaming data sources. A query engine continuously evaluates the query results as new data arrives from the sources and delivers unbounded streaming outputs to the appropriate users. A core operator in a CQ system is the sliding window join among streams. The following figure depicts the DSMS architecture.

Fig 1 Data Stream Management Systems (DSMS)

The components of the DSMS are as follows. The local database holds summary data, historical data, and indices. The meta-database stores the schema used to express queries. The local database maintains, in main memory, data representing current data summaries as well as historical data, using main-memory indexes for efficient search. The continuous query processor optimizes and executes continuous queries over the data streams by accessing the meta-data and the local database. Continuous queries can span both streaming and local data.
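To make the sliding window join concrete, the sketch below joins two streams on a key over a count-based window of the most recent tuples of each stream. It is an illustration under stated assumptions rather than the operator of any particular DSMS: the class name `SlidingWindowJoin`, the stream labels "left" and "right", and the sample keys and values are all hypothetical.

```python
from collections import deque


class SlidingWindowJoin:
    """Joins two streams on a key over the last `size` tuples of each stream."""

    def __init__(self, size):
        # One bounded window per input stream.
        self.windows = {"left": deque(maxlen=size), "right": deque(maxlen=size)}

    def on_arrival(self, side, key, value):
        other = "right" if side == "left" else "left"
        # Probe the opposite window for tuples carrying the same key.
        # Results are always ordered as (left value, right value).
        matches = [
            (value, v) if side == "left" else (v, value)
            for k, v in self.windows[other]
            if k == key
        ]
        # Store the new tuple; deque(maxlen=...) evicts the oldest one.
        self.windows[side].append((key, value))
        return matches


if __name__ == "__main__":
    join = SlidingWindowJoin(size=2)
    join.on_arrival("left", "road-1", 40)             # nothing to match yet
    print(join.on_arrival("right", "road-1", "jam"))  # [(40, 'jam')]
```

Each arrival is handled on the fly with a single probe of the opposite window. The memory used grows with the window size, and the linear probe shown here would normally be replaced by a hash index when windows are large.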
However, achieving efficiency with a DSMS imposes some requirements that affect scalability:

1. Arriving elements have to be processed on the fly.
2. Data streams have to be processed in a single pass or scan over the data.
3. The processing engine needs to have low latency and high throughput.
4. For high stream rates and large window sizes, a sliding window join might consume a large amount of memory to store tuples.

3. DATA STREAM CLUSTERING

Clustering is the process of grouping similar objects into the same group. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Clustering algorithms, in general, have been categorized into five types: partitioning, hierarchical, density-based, grid-based, and model-based. Because stream data must be processed in a single scan, partitioning-based algorithms, which follow a divide-and-conquer strategy, are the most suitable; one-scan divide-and-conquer approaches have been widely used to cluster data streams [6]. Among partitioning-based techniques, K-Means and K-Medoids [1], [6] are very popular in the data mining world. Recently, Ackermann proposed an online K-Means algorithm that maintains a small sketch of the input using the merge-and-reduce technique. The K-Means algorithm is well suited to numerical data. When the number of variables is large, K-Means is usually computationally faster than hierarchical clustering, provided that k is kept small.

3.1 k-Means Clustering Algorithm for Stream Data Processing

Below are the main steps in the K-Means clustering algorithm.
Algorithm: k-Means Clustering Algorithm
Input: D = {r1, r2, r3, …, rn} and the value K
Output: K clusters
Method:
    Initialize the mean values of the K clusters randomly, say m1, m2, m3, …, mk;
    Repeat
        Assign each record in D to the most similar cluster;
        Calculate the new mean of each cluster;
    Until the convergence criteria are met;

The flow chart in Fig 3 describes the algorithm.

Fig 3 k-Means clustering algorithm flow chart

To cluster data with the k-Means algorithm, the value of k, the number of clusters, must be set first. After k is fixed, the centroid of each cluster is initialized with a random value. For mining urban road traffic, two coordinate values are needed, because a location has two coordinates in a two-dimensional view. So each cluster is given a centroid value C(x, y). Here, a centroid is a data point (imaginary or real) at the center of a cluster. The algorithm first randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. The distance can be calculated by using equation (1).
$$ d(x, m_i) = \sqrt{\sum_{j=1}^{n} \left( x.a_j - m_i.a_j \right)^2} \qquad (1) $$

where n is the number of attributes of an object, x.a_j is the jth attribute of object x, and m_i.a_j is the mean of the jth attribute in cluster i.

After assigning the objects to the clusters, update each cluster centroid C_i(X, Y) by calculating the mean values in the cluster:

$$ C_x = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}, \qquad C_y = \frac{y_1 + y_2 + y_3 + \dots + y_n}{n}, \qquad C_i = (C_x, C_y) $$

where n here denotes the number of objects currently assigned to cluster i.

Now repeat the process of assigning objects to clusters by calculating distances as described above. This process continues until the convergence criteria are met. The convergence criteria can be described in 3 ways:

• There is no change in the mean values of the clusters; the clusters are stabilized.
• Clustering stops after a fixed number of iterations.
• The sum of the squared distances of each record to its "representative mean" in each cluster is less than some threshold value. This sum can be calculated by using equation (2).

$$ E = \sum_{i=1}^{K} \sum_{x \in C_i} d(x, m_i)^2 \qquad (2) $$

where d is the distance from equation (1), K is the number of clusters, x is an object in cluster C_i, and m_i is the mean of that cluster.

4. IMPLEMENTATION AND RESULTS

The algorithm is implemented in Python. Anaconda is used for coding and debugging.

Fig 4 Sample dataset

To process stream data, a stream dataset must be loaded as input to the program. The sample dataset is shown in Fig 4. After preprocessing techniques are applied, the dataset is used as input to the program. The following figures show the results obtained.

Fig 5 Cluster information

Fig 6 Live graph of the clusters, distinguished by color (1)
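As an illustration of how the steps of Section 3.1 can be combined with a count-based sliding window, a minimal sketch follows. It is not the authors' program: the function names, the fixed random seed, the default iteration limit, and the choice to re-cluster the whole window after every arrival are assumptions made only for this example. Records are assumed to be (x, y) coordinate tuples.

```python
import math
import random
from collections import deque


def distance(x, m):
    """Euclidean distance between record x and cluster mean m (equation 1)."""
    return math.sqrt(sum((xj - mj) ** 2 for xj, mj in zip(x, m)))


def k_means(records, k, max_iter=100, seed=0):
    """Cluster the records into k groups; returns (means, labels)."""
    rng = random.Random(seed)
    means = rng.sample(records, k)  # k records picked as the initial means
    labels = []
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest mean.
        labels = [min(range(k), key=lambda i: distance(x, means[i])) for x in records]
        # Update step: recompute each mean; an empty cluster keeps its old mean.
        new_means = []
        for i in range(k):
            members = [x for x, lab in zip(records, labels) if lab == i]
            if members:
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_means.append(means[i])
        if new_means == means:  # convergence: the means no longer change
            break
        means = new_means
    return means, labels


def squared_error(records, means, labels):
    """Sum of squared distances of records to their cluster means (equation 2)."""
    return sum(distance(x, means[lab]) ** 2 for x, lab in zip(records, labels))


def cluster_stream(stream, k, window_size):
    """Re-cluster the most recent window_size records after each arrival."""
    window = deque(maxlen=window_size)
    for record in stream:
        window.append(record)
        if len(window) >= k:  # need at least k records to pick k means
            yield k_means(list(window), k)


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for means, labels in cluster_stream(points, k=2, window_size=4):
        print(means, labels)
```

Re-running k-means on every arrival is the simplest design and keeps the sketch short. For faster streams, the window could be re-clustered only every few records, or the previous means could be reused as the starting point.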
Fig 5 shows the cluster information after processing the data stream. Fig 6 and Fig 7 show live traffic updates in graph form. Fig 8 shows the database table that stores the results after real-time clustering is done.

Fig 7 Live graph of the clusters, distinguished by color (2)

Fig 8 Stream database log table

CONCLUSION

In this paper, the working of the k-Means clustering algorithm for processing stream data is discussed. It is implemented in Python. Graphs are drawn for visualization with the help of the matplotlib library. Cluster information is shown as textual data, and the cluster densities are shown as visual graphs. The results obtained are encouraging. Only numerical attributes are used at present. There is scope for improving the work by considering more types of data (categorical, ordinal, hybrid, etc.).

REFERENCES

[1] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", second edition, pp. 383-531, 2006.
[2] Sobhan Badiozamany, "Real-time data stream clustering over sliding windows", Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1431, ISSN 1651-6214, 2016.
[3] Georg Krempl, Ammar Shaker, "Open Challenges for Data Stream Mining Research", IEEE Transactions on Big Data, vol. 16, pp. 1-9, 2013.
[4] Madjid Khalilian, Norwati Mustapha, "Data Stream Clustering: Challenges and Issues", IMECS, vol. 1, 2010.
[5] T. Soni Madhulatha, "Overview of Stream Data Algorithms", ACIJ, vol. 2, pp. 151-160, 2011.
[6] Xiangliang Zhang, Cyril Furtlehner, Cécile Germain-Renaud, and Michèle Sebag, "Data Stream Clustering with Affinity Propagation", IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 1644-1656.
[7] Abhirup Chakraborty (School of Informatics and Computer Science), Ajit Singh (Department of Electrical and Computer Engineering), "Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster", IEEE Transactions, 2013.
[8] Springer Link, "Evolution of real-time traffic applications based on data stream mining".
[9] Fabio Fumarola, Anna Ciampi, Annalisa Appice, Donato Malerba, "A Sliding Window Algorithm for Relational Frequent Patterns Mining from Data Streams".
[10] Rayane El Sibai, "Sampling Algorithms in Data Stream Environments", International Conference on Digital Economy (ICDEc), 2016.
[11] Srikanta Tirthapura and David P. Woodruff, "Optimal Random Sampling from Distributed Streams Revisited".

Medi Manishankar is pursuing his Master of Technology at CVR College of Engineering, Hyderabad, India, with a specialization in Computer Science and Engineering. He will complete his PG in 2018.

Dr. K. Venkateswara Rao has been working as a professor in the Department of Computer Science and Engineering at CVR College of Engineering, Hyderabad, since 2005.