Scalable Constrained Spectral Clustering
ABSTRACT:
 Constrained spectral clustering (CSC) algorithms have
shown great promise in significantly improving
clustering accuracy by encoding side information into
spectral clustering algorithms. However, existing CSC
algorithms are inefficient in handling moderate and
large datasets. In this paper, we aim to develop a
scalable and efficient CSC algorithm by integrating
sparse coding based graph construction into a
framework called constrained normalized cuts.
EXISTING SYSTEM:
 Data in a wide variety of areas tend to grow to large scales. For
many traditional learning-based data mining algorithms, it is a big
challenge to efficiently mine knowledge from such fast-growing data
as information streams, images and even videos.
 To overcome this challenge, it is important to develop scalable
learning algorithms.
DISADVANTAGES OF EXISTING
SYSTEM:
 A straightforward integration of the constrained normalized cuts
with the sparse coding based graph construction yields a scalable
constrained normalized-cuts problem that existing CSC algorithms
cannot solve efficiently, so they remain inefficient on moderate and
large datasets.
PROPOSED SYSTEM:
 In this project, we develop an efficient and scalable CSC
algorithm that handles moderate and large datasets well. The SCACS
algorithm can be understood as a scalable version of the
well-designed but less efficient algorithm known as Flexible
Constrained Spectral Clustering (FCSC).
 To the best of our knowledge, our algorithm is the first efficient
and scalable version in this area; it is derived by integrating two
recent studies: the constrained normalized cuts and the graph
construction method based on sparse coding.
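One common way to make sparse coding based graph construction scalable is an anchor graph: each point is coded over its few nearest anchor points, and the affinity matrix then factors through the code matrix Z. The sketch below is illustrative only, not the exact formulation of this project; the random anchor sampling and the parameters `m` (anchors), `s` (sparsity) and `sigma` (bandwidth) are all assumptions.

```python
import numpy as np

def anchor_graph(X, m=10, s=3, sigma=1.0, seed=0):
    """Illustrative anchor-based sparse graph construction.
    Each point is coded over its s nearest anchors; the affinity
    matrix factors as W = Z @ diag(1/colsum) @ Z.T (a truly scalable
    implementation would keep it in this factored form)."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), m, replace=False)]
    # squared distances from every point to every anchor
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.zeros((len(X), m))
    for i in range(len(X)):
        nn = np.argsort(d2[i])[:s]          # s nearest anchors
        w = np.exp(-d2[i, nn] / (2 * sigma ** 2))
        Z[i, nn] = w / w.sum()              # sparse, row-stochastic code
    col = Z.sum(0)
    col[col == 0] = 1                       # guard unused anchors
    return Z @ np.diag(1.0 / col) @ Z.T, Z
```

Because the code matrix Z has only s nonzeros per row, the resulting affinity is low-rank and symmetric by construction, which is what makes the subsequent eigendecomposition cheap.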
ADVANTAGES OF PROPOSED
SYSTEM:
 We randomly sample labelled instances from a given input dataset
and derive the side information from their labels. The clustering
accuracy is evaluated by the best matching rate (ACC).
 Let h be the resulting label vector obtained from a clustering
algorithm, and let g be the ground truth label vector.
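With h and g as above, the best matching rate searches over all one-to-one mappings between predicted and ground-truth cluster labels and reports the highest agreement. A minimal sketch (brute force over permutations, fine for small numbers of clusters; for larger k the Hungarian algorithm would be used instead):

```python
from itertools import permutations

def best_matching_rate(h, g):
    """ACC: accuracy under the best one-to-one relabelling of clusters.
    h: predicted label vector; g: ground-truth label vector,
    both with labels in 0..k-1."""
    k = max(max(h), max(g)) + 1
    best = max(
        sum(1 for hi, gi in zip(h, g) if perm[hi] == gi)
        for perm in permutations(range(k))
    )
    return best / len(h)
```

For example, `best_matching_rate([1, 1, 0, 0], [0, 0, 1, 1])` returns 1.0, since swapping the two cluster labels matches the ground truth exactly.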
HARDWARE REQUIREMENTS:
 System : Pentium IV, 2.4 GHz.
 Hard Disk : 40 GB.
 Monitor : 15" VGA colour.
 Mouse : Logitech.
 RAM : 512 MB.
SOFTWARE REQUIREMENTS:
 Operating System : Windows XP / 7
 Coding Language : Java / J2EE
 IDE : Eclipse
 Database : MySQL
Modules with their explanation
 A. Text anomaly detection
 B. Link anomaly detection
 C. Decision factor
A. Text anomaly detection
 A dataset from a social networking site such as Facebook or
Twitter is given to the text anomaly detection module. The next step
is content preprocessing, which consists of the following processes:
 Word extraction: Words are extracted from the text shared by users
over the social networking site.
 Stemming: Variant forms of a word are reduced to a common form.
Stemming is the process of retrieving the root or stem of a word.
 Weight assignment: The words extracted in the previous steps are
assigned weights depending on the prediction made from each word.
 Frequency of words: How many times particular words appear in a
given time period is calculated.
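The preprocessing steps above can be sketched in a few lines. This is only an illustrative pipeline: the suffix-stripping stemmer is deliberately naive (a real system would use a Porter stemmer), and the per-stem `weights` table is a hypothetical stand-in for the prediction-based weight assignment.

```python
import re
from collections import Counter

def naive_stem(word):
    # Crude suffix stripping; a Porter stemmer would be used in practice.
    for suf in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def preprocess(posts, weights=None):
    """Sketch of the pipeline: word extraction, stemming, weighting,
    and frequency counting over a list of text posts.
    Returns {stem: (frequency, weighted score)}."""
    weights = weights or {}            # hypothetical per-stem weights
    freq = Counter()
    for post in posts:
        for w in re.findall(r"[a-z']+", post.lower()):   # word extraction
            freq[naive_stem(w)] += 1                     # stemming + counting
    return {stem: (n, n * weights.get(stem, 1.0)) for stem, n in freq.items()}
```

A time-windowed version would simply run `preprocess` once per window and compare the resulting frequency tables across windows.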
B. Link anomaly detection
 The dataset of the social networking site is also given to the
link anomaly detection module. The step performed in this module is
as follows:
 a) Clustering of vertices having the same features: We cluster
vertices according to shared communication behaviour and build a
profile for each cluster. Individual vertex profiles are also built
from the communication behaviour of each vertex.
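Clustering vertices by their communication-behaviour features can be done with spectral clustering, the core technique of this project. The function below is a minimal generic sketch (Gaussian affinity, symmetric normalized Laplacian, k-means on the leading eigenvectors), not the constrained, sparse-coding-accelerated algorithm the paper proposes; `sigma` and the k-means initialization are assumptions.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=50):
    """Minimal normalized spectral clustering sketch.
    X: (n, d) per-vertex feature vectors; returns n cluster labels."""
    n = len(X)
    # Gaussian affinity matrix on pairwise squared distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    # symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    Dinv = np.diag(1.0 / np.sqrt(W.sum(1)))
    L = np.eye(n) - Dinv @ W @ Dinv
    # rows of the k smallest eigenvectors embed the vertices
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # deterministic farthest-point initialization for k-means
    C = [U[0]]
    for _ in range(1, k):
        dist = np.min(((U[:, None, :] - np.array(C)[None]) ** 2).sum(-1), axis=1)
        C.append(U[dist.argmax()])
    C = np.array(C)
    # Lloyd iterations on the embedded rows
    for _ in range(iters):
        lab = ((U[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([U[lab == j].mean(0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab
```

Cluster profiles can then be built by aggregating the feature vectors of the vertices assigned to each label.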
C. Decision factor:
 The results obtained from the link anomaly module and the text
anomaly module are compared in the decision factor, and the final
anomaly is predicted.
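The comparison rule is not specified further here; one simple possibility is a weighted fusion of the two module scores against a threshold. The weights and threshold below are purely hypothetical placeholders, not values from this project.

```python
def decision_factor(text_score, link_score, w_text=0.5, threshold=0.6):
    """Hypothetical fusion: combine the text and link anomaly scores
    (each assumed in [0, 1]) and flag an anomaly above a threshold."""
    combined = w_text * text_score + (1 - w_text) * link_score
    return combined >= threshold, combined
```

With the defaults, `decision_factor(0.9, 0.8)` fuses the scores to 0.85 and flags an anomaly, while `decision_factor(0.1, 0.2)` does not.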
SDLC
 Spiral Model design
The spiral model has four phases. A software project repeatedly passes
through these phases in iterations called Spirals.
 Identification
 This phase starts with gathering the business requirements in the
baseline spiral. In the subsequent spirals as the product matures,
identification of system requirements, subsystem requirements and
unit requirements are all done in this phase.
 This also includes understanding the system requirements by
continuous communication between the customer and the system
analyst. At the end of the spiral the product is deployed in the
identified market.
 Following is a diagrammatic representation of the spiral model,
listing the activities in each phase.
Design: The design phase starts with the conceptual design in the
baseline spiral and involves architectural design, logical design of
modules, physical product design and final design in the subsequent
spirals.
 Construct or Build
The construct phase refers to the production of the actual software
product in every spiral. In the baseline spiral, when the product is
just thought of and the design is being developed, a POC (Proof of
Concept) is developed in this phase to get customer feedback.
Then, in the subsequent spirals, with higher clarity on requirements
and design details, a working model of the software called a build
is produced with a version number. These builds are sent to the
customer for feedback.
 Evaluation and Risk Analysis
 Risk Analysis includes identifying, estimating, and monitoring
technical feasibility and management risks, such as schedule
slippage and cost overrun. After testing the build, at the end of first
iteration, the customer evaluates the software and provides feedback.
 Based on the customer evaluation, the software development process
enters the next iteration and subsequently follows the linear
approach to implement the feedback suggested by the customer. The
process of iterations along the spiral continues throughout the life
of the software.
Use Case Diagram
Activity Diagram
Data Flow Diagram
Levels 0-2 (diagrams not reproduced; recovered component labels):
Search Word, Location, Key Based Detection, Location Based
Detection, Twitter Trends, Spectral Clustering, Aggregation,
Retweet Count.
ER Diagram
Entities and attributes (diagram not reproduced; recovered labels):
 USER: c_id, name, gender, age, dob, birth_place, birth_state,
current_city, father_name, mother_name, siblings, height, weight,
skin_color, languages, industry
 FILM: c_id, film_name, dor, act
 LOCATION, HASH TAG, RETWEET: id, name, text, hash_tag, location,
retweet count, created at
 Relationships: USER searches LOCATION, HASH TAG and RETWEET (1:N
cardinalities).
Conclusion:
 We have developed a new k-way scalable constrained spectral
clustering algorithm based on a closed-form integration of the
constrained normalized cuts and the sparse coding based graph
construction.
 With less side information, our algorithm obtains significant
improvements in accuracy over the unsupervised baseline.
 With less computational time, our algorithm obtains high
clustering accuracies close to those of the state of the art.
Scalable constrained spectral clustering
