Scalable Constrained Spectral Clustering
ABSTRACT:
 Constrained spectral clustering (CSC) algorithms have
shown great promise in significantly improving
clustering accuracy by encoding side information into
spectral clustering algorithms. However, existing CSC
algorithms are inefficient in handling moderate and
large datasets. In this paper, we aim to develop a
scalable and efficient CSC algorithm by integrating
sparse coding based graph construction into a
framework called constrained normalized cuts.
EXISTING SYSTEM:
 Data in a wide variety of areas tend to grow to large scales. For
many traditional learning-based data mining algorithms, it is a big
challenge to efficiently mine knowledge from such fast-growing data
as information streams, images and even videos.
 To overcome this challenge, it is important to develop scalable
learning algorithms.
DISADVANTAGES OF EXISTING
SYSTEM:
 A straightforward integration of the constrained normalized cuts
with the sparse coding based graph construction yields a scalable
constrained normalized-cuts problem that existing CSC algorithms
cannot solve efficiently, so they remain inefficient on moderate and
large datasets.
PROPOSED SYSTEM:
 In this project, we develop an efficient and scalable CSC
algorithm that handles moderate and large datasets well. The SCACS
algorithm can be understood as a scalable version of the
well-designed but less efficient algorithm known as Flexible
Constrained Spectral Clustering (FCSC).
 To the best of our knowledge, our algorithm is the first efficient
and scalable version in this area; it is derived by integrating two
recent studies: the constrained normalized cuts and the graph
construction method based on sparse coding.
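One common way to make sparse coding based graph construction scalable is an anchor graph: each point is coded over its few nearest anchor points, and the affinity matrix then factors through the code matrix Z. The sketch below is illustrative only, not the exact formulation of this project; the random anchor sampling and the parameters `m` (anchors), `s` (sparsity) and `sigma` (bandwidth) are all assumptions.

```python
import numpy as np

def anchor_graph(X, m=10, s=3, sigma=1.0, seed=0):
    """Illustrative anchor-based sparse graph construction.
    Each point is coded over its s nearest anchors; the affinity
    matrix factors as W = Z @ diag(1/colsum) @ Z.T (a truly scalable
    implementation would keep it in this factored form)."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), m, replace=False)]
    # squared distances from every point to every anchor
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    Z = np.zeros((len(X), m))
    for i in range(len(X)):
        nn = np.argsort(d2[i])[:s]          # s nearest anchors
        w = np.exp(-d2[i, nn] / (2 * sigma ** 2))
        Z[i, nn] = w / w.sum()              # sparse, row-stochastic code
    col = Z.sum(0)
    col[col == 0] = 1                       # guard unused anchors
    return Z @ np.diag(1.0 / col) @ Z.T, Z
```

Because the code matrix Z has only s nonzeros per row, the resulting affinity is low-rank and symmetric by construction, which is what makes the subsequent eigendecomposition cheap.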
ADVANTAGES OF PROPOSED
SYSTEM:
 We randomly sample labelled instances from a given input dataset
and derive the side information from their labels. The clustering
accuracy is evaluated by the best matching rate (ACC).
 Let h be the resulting label vector obtained from a clustering
algorithm, and let g be the ground truth label vector.
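With h and g as above, the best matching rate searches over all one-to-one mappings between predicted and ground-truth cluster labels and reports the highest agreement. A minimal sketch (brute force over permutations, fine for small numbers of clusters; for larger k the Hungarian algorithm would be used instead):

```python
from itertools import permutations

def best_matching_rate(h, g):
    """ACC: accuracy under the best one-to-one relabelling of clusters.
    h: predicted label vector; g: ground-truth label vector,
    both with labels in 0..k-1."""
    k = max(max(h), max(g)) + 1
    best = max(
        sum(1 for hi, gi in zip(h, g) if perm[hi] == gi)
        for perm in permutations(range(k))
    )
    return best / len(h)
```

For example, `best_matching_rate([1, 1, 0, 0], [0, 0, 1, 1])` returns 1.0, since swapping the two cluster labels matches the ground truth exactly.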
HARDWARE REQUIREMENTS:
 System : Pentium IV, 2.4 GHz.
 Hard Disk : 40 GB.
 Monitor : 15" VGA colour.
 Mouse : Logitech.
 RAM : 512 MB.
SOFTWARE REQUIREMENTS:
 Operating System : Windows XP / 7
 Coding Language : Java / J2EE
 IDE : Eclipse
 Database : MySQL
Modules with their explanation
 A. Text anomaly detection
 B. Link anomaly detection
 C. Decision factor
A. Text anomaly detection
 A dataset from a social networking site such as Facebook or
Twitter is given to the text anomaly detection module. The next step
is content preprocessing, which consists of the following processes:
 Word extraction: Words are extracted from the text shared by users
over the social networking site.
 Stemming: Variant forms of a word are reduced to a common form.
Stemming is the process of retrieving the root or stem of a word.
 Weight assignment: The words extracted in the previous steps are
assigned weights depending on the prediction made from each word.
 Frequency of words: How many times particular words appear in a
given time period is calculated.
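The preprocessing steps above can be sketched in a few lines. This is only an illustrative pipeline: the suffix-stripping stemmer is deliberately naive (a real system would use a Porter stemmer), and the per-stem `weights` table is a hypothetical stand-in for the prediction-based weight assignment.

```python
import re
from collections import Counter

def naive_stem(word):
    # Crude suffix stripping; a Porter stemmer would be used in practice.
    for suf in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def preprocess(posts, weights=None):
    """Sketch of the pipeline: word extraction, stemming, weighting,
    and frequency counting over a list of text posts.
    Returns {stem: (frequency, weighted score)}."""
    weights = weights or {}            # hypothetical per-stem weights
    freq = Counter()
    for post in posts:
        for w in re.findall(r"[a-z']+", post.lower()):   # word extraction
            freq[naive_stem(w)] += 1                     # stemming + counting
    return {stem: (n, n * weights.get(stem, 1.0)) for stem, n in freq.items()}
```

A time-windowed version would simply run `preprocess` once per window and compare the resulting frequency tables across windows.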
B. Link anomaly detection
 The dataset of the social networking site is also given to the
link anomaly detection module. The step performed in this module is
as follows:
 a) Clustering of vertices having the same features: We cluster
vertices according to shared communication behaviour and build a
profile for each cluster. Individual vertex profiles are also built
from the communication behaviour of each vertex.
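Clustering vertices by their communication-behaviour features can be done with spectral clustering, the core technique of this project. The function below is a minimal generic sketch (Gaussian affinity, symmetric normalized Laplacian, k-means on the leading eigenvectors), not the constrained, sparse-coding-accelerated algorithm the paper proposes; `sigma` and the k-means initialization are assumptions.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=50):
    """Minimal normalized spectral clustering sketch.
    X: (n, d) per-vertex feature vectors; returns n cluster labels."""
    n = len(X)
    # Gaussian affinity matrix on pairwise squared distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    # symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    Dinv = np.diag(1.0 / np.sqrt(W.sum(1)))
    L = np.eye(n) - Dinv @ W @ Dinv
    # rows of the k smallest eigenvectors embed the vertices
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    # deterministic farthest-point initialization for k-means
    C = [U[0]]
    for _ in range(1, k):
        dist = np.min(((U[:, None, :] - np.array(C)[None]) ** 2).sum(-1), axis=1)
        C.append(U[dist.argmax()])
    C = np.array(C)
    # Lloyd iterations on the embedded rows
    for _ in range(iters):
        lab = ((U[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([U[lab == j].mean(0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab
```

Cluster profiles can then be built by aggregating the feature vectors of the vertices assigned to each label.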
C. Decision factor:
 The results obtained from the link anomaly module and the text
anomaly module are compared in the decision factor, and the final
anomaly is predicted.
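The comparison rule is not specified further here; one simple possibility is a weighted fusion of the two module scores against a threshold. The weights and threshold below are purely hypothetical placeholders, not values from this project.

```python
def decision_factor(text_score, link_score, w_text=0.5, threshold=0.6):
    """Hypothetical fusion: combine the text and link anomaly scores
    (each assumed in [0, 1]) and flag an anomaly above a threshold."""
    combined = w_text * text_score + (1 - w_text) * link_score
    return combined >= threshold, combined
```

With the defaults, `decision_factor(0.9, 0.8)` fuses the scores to 0.85 and flags an anomaly, while `decision_factor(0.1, 0.2)` does not.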
SDLC
 Spiral Model design
The spiral model has four phases. A software project repeatedly passes
through these phases in iterations called Spirals.
 Identification
 This phase starts with gathering the business requirements in the
baseline spiral. In the subsequent spirals as the product matures,
identification of system requirements, subsystem requirements and
unit requirements are all done in this phase.
 This also includes understanding the system requirements by
continuous communication between the customer and the system
analyst. At the end of the spiral the product is deployed in the
identified market.
 Following is a diagrammatic representation of the spiral model,
listing the activities in each phase.
Design: The design phase starts with the conceptual design in the
baseline spiral and involves architectural design, logical design of
modules, physical product design and final design in the subsequent
spirals.
 Construct or Build
The construct phase refers to the production of the actual software
product in every spiral. In the baseline spiral, when the product is
just thought of and the design is being developed, a POC (Proof of
Concept) is developed in this phase to get customer feedback.
Then, in the subsequent spirals, with higher clarity on requirements
and design details, a working model of the software called a build
is produced with a version number. These builds are sent to the
customer for feedback.
 Evaluation and Risk Analysis
 Risk Analysis includes identifying, estimating, and monitoring
technical feasibility and management risks, such as schedule
slippage and cost overrun. After testing the build, at the end of first
iteration, the customer evaluates the software and provides feedback.
 Based on the customer evaluation, the software development process
enters the next iteration and subsequently follows the linear
approach to implement the feedback suggested by the customer. The
process of iterations along the spiral continues throughout the life
of the software.
Use Case Diagram
Activity Diagram
Data Flow Diagram
Levels 0-2 (diagrams not reproduced; recovered component labels):
Search Word, Location, Key Based Detection, Location Based
Detection, Twitter Trends, Spectral Clustering, Aggregation,
Retweet Count.
ER Diagram
Entities and attributes (diagram not reproduced; recovered labels):
 USER: c_id, name, gender, age, dob, birth_place, birth_state,
current_city, father_name, mother_name, siblings, height, weight,
skin_color, languages, industry
 FILM: c_id, film_name, dor, act
 LOCATION, HASH TAG, RETWEET: id, name, text, hash_tag, location,
retweet count, created at
 Relationships: USER searches LOCATION, HASH TAG and RETWEET (1:N
cardinalities).
Conclusion:
 We have developed a new k-way scalable constrained spectral
clustering algorithm based on a closed-form integration of the
constrained normalized cuts and the sparse coding based graph
construction.
 With less side information, our algorithm obtains significant
improvements in accuracy over the unsupervised baseline.
 With less computational time, our algorithm obtains high
clustering accuracies close to those of the state of the art.
Scalable constrained spectral clustering
