Scalable Multiple Clustering of High-dimensional Data
Submitted by: Sahil Kakkar, M. Tech. Candidate, Reg. No. 14011018
Under the supervision of: Mrs. Sunita Beniwal, Assistant Professor, Department of CSE
Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016
Contents
• Introduction
• Equivalence of Multiple and Overlapping Clustering
• Problem Description
• Scalability Issues
• Motivation for Community Detection
• Problem Formulation
• CDCC Algorithm
• Simulation Results
Introduction
• A clustered view of the dataset defines a “partition”, also called a “clustering” in the literature.
• Multiple Clustering rests on the notion that a dataset can admit more than one such partition. These multiple clusterings provide multiple views of the dataset.
A Single Partition
[Figure: a partition based on algorithm types and a partition based on applications. Labels shown: Computer Vision, Anomaly Detection, Pattern Recognition, Profiling (Segmentation), Advertising/Recommenders, Deep Learning, Neural Network, Bayesian, Clustering, Dimensionality Reduction.]
Equivalence of Multiple and Overlapping Clustering
[Figure: Partition A, Partition B, and the overlapping clustering solution.]
Problem Description
• Co-clustering refers to clustering samples and features jointly.
• Formally, given a set O of objects and a set F of features, with |O| = n and |F| = d, a co-cluster C is a triple (O’, F’, R), with O’ ⊆ O, F’ ⊆ F and R ⊆ O × F, that can be described as:
C(O’, F’, R) = {O’ × F’ | o ∈ O’, f ∈ F’, (o, f) ∈ R},
where the relation R defines the structure type of the co-cluster.
• Notice that augmenting the relation R extends the similarity-based measure to a more general and flexible “pattern-based framework”.
Scalability Issues
• A pairwise distance metric makes the running cost of any clustering algorithm prohibitively expensive, for two reasons:
• High dimensionality of the samples
• Large number of sample pairs to compare, C(n, 2) = n(n − 1)/2
• To address the high-dimensionality problem, randomized dimensionality reduction techniques such as MinHash or Weighted Minwise Sampling are used. They summarize the m-dimensional features into a much smaller k-dimensional feature hash (k ≪ m) such that the inter-sample similarity is preserved.
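A minimal sketch of the MinHash idea, assuming each sample is a set of non-negative integer feature ids. The affine hash family, the function names and the seed are illustrative choices and are not taken from the dissertation; `k` plays the role of the reduced signature length.

```python
import random


def minhash_signature(items, k, seed=0):
    """Return a k-value MinHash signature for a non-empty set of integer ids."""
    prime = 2_147_483_647  # Mersenne prime 2^31 - 1; ids are assumed smaller than this
    rng = random.Random(seed)
    # k random affine hash functions h_i(v) = (a * v + b) mod prime
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(k)]
    # For each hash function keep only the minimum hash value over the set
    return [min((a * v + b) % prime for v in items) for a, b in coeffs]


def jaccard(x, y):
    """Exact Jaccard index |x ∩ y| / |x ∪ y| of two sets."""
    return len(x & y) / len(x | y)


def estimated_jaccard(sig_x, sig_y):
    """Fraction of signature positions on which two samples agree."""
    matches = sum(1 for p, q in zip(sig_x, sig_y) if p == q)
    return matches / len(sig_x)


x = set(range(0, 100))
y = set(range(50, 150))
sig_x = minhash_signature(x, k=128)
sig_y = minhash_signature(y, k=128)
print("exact:", jaccard(x, y), "estimated:", estimated_jaccard(sig_x, sig_y))
```

Both signatures must be built with the same seed so that position i uses the same hash function for every sample. The estimate approaches the exact Jaccard index as `k` grows, which is the sense in which similarity is preserved.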
Theory of Locality Sensitive Hashing
• Use the Jaccard Index (𝕁) as the similarity measure for Locality Sensitive Hashing.
• Given the randomized hash h(x) (MH¹ or WMS²) of a sample x, for any pair of samples x and y the probability of a hash collision is given by:
Pr[h(x) = h(y)] = f(𝕁(x, y))
where f is monotonically increasing.
• Comparing samples this way uses a hash table and is linear in the number of samples, so it avoids the C(n, 2) pairwise comparisons.
¹ MH = MinHash signature
² WMS = Weighted Minwise Sampling signature
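The hash-table comparison can be sketched with the common banding scheme. This is an assumption made for illustration, since the slides do not say how signatures are keyed; `lsh_buckets`, `bands` and `rows` are hypothetical names.

```python
from collections import defaultdict


def lsh_buckets(signatures, bands, rows):
    """Group sample ids whose signatures agree on at least one band.

    signatures: dict mapping sample id -> list of bands * rows hash values.
    Each sample is inserted `bands` times, so the cost is linear in the
    number of samples rather than quadratic in sample pairs.
    """
    table = defaultdict(set)
    for sample_id, sig in signatures.items():
        for b in range(bands):
            # The band index is part of the key so different bands never mix
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            table[key].add(sample_id)
    # Only buckets holding more than one sample are candidate clusters
    return [ids for ids in table.values() if len(ids) > 1]


example = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],
    "c": [7, 7, 7, 7],
}
print(lsh_buckets(example, bands=2, rows=2))
```

Samples "a" and "b" share their first band and land in the same bucket, while "c" collides with nobody and is never compared against the others.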
Motivation for Community Detection through Co-clustering
• Community = dense interaction among nodes
= co-cluster formed by a densely connected subgraph
• Current community detection algorithms have at least one of the following problems:
• The size or number of communities must be specified in advance
• Partition-based (communities cannot overlap)
• Fuzzy-membership methods do not scale to large networks
• Community structure depends heavily on the choice of seed nodes
• Limited scalability due to pairwise comparisons
Problem Formulation
• Let S be the set of nodes interacting in a network. For any S’ ⊆ S, a community C can be described as
C(S’, S’, R) = {S’ × S’ ∣ s, t ∈ S’, (s, t) ∈ R}
where the relation R encodes binary connectivity (cliquishness) among the nodes of the community.
[Figure: example network on nodes a, b, c, d, e with two communities, {b, c, d} and {a, e}.]
Adjacency matrix:
    a  b  c  d  e
a   1  0  0  0  1
b   0  1  1  1  0
c   0  1  1  1  0
d   0  1  1  1  0
e   1  0  0  0  1
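One way to see the two communities in the adjacency matrix above is to reorder rows and columns so that the members of each community sit next to each other. The sketch below is illustrative only: the helper `reorder` and the chosen order b, c, d, a, e are assumptions, not part of the slides.

```python
labels = ["a", "b", "c", "d", "e"]
D = [
    [1, 0, 0, 0, 1],  # a
    [0, 1, 1, 1, 0],  # b
    [0, 1, 1, 1, 0],  # c
    [0, 1, 1, 1, 0],  # d
    [1, 0, 0, 0, 1],  # e
]


def reorder(matrix, labels, new_order):
    """Apply the same permutation to the rows and columns of a square matrix."""
    idx = [labels.index(name) for name in new_order]
    return [[matrix[i][j] for j in idx] for i in idx]


# Put the members of {b, c, d} first, then the members of {a, e}
for row in reorder(D, labels, ["b", "c", "d", "a", "e"]):
    print(row)
```

After reordering, the matrix is block-diagonal: a 3 × 3 block of ones for {b, c, d} followed by a 2 × 2 block of ones for {a, e}, each block being one fully connected co-cluster.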
Density relaxation for practical purposes
• In real-life applications, completely connected subgraphs are rare, so a community is defined as a subgraph whose density is at least ρ (a lower bound):
C(S’, S’, R’) = {S’ × S’ ∣ s ∈ S’, ∣R’∣ ≥ ρ ⋅ ∣S’ × S’∣}
where
R’ = {(s, t) ∈ R ∣ s, t ∈ S’}
and the fraction ρ denotes the minimum threshold on density.
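The density condition can be checked directly on a binary adjacency matrix. This is a minimal sketch under the assumption that nodes are addressed by row/column index and that diagonal entries count towards ∣R’∣; the names `density` and `is_community` are hypothetical.

```python
D = [
    [1, 0, 0, 0, 1],  # a
    [0, 1, 1, 1, 0],  # b
    [0, 1, 1, 1, 0],  # c
    [0, 1, 1, 1, 0],  # d
    [1, 0, 0, 0, 1],  # e
]


def density(adjacency, members):
    """Fraction of ones in the members x members block, i.e. |R'| / |S' x S'|."""
    ones = sum(adjacency[i][j] for i in members for j in members)
    return ones / (len(members) ** 2)


def is_community(adjacency, members, rho):
    """True when the block induced by `members` is at least as dense as rho."""
    return density(adjacency, members) >= rho


print(density(D, [1, 2, 3]))            # nodes b, c, d
print(density(D, [0, 1, 2]))            # nodes a, b, c
print(is_community(D, [0, 1, 2], 0.5))
```

With ρ = 1 the test reduces to the fully connected case of the earlier formulation; lowering ρ admits blocks with some missing edges.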
CDCC Algorithm
• Input: Binary adjacency matrix D (n × n)
Tuning parameters: B and K (for dimensionality reduction)
Thresholds: J_min (minimum density) and V_min (minimum size)
• Output: Detected communities
• Three phases:
1. Generating row-clusters
2. Identifying corresponding column-clusters
3. Extracting communities
Phase 1: Generating row-clusters using LSH
[Figure: hash table for row-clusters]
Phase 2: Identifying corresponding column-clusters
[Figure: hash table for column-clusters]
Phase 3: Extracting communities from co-clusters
The union of row-set and column-set
of a co-cluster represents a detected
community.
The overlapping columns (nodes)
account for the overlap among
communities.
[Figure: Row-clusters ∪ Column-clusters = Overlapping Communities]
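The extraction step can be sketched as a set union per co-cluster. This is an illustration under the assumption that each co-cluster arrives as a (row-set, column-set) pair of node ids; `extract_communities` and `overlapping_nodes` are hypothetical names, and the size and density thresholds are left out.

```python
def extract_communities(co_clusters):
    """Turn each (row_set, column_set) co-cluster into one community."""
    communities = []
    for rows, cols in co_clusters:
        community = frozenset(rows) | frozenset(cols)
        if community not in communities:  # skip exact duplicates
            communities.append(community)
    return communities


def overlapping_nodes(communities):
    """Nodes that belong to more than one community."""
    seen, overlap = set(), set()
    for community in communities:
        overlap |= seen & community
        seen |= community
    return overlap


co_clusters = [
    ({"a", "b"}, {"b", "c"}),
    ({"d"}, {"c", "d", "e"}),
]
found = extract_communities(co_clusters)
print(found)
print(overlapping_nodes(found))
```

Node "c" appears in the column-sets of both co-clusters, so it ends up in both communities and is reported as an overlapping node.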
Runtime Complexity
• n = nodes in the graph
• m = number of co-clusters detected (m << n)
• d = edges per node (avg. non-zeros per row)
• Phase 1: O(BKn + dn)
• Phase 2: O(dn)
• Phase 3: O(m)
• Thus, overall complexity = O((BK+d)n)
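The overall bound follows by summing the three phases. A short derivation, assuming B, K and d are treated as parameters independent of n and using m ≪ n from the definitions above:

```latex
\begin{align*}
T(n) &= \underbrace{O(BKn + dn)}_{\text{Phase 1}}
      + \underbrace{O(dn)}_{\text{Phase 2}}
      + \underbrace{O(m)}_{\text{Phase 3}} \\
     &= O(BKn + 2dn + m) \\
     &= O\big((BK + d)\,n\big) \qquad \text{since } m \ll n .
\end{align*}
```

For fixed B, K and d the running time therefore grows linearly with the number of nodes n.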
LFR Benchmark Network Generator
• Generated 11 network graphs, each with 100 nodes and an average degree of 19, with a gradually increasing number of overlapping nodes.
• CPN denotes the average Class membership Per Node.
Simulation results (NMI, F-measure & NVI) on LFR benchmarks
• A network shift (a change in the number of classes) shows up as an abrupt rise in NMI and F-measure and a corresponding fall in NVI.
[Figure: results plot. The number of classes is shown as yellow bars, with its scale on the right-side y-axis.]
Number of overlapping nodes discovered
against benchmark on
• CDCC recovers the most overlapping nodes just after shifts in the community structure, as the newly discovered community compensates for the otherwise degrading quality.
Total running time of CDCC with increasing graph size
• The running-time analysis confirms the theoretical scalability derived earlier, O((BK+d)n).
Thank You
PAPER PUBLISHED
Kakkar S. and Beniwal S., “Discovering Overlapping Community Structure in Networks through Co-clustering”, in
IEEE International Conference on Inventive Computation Technologies, Coimbatore, TN, India, 2016. [Accepted]

Discovering Overlapping Community Structure in Networks through Co-clustering

  • 1. Scalable Multiple Clustering of High-dimensional Data Under the Supervision of Submitted by Mrs. Sunita Beniwal Sahil Kakkar Assistant Professor M. Tech. Candidate Department of CSE Reg. No. 14011018 Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016
  • 2. Contents • Introduction • Equivalence of Multiple and Overlapping Clustering • Problem Description • Scalability Issues • Motivation for Community Detection • Problem Formulation • CDCC Algorithm • Simulation Results Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016
  • 3. Introduction • A clustered view of the dataset defines a “partition”. It is also called a “clustering” in literature. • The idea of Multiple Clustering is based on the notion that there can be more than one such partitions possible in a dataset. These multiple clusterings provide multiple views of the dataset. Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016 A Single Partition
  • 4. Partition based on algorithm-types Partition based on applications Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016 Computer Vision Anamoly DetectionPattern Recognition Profiling (Segementation)Advertising/ Recommendars Deep Learning Neural Network Bayesian Clustering Dimensionality Reduction
  • 5. Equivalence of Multiple and Overlapping Clustering Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016 Overlapping Clustering Solution Partition A Partition B
  • 6. Problem Description • Co-clustering refers to jointly clustering of samples and features alike. • Formally, given a set O of objects and the set F of features, with |O| = n & |F| = d, a co-cluster C is a triple (O’, F’, R), with O’ ⊆ O, F’ ⊆ F & R ⊆ O × F that can be described as: C (O’, F’, R) = {O’ × F’ | o ϵ O’, f ϵ F’, (o, f) ϵ R}, where the relation R defines the structure-type of the co-cluster. • Notice that augmenting the relation R extends the similarity-based measure to a more general & flexible “pattern-based framework”. Dissertation, GJUS&T, Hisar-125001, Haryana, India, 17th September, 2016
  • 7. Scalability Issues • Pair-wise distance computation makes the running cost of any clustering algorithm prohibitively expensive because of two limitations: • High dimensionality of the samples • Large number of sample pairs to compare, ⁿC₂ = n(n−1)/2 • To address high dimensionality, randomized dimensionality reduction techniques such as MinHash or Weighted Minwise Sampling are used. They summarize the m-dimensional features into a much smaller k-dimensional feature hash (k ≪ m) such that inter-sample similarity is approximately preserved.
  • 8. Theory of Locality Sensitive Hashing • The Jaccard Index (𝕁) is used as the similarity measure for Locality Sensitive Hashing. • Given the randomized hash h(x) of a sample x, where h is an MH (MinHash) or WMS (Weighted Minwise Sampling) signature, the probability of a hash collision for any pair of samples x and y is: Pr[h(x) = h(y)] = f(𝕁(x, y)), where f is monotonically increasing. • Comparing samples this way uses a hash table and is linear in the number of samples, so it avoids the ⁿC₂ pairwise comparisons.
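The collision property on this slide can be illustrated with a minimal MinHash sketch. It assumes each sample is a non-empty set of integer feature indices and uses random linear hash functions modulo a prime as a stand-in for random permutations. The names `make_hash_funcs`, `minhash_signature`, `estimate_jaccard`, and `bucket_by_signature` are illustrative and are not taken from the dissertation. For plain MinHash with ideal permutations, f is the identity, so the fraction of matching signature positions estimates 𝕁(x, y).

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1; feature indices must be smaller


def make_hash_funcs(k, seed=0):
    """Draw k random linear hashes x -> (a*x + b) mod PRIME, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(features, hash_funcs):
    """k-dimensional signature: the minimum hash value under each function."""
    return tuple(min((a * x + b) % PRIME for x in features) for a, b in hash_funcs)


def jaccard(x, y):
    """Exact Jaccard index of two sets."""
    return len(x & y) / len(x | y)


def estimate_jaccard(sig_x, sig_y):
    """Fraction of signature positions on which the two samples collide."""
    return sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)


def bucket_by_signature(samples, hash_funcs):
    """One pass over the samples: equal signatures share a hash-table bucket."""
    buckets = defaultdict(list)
    for name, features in samples.items():
        buckets[minhash_signature(features, hash_funcs)].append(name)
    return dict(buckets)


# Hypothetical toy data: y duplicates x, z shares no feature with x.
samples = {
    "x": {1, 2, 3, 4},
    "y": {1, 2, 3, 4},
    "z": {10, 11, 12},
}
funcs = make_hash_funcs(k=64)
sigs = {name: minhash_signature(f, funcs) for name, f in samples.items()}
buckets = bucket_by_signature(samples, funcs)
print(estimate_jaccard(sigs["x"], sigs["y"]), estimate_jaccard(sigs["x"], sigs["z"]))
print(list(buckets.values()))
```

Bucketing on the full signature only groups near-identical samples. Practical LSH schemes usually split the signature into bands so that moderately similar samples also collide; how CDCC uses its parameters B and K for this is not specified on the slides.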
  • 9. Motivation for Community Detection through Co-clustering • Community = dense interaction among nodes = co-cluster formed by a densely connected subgraph • Current community detection algorithms have at least one of the following problems: • The size or number of communities must be specified in advance • Partition-based (communities cannot overlap) • Fuzzy membership methods do not scale to large networks • Community structure depends heavily on the choice of seed nodes • Limited scalability due to pair-wise comparisons
  • 10. Problem Formulation • Let S be the set of nodes interacting in a network. For any S’ ⊆ S, a community C can be described as C(S’, S’, R) = {S’ × S’ ∣ sᵢ, sⱼ ∈ S’, (sᵢ, sⱼ) ∈ R}, where the relation R encodes binary connectivity (cliquishness) among the nodes of the community. Figure: example network on nodes a to e with two communities, {b, c, d} and {a, e}. Adjacency matrix of the example:
        a b c d e
      a 1 0 0 0 1
      b 0 1 1 1 0
      c 0 1 1 1 0
      d 0 1 1 1 0
      e 1 0 0 0 1
  • 11. Density relaxation for practical purposes • In real-life applications, completely connected subgraphs are rare. A community is therefore expressed as a subgraph whose density is at least ρ (lower bound): C(S’, S’, R’) = {S’ × S’ ∣ sᵢ, sⱼ ∈ S’, ∣R’∣ ≥ ρ ⋅ ∣S’ × S’∣}, where R’ = {(sᵢ, sⱼ) ∈ R ∣ sᵢ, sⱼ ∈ S’} and the fraction ρ denotes the minimum threshold on density.
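The density condition ∣R’∣ ≥ ρ ⋅ ∣S’ × S’∣ can be checked directly on the example adjacency matrix from slide 10. The following is a minimal sketch, not the dissertation's implementation; the helper names `density` and `is_community` are assumptions, and the diagonal entries are counted exactly as they appear in the slide's matrix.

```python
NODES = ["a", "b", "c", "d", "e"]

# Adjacency matrix from the Problem Formulation slide (diagonal set to 1).
ADJ = [
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
]


def density(adj, nodes, members):
    """|R'| / |S' x S'|: fraction of ones in the sub-matrix induced by members."""
    idx = [nodes.index(m) for m in members]
    ones = sum(adj[i][j] for i in idx for j in idx)
    return ones / len(idx) ** 2


def is_community(adj, nodes, members, rho):
    """True when the induced subgraph is at least as dense as the threshold rho."""
    return density(adj, nodes, members) >= rho


print(density(ADJ, NODES, ["b", "c", "d"]))            # fully connected block
print(density(ADJ, NODES, ["a", "b", "c"]))            # mixes two communities
print(is_community(ADJ, NODES, ["a", "b", "c"], 0.8))  # rejected at rho = 0.8
```

With ρ = 1 the check reduces to the strict clique definition of slide 10; lowering ρ admits subgraphs with some missing edges.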
  • 12. CDCC Algorithm • Input: binary adjacency matrix D of size n × n; tuning parameters B and K (for dimensionality reduction); thresholds J_min (minimum density) and V_min (minimum size) • Output: detected communities • Three phases: 1. Generating row-clusters 2. Identifying the corresponding column-clusters 3. Extracting communities
  • 13. Phase 1: Generating row-clusters using LSH. Figure: hash-table for row-clusters.
  • 14. Phase 2: Identifying corresponding column-clusters. Figure: hash-table for column-clusters.
  • 15. Phase 3: Extracting Communities from co-clusters • The union of the row-set and the column-set of a co-cluster represents a detected community. • The overlapping columns (nodes) account for the overlap among communities. Figure: Row-clusters ∪ Column-clusters = Overlapping Communities.
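The union step of Phase 3 can be sketched as follows. The co-clusters here are hypothetical (row-set, column-set) pairs invented for illustration, since the slides do not show the actual output of Phases 1 and 2, and the function names `extract_communities` and `overlapping_nodes` are assumptions.

```python
from collections import Counter


def extract_communities(co_clusters):
    """Each community is the union of a co-cluster's row-set and column-set."""
    return [set(rows) | set(cols) for rows, cols in co_clusters]


def overlapping_nodes(communities):
    """Nodes that belong to more than one detected community."""
    counts = Counter(node for community in communities for node in community)
    return {node for node, count in counts.items() if count > 1}


# Hypothetical co-clusters: node "d" appears in the column-set of both.
co_clusters = [
    ({"b", "c"}, {"b", "c", "d"}),
    ({"a", "e"}, {"a", "d", "e"}),
]
communities = extract_communities(co_clusters)
shared = overlapping_nodes(communities)
print(communities, shared)
```

In this toy input, node d is the only member shared by both communities, which mirrors how overlapping columns produce overlapping communities.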
  • 16. Runtime Complexity • n = number of nodes in the graph • m = number of co-clusters detected (m ≪ n) • d = edges per node (average non-zeros per row) • Phase 1: O(BKn + dn) • Phase 2: O(dn) • Phase 3: O(m) • Since m ≪ n, the Phase 3 term is dominated, and the overall complexity is O(BKn + dn) + O(dn) + O(m) = O((BK + d)n)
  • 17. LFR Benchmark Network Generator • Generated 11 network graphs, each with 100 nodes and an average degree of 19, with a gradually increasing number of overlapping nodes. • CPN denotes the average Class membership Per Node.
  • 18. Simulation results (NMI, F-measure & NVI) on LFR benchmarks • A network shift (a change in the number of classes) shows up as an abrupt rise in NMI and F-measure and a corresponding fall in NVI. • The number of classes is shown as yellow bars, with its scale on the right-side y-axis.
  • 19. Number of overlapping nodes discovered against benchmark • CDCC recovers the most overlapping nodes just after shifts in the community structure, as the newly discovered community compensates for the otherwise degrading quality.
  • 20. Total running time of CDCC with increasing graph size • The running-time analysis is consistent with the theoretical complexity derived earlier, O((BK + d)n).
  • 21. Thank You. Paper published: Kakkar S. and Beniwal S., “Discovering Overlapping Community Structure in Networks through Co-clustering”, in IEEE International Conference on Inventive Computation Technologies, Coimbatore, TN, India, 2016. [Accepted]