Size Matters: Cardinality-Constrained
Clustering & Outlier Detection
Napat RUJEERAPAIBOON
Department of Industrial Systems Engineering & Management
National University of Singapore
k–Means Clustering
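• k–means clustering is NP-hard.¹

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{k} \pi_{ij} \, \|\xi_i - \zeta_j\|^2 \\
\text{s.t.} \quad & \pi_{ij} \in \{0, 1\}, \ \zeta_j \in \mathbb{R}^d \\
& \sum_{j=1}^{k} \pi_{ij} = 1 \quad \forall i = 1, \dots, n
\end{aligned}
$$

• In practice, the k–means heuristic² produces solutions quickly.
  Fix $\pi_{ij}$ ⟹ $\zeta_j$ = average of the $\xi_i$ with $\pi_{ij} = 1$
  Fix $\zeta_j$ ⟹ $\pi_{ij} = 1$ if $\zeta_j$ is the closest center to $\xi_i$

¹ Aloise et al. (2009)
² Arthur & Vassilvitskii (2007)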
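The two alternating steps of the k–means heuristic can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' code: the function names, the random initialization, and the toy data are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_heuristic(points, k, iters=100, seed=0):
    """Alternate the two steps until the assignment stops changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: k random data points as initial centers
    labels = None
    for _ in range(iters):
        # Fix the centers: assign each point to its closest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Fix the assignment: move each center to the average of its points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, cost

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, cost = kmeans_heuristic(points, k=2)
```

The procedure only guarantees a local optimum, which is why the suboptimality of its output is unknown in general.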
Challenges
• The k–means heuristic suffers from several shortcomings.
  Technical challenges
    1. Slow worst-case runtime.³
    2. Unknown suboptimality.
  Practical challenges
    1. Skewed clustering.⁴
    2. Sensitivity to outliers.⁵
• Skewed clustering is unfavorable in many applications.
³ Arthur & Vassilvitskii (2006)
⁴ Bennett et al. (2000)
⁵ Chawla & Gionis (2013)
Cardinality Constraints
Implicit cardinality constraints: Market Segmentation, Distributed Computing
Explicit cardinality constraints: Category Management, Vehicle Routing
Outlier detection: Fraud Detection, Medical Diagnosis
Outlier Detection
• k–means (25.21)
• Balanced k–means (54.27)
• Balanced k–means + Outlier detection (1.97)
Cardinality Constraints
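• Introduce a dummy (0th) cluster for outliers.
• $\{n_j\}_{j=1}^{k}$ and $n_0$ denote the sizes of the regular and dummy clusters.
• Cardinality-constrained k–means clustering:

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{k} \pi_{ij} \, \|\xi_i - \zeta_j\|^2 \\
\text{s.t.} \quad & \pi_{ij} \in \{0, 1\}, \ \zeta_j \in \mathbb{R}^d \\
& \sum_{j=0}^{k} \pi_{ij} = 1 \quad \forall i = 1, \dots, n \\
& \sum_{i=1}^{n} \pi_{ij} = n_j \quad \forall j = 0, \dots, k
\end{aligned}
$$

• A remedy for the practical challenges.
• What about the technical challenges?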
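To make the role of the dummy cluster concrete, the sketch below evaluates the cardinality-constrained objective for a given labelling. It assumes that label 0 marks the dummy (outlier) cluster, which carries no cost because the objective only sums over j = 1, ..., k, and that each center is the average of its cluster (the best choice once the assignment is fixed). The function name and the toy data are illustrative.

```python
def cardinality_cost(points, labels, sizes):
    """Objective value of a labelling in {0, ..., k}; label 0 is the dummy cluster.

    sizes[j] is the prescribed size n_j of cluster j (sizes[0] = n_0).
    Returns None if the cardinality constraints are violated.
    """
    k = len(sizes) - 1
    counts = [labels.count(j) for j in range(k + 1)]
    if counts != list(sizes):
        return None
    cost = 0.0
    for j in range(1, k + 1):  # the dummy cluster j = 0 contributes no cost
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            continue
        center = [sum(c) / len(members) for c in zip(*members)]
        cost += sum(sum((x - m) ** 2 for x, m in zip(p, center)) for p in members)
    return cost

# Hypothetical toy data: two pairs of nearby points and one far-away point.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (100.0, 100.0)]
sizes = [1, 2, 2]  # n_0 = 1 outlier, n_1 = n_2 = 2
good = cardinality_cost(points, [1, 1, 2, 2, 0], sizes)
bad = cardinality_cost(points, [1, 1, 1, 2, 0], sizes)  # cluster sizes 3 and 1: infeasible
```

Sending the far-away point to the dummy cluster keeps it from distorting either regular center, which is how the formulation handles sensitivity to outliers.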
Linearization & Convexification

$$
\begin{aligned}
\min \quad & \sum_{i=1}^{n} \sum_{j=1}^{k} \pi_{ij} \, \|\xi_i - \zeta_j\|^2 \\
\text{s.t.} \quad & \pi_{ij} \in \{0, 1\}, \ \zeta_j \in \mathbb{R}^d \\
& \sum_{j=0}^{k} \pi_{ij} = 1 \quad \forall i = 1, \dots, n \\
& \sum_{i=1}^{n} \pi_{ij} = n_j \quad \forall j = 0, \dots, k
\end{aligned}
$$

• The problem is NP-hard.
• Heuristics for biconvex optimization can still be used.⁶
• These heuristics come with no runtime or optimality guarantees.
• We propose a convex relaxation that comes with guarantees.
[Diagram: MILP, Conic Relaxations, Feasible Solution; arrows labeled "enlarge feasible set", "rounding algorithm", and "recovery guarantee"]
⁶ Bennett et al. (2000)
Linearization
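• Equivalent MINLP reformulation:

$$
\begin{aligned}
\min \quad & \sum_{j=1}^{k} \sum_{i,i'=1}^{n} \frac{1}{2 n_j} \, \pi_{ij} \pi_{i'j} \, \|\xi_i - \xi_{i'}\|^2 \qquad [\, = \mathrm{cost}(\pi) \,] \\
\text{s.t.} \quad & \pi_{ij} \in \{0, 1\} \\
& \sum_{j=0}^{k} \pi_{ij} = 1 \quad \forall i = 1, \dots, n \\
& \sum_{i=1}^{n} \pi_{ij} = n_j \quad \forall j = 0, \dots, k
\end{aligned}
\tag{P}
$$

• The products $\pi_{ij} \pi_{i'j}$ can be linearized, resulting in an MILP.
Zha et al. (2001)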
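The reformulation rests on a standard identity: within a cluster, the sum of squared distances to the centroid equals the sum of all pairwise squared distances divided by twice the cluster size. This is what removes the centers from the problem. A small numerical check, with illustrative function names and data:

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_cost(cluster):
    """Sum of squared distances from each point to the cluster average."""
    center = [sum(c) / len(cluster) for c in zip(*cluster)]
    return sum(sq_dist(p, center) for p in cluster)

def pairwise_cost(cluster):
    """Sum over all ordered pairs (i, i') of squared distances, scaled by 1 / (2 n_j)."""
    return sum(sq_dist(p, q) for p in cluster for q in cluster) / (2 * len(cluster))

# Hypothetical cluster of four points in the plane.
cluster = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
lhs = centroid_cost(cluster)
rhs = pairwise_cost(cluster)
```

Because the pairwise form depends on the data only through the distances between points, the objective can be written entirely in terms of the assignment variables.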
Convex Relaxation
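• Apply the following variable transformations:

$$
x_j := 2\pi_j - \mathbf{1}, \qquad D = [d_{ii'}] := \big[\, \|\xi_i - \xi_{i'}\|^2 \,\big].
$$

• The MILP admits an equivalent non-linear reformulation.

$$
\begin{aligned}
\min \quad & \sum_{j=1}^{k} \frac{1}{8 n_j} \, \big\langle (\mathbf{1} + x_j)(\mathbf{1} + x_j)^\top, D \big\rangle \\
\text{s.t.} \quad & x_j \in \{-1, +1\}^n \\
& \sum_{j=0}^{k} x_j = (1 - k)\,\mathbf{1} \\
& \mathbf{1}^\top x_j = 2 n_j - n \quad \forall j = 0, \dots, k
\end{aligned}
$$

• Introduce $M_j = x_j x_j^\top$ to linearize the objective function:

$$
\sum_{j=1}^{k} \frac{1}{8 n_j} \, \big\langle \mathbf{1}\mathbf{1}^\top + \mathbf{1} x_j^\top + x_j \mathbf{1}^\top + M_j, \, D \big\rangle .
$$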
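The right-hand sides of the transformed constraints follow directly from substituting $\pi_j = \tfrac{1}{2}(\mathbf{1} + x_j)$ into (P). A short derivation, using only the definitions on this slide and writing $\pi_j \in \{0,1\}^n$ for the vector of assignments to cluster $j$:

```latex
\begin{aligned}
\sum_{j=0}^{k} \pi_j = \mathbf{1}
  &\iff \sum_{j=0}^{k} \tfrac{1}{2}(\mathbf{1} + x_j) = \mathbf{1}
  \iff \sum_{j=0}^{k} x_j = 2 \cdot \mathbf{1} - (k+1)\,\mathbf{1} = (1-k)\,\mathbf{1}, \\
\mathbf{1}^\top \pi_j = n_j
  &\iff \tfrac{1}{2}\,(n + \mathbf{1}^\top x_j) = n_j
  \iff \mathbf{1}^\top x_j = 2 n_j - n, \\
\frac{1}{2 n_j}\, \pi_j^\top D \, \pi_j
  &= \frac{1}{8 n_j}\, (\mathbf{1} + x_j)^\top D \, (\mathbf{1} + x_j)
  = \frac{1}{8 n_j}\, \big\langle (\mathbf{1} + x_j)(\mathbf{1} + x_j)^\top, D \big\rangle .
\end{aligned}
```

Expanding the outer product in the last line gives the four terms $\mathbf{1}\mathbf{1}^\top + \mathbf{1} x_j^\top + x_j \mathbf{1}^\top + x_j x_j^\top$, of which only the last is non-linear in $x_j$.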
Convex Relaxation
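• In doing so, non-convexity is relegated to the constraints

$$
x_j \in \{-1, +1\}^n, \qquad M_j = x_j x_j^\top .
$$

• The resulting MINLP can be relaxed to an SDP:

$$
\begin{aligned}
\min \quad & \sum_{j=1}^{k} \frac{1}{8 n_j} \, \big\langle \mathbf{1}\mathbf{1}^\top + \mathbf{1} x_j^\top + x_j \mathbf{1}^\top + M_j, \, D \big\rangle \\
\text{s.t.} \quad & x_j \in \mathbb{R}^n, \ M_j \in \mathbb{S}^n \\
& \sum_{j=0}^{k} x_j = (1 - k)\,\mathbf{1} \\
& \mathbf{1}^\top x_j = 2 n_j - n \quad \forall j = 0, \dots, k \\
& \operatorname{diag}(M_j) = \mathbf{1}, \ M_j \succeq x_j x_j^\top \quad \forall j = 0, \dots, k
\end{aligned}
\tag{RSDP}
$$

• Unfortunately, this SDP relaxation is very weak.
Goemans & Williamson (1995)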
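As written, the constraint $M_j \succeq x_j x_j^\top$ is not linear in $(x_j, M_j)$. By the Schur complement it is equivalent to a linear matrix inequality, which is the form SDP solvers typically accept. This is a textbook fact added for orientation, not something stated on the slide:

```latex
M_j \succeq x_j x_j^\top
\quad\Longleftrightarrow\quad
\begin{pmatrix} 1 & x_j^\top \\ x_j & M_j \end{pmatrix} \succeq 0 .
```

Together with $\operatorname{diag}(M_j) = \mathbf{1}$, which is what remains of $x_{ij}^2 = 1$ after relaxation, this implies $x_{ij}^2 \le 1$, so every relaxed $x_{ij}$ lies in $[-1, 1]$.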
Valid Inequalities
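• Strengthen RSDP with valid cuts (VCs).

$$
M_j = \begin{pmatrix} 1 & \gamma \\ \gamma & 1 \end{pmatrix}, \qquad
x_j = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
$$

[Figure: example with $\gamma = 0.4$; three panels labeled $M_j = x_j x_j^\top$, $M_j \succeq x_j x_j^\top$, and $M_j \succeq x_j x_j^\top$ + VCs]
• These cuts are instrumental in proving optimality guarantees.
• As a by-product, we also obtain an LP relaxation, RLP.
Anstreicher (2009)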
Convex Relaxation
Theorem 1
We have $\min \mathrm{RLP} \le \min \mathrm{RSDP} \le \min \mathrm{P}$.
• RLP & RSDP are polynomial-time solvable, whereas P is NP-hard.
Next Steps:
• How to construct $\tilde{\pi}_{ij} \in \{0, 1\}$ feasible in P from $x^\star_{ij}$?
• How to gauge the quality of the obtained $\tilde{\pi}_{ij}$?
Rounding Algorithm
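• Recall that $x^\star_{ij} \in [-1, 1]$:

$$
\tfrac{1}{2} (1 + x^\star_{ij}) \approx \mathbb{P}(\xi_i \in \mathrm{CL}_j)
$$

• Solve the following linear assignment problem (LAP) to retrieve $\tilde{\pi}$.

$$
\begin{aligned}
\max \quad & \sum_{i=1}^{n} \sum_{j=1}^{k} \pi_{ij} \cdot \tfrac{1}{2} (1 + x^\star_{ij}) \\
\text{s.t.} \quad & \pi_{ij} \in \{0, 1\} \\
& \sum_{j=0}^{k} \pi_{ij} = 1 \quad \forall i = 1, \dots, n \\
& \sum_{i=1}^{n} \pi_{ij} = n_j \quad \forall j = 0, \dots, k
\end{aligned}
$$

• The LAP⁷ is an MILP with a totally unimodular constraint matrix (hence tractable).
⁷ Burkard et al. (2009)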
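The rounding step can be illustrated on a toy instance. The sketch below solves the LAP by exhaustive enumeration, which is only viable for a handful of points; a practical implementation would use a dedicated assignment or LP solver instead. The relaxed solution `x_star` is made up for the example (chosen to satisfy the linear constraints of the relaxation), and the function name is an assumption.

```python
from itertools import product

def round_by_lap(x_star, sizes):
    """Pick the labelling with prescribed cluster sizes that maximizes the LAP objective.

    x_star[i][j] lies in [-1, 1] for clusters j = 0..k; sizes[j] = n_j.
    Brute force over all (k + 1)^n labellings: a toy stand-in for a real LAP solver.
    """
    n, k = len(x_star), len(sizes) - 1
    best_score, best_labels = float("-inf"), None
    for labels in product(range(k + 1), repeat=n):
        if any(labels.count(j) != sizes[j] for j in range(k + 1)):
            continue  # violates the cardinality constraints
        # Objective as on the slide: only regular clusters j = 1..k are scored.
        score = sum(0.5 * (1 + x_star[i][j]) for i, j in enumerate(labels) if j >= 1)
        if score > best_score:
            best_score, best_labels = score, list(labels)
    return best_labels

# Hypothetical relaxed solution for n = 5 points and k = 2 regular clusters.
# Each row sums to 1 - k = -1 and column j sums to 2 * n_j - n.
x_star = [
    [-1.0,  0.8, -0.8],
    [-0.8,  0.6, -0.8],
    [-0.8, -0.8,  0.6],
    [-1.0, -0.8,  0.8],
    [ 0.6, -0.8, -0.8],
]
sizes = [1, 2, 2]  # n_0, n_1, n_2
pi_tilde = round_by_lap(x_star, sizes)
```

Here the last point has the weakest membership in both regular clusters, so the rounding sends it to the dummy cluster.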
Optimality Gap
• Our approach gives an a posteriori estimate of the optimality gap.

$$
\min \mathrm{RLP} \le \min \mathrm{RSDP} \le \min \mathrm{P} = \mathrm{cost}(\pi^\star) \le \mathrm{cost}(\tilde{\pi})
$$

• Under the perfect separation condition, the optimality gap vanishes.
Elhamifar et al. (2012)
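The chain of inequalities brackets the unknown optimal value between a relaxation bound and the cost of the rounded solution, so a gap can be reported without knowing the optimum. The slides do not state how the gap is normalized; the sketch below assumes the common convention of dividing by the cost of the feasible solution, and the function name and numbers are illustrative.

```python
def optimality_gap(lower_bound, rounded_cost):
    """Relative gap (UB - LB) / UB, assuming UB = cost of the rounded solution.

    lower_bound is min RLP or min RSDP; rounded_cost is the cost of the
    feasible solution recovered by rounding.
    """
    if lower_bound > rounded_cost + 1e-12:
        raise ValueError("lower bound exceeds the cost of a feasible solution")
    if rounded_cost == 0:
        return 0.0  # a zero-cost feasible solution is optimal, since costs are non-negative
    return (rounded_cost - lower_bound) / rounded_cost

# Hypothetical numbers: bound 97.1 from the relaxation, rounded solution of cost 100.
gap = optimality_gap(97.1, 100.0)
```

A gap of zero certifies that the rounded solution is optimal for P.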
Recovery Guarantee
Theorem 2 (Perfect Separation)
We have

$$
\overbrace{\min \mathrm{RLP} = \min \mathrm{RSDP} = \min \mathrm{P}}^{\text{tightness}}
= \underbrace{\mathrm{cost}(\pi^\star) = \mathrm{cost}(\tilde{\pi})}_{\text{LAP-recovery}} .
$$

Proof Sketch (Tightness):
1. Distinguish the outlier ($M^\star_0$) and regular ($M^\star_j$) clusters.
2. The RLP/RSDP can be solved analytically.
[Figure: $M^\star_0$ and $M^\star_j$]
Proof Sketch (Recovery): steps 1 and 2 as above, followed by
3. The LAP exploits the strong signal in $M^\star_0$ and $M^\star_j$.
[Figure: $M^\star_0$ and $M^\star_j$]
Numerical Experiments I
• Perform cardinality-constrained clustering on classification datasets.⁸
• $\{n_j\}_{j=1}^{k}$: the number of true class occurrences.
• Compare our SDP/LP+LAP approach with the biconvex heuristic.⁹
• Optimality gaps yielded by the LP+LAP approach are ≲ 20.6%.
• Optimality gaps yielded by the SDP+LAP approach are ≲ 2.9%.
• The SDP approach is competitive with the biconvex heuristic.
⁸ UCI repository
⁹ Bennett, K. et al. (2000)
Numerical Experiments II
• Perform outlier detection on the breast cancer dataset.¹⁰
• Vary the number of malignant cancers $n_0$.
• Calculate prediction accuracy, false positives & false negatives.
[Figure: plot with horizontal axis from 0 to 400 and vertical axis from 0 to 100]
≳ 80% prediction accuracy, ≲ 3.3% optimality gap.
¹⁰ UCI repository
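The three reported quantities can be computed from the rounded assignment once the true labels are known. The sketch below assumes that points placed in the dummy cluster are predicted to be malignant, which is how the slide's use of $n_0$ reads; the function name and the toy labels are illustrative and are not taken from the dataset.

```python
def outlier_metrics(predicted_outlier, is_malignant):
    """Accuracy, false positives and false negatives for outlier-based prediction.

    Both arguments are equal-length lists of booleans, one entry per sample.
    """
    pairs = list(zip(predicted_outlier, is_malignant))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)   # flagged, but actually benign
    fn = sum(1 for p, t in pairs if t and not p)   # missed malignant case
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "false_positives": fp,
        "false_negatives": fn,
    }

# Hypothetical labels for five samples.
predicted = [True, True, False, False, False]
truth = [True, False, False, True, False]
metrics = outlier_metrics(predicted, truth)
```

Repeating this for each value of $n_0$ gives one accuracy point per setting.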
References
• Arthur, D. and Vassilvitskii, S.
  K-means++: the advantages of careful seeding.
  Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2007.
• Bennett, K., Bradley, P., Demiriz, A.
  Constrained k–Means Clustering.
  Microsoft Technical Report, 2000.
• Rujeerapaiboon, N., Schindler, K., Kuhn, D., Wiesemann, W.
  Size matters: Cardinality-constrained clustering and outlier detection via conic optimization.
  SIAM Journal on Optimization 29(2), 2019.
napat.rujeerapaiboon@nus.edu.sg
Special thanks to artwork from {Popcorns Arts, Maxim Basinski, Business strategy, Freepik, Prosymbols, Vectors Market, Madebyoliver, Alfredo Hernandez, Devil, Roundicons}@Flaticon.
