IP Clustering as Exploratory Data Analysis
Class project presentation; MATH331 Optimization, Fall '09


  1. Clustering Using Integer Programming
     Katie Kuksenok, Michael Brooks
  2. What are we doing?
     • Investigating IP formulations of the clustering problem
     • Developing a web application for clustering
     • Applying clustering to several datasets
     • Visualizing resulting clusters
  3. The Clustering Problem
     • Input:
       ▫ N entities
       ▫ Each entity has a value for each of M features
     • Goal:
       ▫ Partition the entities “optimally”
     • Issues with clustering:
       ▫ Definition of “optimal”
       ▫ Different feature types don’t work well together
  4. The Clustering Problem
     Exposes high-level properties of data
     Important for exploratory data analysis in data mining
     Finding the optimal clustering is NP-hard
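To give a feel for why exhaustive search is not an option, the sketch below counts the partitions of N entities (the Bell numbers) using the Bell triangle. The function name `bell_numbers` is ours, not something from the slides, and the count only illustrates the size of the search space; it is not a hardness proof.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], where B(n) is the number of ways
    to partition n entities into non-empty clusters (Bell triangle)."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    bells = bell_numbers(30)
    for n in (5, 10, 30):
        print(f"N = {n:2d}: {bells[n]} possible clusterings")
```

Even the 30-entity dataset has far too many candidate partitions to enumerate, which is part of the motivation for an IP formulation.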
  5. Grotschel and Wakabayashi IP
     • Recall the Grotschel-Wakabayashi formulation (clustering wild cats):
     • Variables: xij = 1 if i and j are in the same cluster, 0 otherwise
     • Data: wij = disagreements(i,j) – agreements(i,j)
     • Minimize: SUM { wij xij : over all pairs i < j }
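A minimal sketch of the data and objective on this slide, assuming that an "agreement" means two entities have the same value for a feature and a "disagreement" means they differ. The slides do not say how missing or mixed values are counted, so that rule, the function names, and the toy data are all illustrative assumptions.

```python
from itertools import combinations


def pair_weights(features):
    """w[i, j] = disagreements(i, j) - agreements(i, j) for each pair i < j."""
    weights = {}
    for i, j in combinations(range(len(features)), 2):
        agreements = sum(a == b for a, b in zip(features[i], features[j]))
        disagreements = len(features[i]) - agreements
        weights[i, j] = disagreements - agreements
    return weights


def clustering_cost(weights, labels):
    """Objective value: sum of w[i, j] over pairs in the same cluster,
    i.e. the pairs for which x_ij = 1."""
    return sum(w for (i, j), w in weights.items() if labels[i] == labels[j])


# Hypothetical toy data: 4 entities, 3 yes/no features each.
toy = [("Y", "Y", "N"), ("Y", "Y", "Y"), ("N", "N", "Y"), ("N", "N", "N")]
toy_weights = pair_weights(toy)

if __name__ == "__main__":
    for labels in ([0, 0, 1, 1], [0, 0, 0, 0], [0, 1, 2, 3]):
        print(labels, clustering_cost(toy_weights, labels))
```

Pairs that mostly agree get negative weights, so the minimization is rewarded for grouping them and penalized for grouping entities that mostly disagree.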
  6. Formulation: Constraints
     • Constraints for reflexivity, symmetry, transitivity reduce to the following constraints:
       [equations not preserved in transcript]
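A plausible reconstruction, offered as an assumption rather than a copy of the slide: with one variable per unordered pair, reflexivity and symmetry are implicit, and transitivity becomes three inequalities per triple. This matches the model sizes quoted on the dataset slides (for 30 cats, 435 = C(30,2) variables and 12180 = 3·C(30,3) constraints).

```latex
\begin{aligned}
\min \quad & \sum_{1 \le i < j \le N} w_{ij}\, x_{ij} \\
\text{s.t.} \quad
 & x_{ij} + x_{jk} - x_{ik} \le 1, \\
 & x_{ij} - x_{jk} + x_{ik} \le 1, \\
 & -x_{ij} + x_{jk} + x_{ik} \le 1
   && \text{for all } 1 \le i < j < k \le N, \\
 & x_{ij} \in \{0, 1\}
   && \text{for all } 1 \le i < j \le N.
\end{aligned}
```

Each inequality says that if two of the three pairs in a triple are clustered together, the third pair must be as well.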
  7. Exploratory Clustering Application
     • Currently in development
     • User provides data to website
     • Web server processes data
       ▫ Produces OPL code
       ▫ Runs the model in LPSolve (another MIP solver)
     • Results are returned to the user, with visuals
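The model-generation step might look roughly like the sketch below. Note the swap: the slides say the server produces OPL code, whereas this sketch writes text meant to resemble lp_solve's LP file format (`min:` objective, `;`-terminated constraints, a `bin` declaration). The variable naming scheme, the function names, and the use of three transitivity inequalities per triple are our assumptions, and integer weights are assumed.

```python
from itertools import combinations


def var(i, j):
    """Hypothetical naming scheme for the pair variable x_ij (i < j)."""
    return f"x_{i}_{j}"


def lp_model_text(weights, n):
    """Build an LP-format model string from integer pair weights.

    weights maps (i, j) with i < j to w_ij; n is the number of entities.
    """
    terms = " ".join(
        f"{w:+d} {var(i, j)}" for (i, j), w in sorted(weights.items())
    )
    lines = [f"min: {terms};"]
    # Three transitivity inequalities per triple i < j < k.
    for i, j, k in combinations(range(n), 3):
        a, b, c = var(i, j), var(j, k), var(i, k)
        lines.append(f"{a} + {b} - {c} <= 1;")
        lines.append(f"{a} - {b} + {c} <= 1;")
        lines.append(f"-{a} + {b} + {c} <= 1;")
    # Declare every pair variable as binary.
    names = ", ".join(var(i, j) for i, j in combinations(range(n), 2))
    lines.append(f"bin {names};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(lp_model_text({(0, 1): -1, (0, 2): 3, (1, 2): 1}, 3))
```

Generating the model as plain text keeps the web server independent of the solver: the same weights could be rendered into OPL or another modelling language by changing only this function.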
  8. Preliminary Exploration
     • With working parts of application:
       ▫ can cluster and visualize datasets
       ▫ verified Grotschel/Wakabayashi clustering results
       ▫ used clustering to explore senate voting records
  9. Data Sets (Cats)
     • Cluster 30 varieties of wild cats (from paper)
     • Cats have 14 features
     • Reproduced results from paper
     • Model has 12180 constraints and 435 variables
     • Runs in less than 1 second
     • No branching
  10. Data Sets (Senate Voting Records)
      • 100 senators
      • Senate records for 8 passed bills from this year
      • Model has 485100 constraints and 4950 variables
      • Runs in 1 minute
      • No branching
  11. Data Sets (Senate Voting By State)
      • Grouped senate voting by state
      • If senators for a state disagreed, ‘Maybe’
      • Model has 58800 constraints and 1225 variables
      • Runs in 1.5 seconds
      • No branching
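The model sizes on the three dataset slides can be checked with a little arithmetic, assuming one variable per unordered pair of entities and three transitivity constraints per unordered triple. The function name `model_size` is ours, and 50 states is inferred from 100 senators grouped by state.

```python
from math import comb


def model_size(n):
    """Return (variables, constraints) for n entities:
    C(n, 2) pair variables and 3 * C(n, 3) transitivity constraints."""
    return comb(n, 2), 3 * comb(n, 3)


if __name__ == "__main__":
    for name, n in [("cats", 30), ("senators", 100), ("states", 50)]:
        variables, constraints = model_size(n)
        print(f"{name:9s} N = {n:3d}: "
              f"{variables} variables, {constraints} constraints")
```

The constraint count grows cubically in N, which is consistent with the jump from under a second for 30 cats to about a minute for 100 senators.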
  12. What now?
      • The Application
        ▫ We’d like to put it all together
      • Branch and Bound
        ▫ No real-life datasets so far have required branching
        ▫ We have constructed datasets that did
        ▫ Can we determine under what conditions branching will be required?
  13. This is a guepard (cheetah in French?):
      …any other questions?
  14. Approaches to Clustering
      • What makes a good clustering algorithm?
        ▫ Runtime and space efficiency (when N is large)
        ▫ Ability to quickly incorporate new data
        ▫ Ability to deal with many-dimensional data (when M is large)
        ▫ Optimality (closeness to)
      [Slide labels: “Many algorithms”, “IP”]
