Upcoming SlideShare
×

# IP Clustering as Exploratory Data Analysis

899 views
813 views

Published on

Class project presentation; MATH331 Optimization, Fall '09

Published in: Education, Technology
0 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

• Be the first to like this

Views
Total views
899
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
7
0
Likes
0
Embeds 0
No embeds

No notes for slide

### IP Clustering as Exploratory Data Analysis

1. 1. Clustering Using Integer Programming Katie Kuksenok Michael Brooks
2. 2. What are we doing? • Investigating IP formulations of the clustering problem • Developing a web application for clustering • Applying clustering to several datasets • Visualizing resulting clusters
3. 3. The Clustering Problem • Input: ▫ N entities ▫ Each entity has a value for each of M features • Goal: ▫ Partition the entities “optimally” • Issues with clustering: ▫ Definition of “optimal” ▫ Different feature types don’t work well together
4. 4. The Clustering Problem Exposes high-level properties of data Important for exploratory data analysis in data mining Finding the optimal clustering is NP-Complete
5. 5. Grotschel and Wakabayashi IP • Recall the Grotschel-Wakabayashi formulation (clustering wild cats): • Variables: xij = 1 if i and j are in the same cluster • Data: wij = disagreements(i,j) – agreements(i,j) • Minimize: SUM { wij xij : over all i, j }
6. 6. Formulation: Constraints • Constraints for reflexivity, symmetry, transitivity reduce to the following constraints:
7. 7. Exploratory Clustering Application • Currently in development • User provides data to website • Web server processes data ▫ Produces OPL code ▫ Runs the model in LPSolve (another MIP solver) • Results are returned to the user, with visuals
8. 8. Preliminary Exploration • With working parts of application: • can cluster and visualize datasets • verified Grotschel/Wakabayashi clustering results • used clustering to explore senate voting records
9. 9. Data Sets (Cats) • Cluster 30 varieties of wild cats (from paper) • Cats have 14 features • Reproduced results from paper • Model has 12180 constraints and 435 variables • Runs in less than 1 second • No branching
10. 10. Data Sets (Senate Voting Records) • 100 senators • Senate records for 8 passed bills from this year • Model has 485100 constraints and 4950 variables • Runs in 1 minute • No branching
11. 11. Data Sets (Senate Voting By State) • Grouped senate voting by state • If senators for a state disagreed, ‘Maybe’ • Model has 58800 constraints and 1225 variables • Runs in 1.5 seconds • No branching
12. 12. What now? • The Application ▫ We’d like to put it all together • Branch and Bound ▫ No real-life datasets so far have required branching ▫ We have constructed datasets that did ▫ Can we determine under what conditions branching will be required?
13. 13. This is a guepard (cheetah in French?): …any other questions?
14. 14. Approaches to Clustering • What makes a good clustering algorithm? ▫ Runtime and space efficiency (when N is large) Many algorithms ▫ Ability to quickly incorporate new data ▫ Ability to deal with many-dimensional IP data (when M is large) ▫ Optimality (closeness to)