Mining 3-Clusters in Vertically Partitioned Data
  • Show overlapping clusters with more emphasis and introduce the idea of a lattice for organizing the overlapping clusters.
  • Mark datasets as D1 and D2. Show more columns in D2.
  • Explain using monotonicity ideas.
  • State that width is always anti-monotonic.

Mining 3-Clusters in Vertically Partitioned Data: Presentation Transcript

  • Mining 3-Clusters in Vertically Partitioned Data
      • Faris Alqadah & Raj Bhatnagar
      • University of Cincinnati
  • Outline
    • Introduction to 3-clustering in binary (categorical), vertically partitioned data
    • Proposed cluster quality measure
    • 3-Clu: algorithm for enumerating 3-clusters from two datasets
  • Introduction: traditional clustering, bi-clustering, 3-clustering
  • Why 3-clusters?
    • Find correspondence between bi-clusters of two different datasets
    • Sharpen local clusters with outside knowledge
    • Alternative? “Join datasets then search”
      • Does not capture underlying interactions
      • Inefficient
      • Not always possible
  • Why 3-clusters? Example clusters: <A,1234>, <AB,134>, <AWB,13>, <AY,12>, <AX,24>, <AWBCYZ,1>, <ABDX,4>
  • Formal Definitions: bi-cluster in Di; 3-cluster across D1 and D2; pattern in Di
  • Defining 3-clusters
    • D1 is the “learner”
    • Maximal rectangle of 1's under a suitable permutation in the learner
    • Best correspondence to a rectangle of 1's in D2
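The definition above can be made concrete with a minimal Python sketch. It assumes each dataset is a dict from object id to a set of items, that the learner-side rectangle is obtained with the usual FCA closure, and that the D2 side is simply the set of D2 items shared by the same objects. The slides only say "best correspondence", so that last step is a simplification. The function names and the toy rows are illustrative and do not come from the presentation.

```python
def shared_items(data, objects):
    """Items present in every row of `objects` (all items if `objects` is empty)."""
    result = frozenset().union(*data.values())
    for obj in objects:
        result &= data[obj]
    return frozenset(result)


def supporting_objects(data, items):
    """Objects whose row contains every item in `items`."""
    return frozenset(obj for obj, row in data.items() if items <= row)


def three_cluster(d1, d2, seed_objects):
    """Grow `seed_objects` to a maximal rectangle of 1's in the learner d1,
    then attach the d2 items shared by all objects of that rectangle."""
    items1 = shared_items(d1, seed_objects)
    objects = supporting_objects(d1, items1)  # closure: rectangle is now maximal
    items2 = shared_items(d2, objects)        # assumed notion of "correspondence"
    return items1, items2, objects


# Hypothetical toy data: four objects, items A-D in D1 and W-Z in D2.
D1 = {1: set("ABC"), 2: set("A"), 3: set("AB"), 4: set("AD")}
D2 = {1: set("WYZ"), 2: set("XY"), 3: set("W"), 4: set("X")}

items1, items2, objects = three_cluster(D1, D2, {1, 3})
print(sorted(items1), sorted(items2), sorted(objects))
# prints ['A', 'B'] ['W'] [1, 3]
```

Seeding with objects {2, 4} instead grows the rectangle to all four objects in D1, and those objects share no item in D2, which illustrates how a tall learner-side rectangle can have an empty counterpart.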
  • Cluster Quality Measure
    • Intuition: Maximize number of 1's while also maximizing number of items and objects
    • Trade off between objects and items
      • More items means fewer objects
      • More objects means fewer items
  • Quality Measure
      • Consider bi-clusters in learner alone
    (Figure: bi-clusters C1 and C2 over items I1 and objects O)
    • Which is preferable?
    • User decides
  • Quality Measure
    • Quality measure:
      • Monotonic in both width and height
        • Reflects intuition
      • Balances width and height according to user defined parameter
    • Introduce β
    • β is the amount of width (attributes) the user is willing to trade for a single unit of height (objects)
  • Quality Measure
  • Extending to 3-clusters
    • Utilize same intuition
    • Width of 3-cluster is sum of individual widths
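As a small illustration of the bookkeeping this implies, the sketch below records a 3-cluster and derives its width and height. The class name and fields are hypothetical. Only the rule "width is the sum of the individual widths" comes from the slide; taking height to be the number of shared objects is an assumption based on the earlier "height (objects)" wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreeCluster:
    items1: frozenset   # items from D1
    items2: frozenset   # items from D2
    objects: frozenset  # objects shared by both sides

    @property
    def width(self) -> int:
        # Width of a 3-cluster: sum of the widths in D1 and D2.
        return len(self.items1) + len(self.items2)

    @property
    def height(self) -> int:
        # Height: number of objects (assumed reading of the slides).
        return len(self.objects)


narrow_tall = ThreeCluster(frozenset("A"), frozenset(), frozenset({1, 2, 3, 4}))
wide_short = ThreeCluster(frozenset("AB"), frozenset("W"), frozenset({1, 3}))
print(narrow_tall.width, narrow_tall.height)  # prints 1 4
print(wide_short.width, wide_short.height)    # prints 3 2
```

The two instances show the trade-off from the earlier slides: gaining objects costs items, and the reverse.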
  • Selecting β
    • Larger values yield 3-clusters that are “wide” and “short” in both D1 and D2
      • Cluster key websites popular with large number of democrats and republicans
    • Smaller values produce 3-clusters that are “narrow” and “long”
      • Discover long list of websites utilized by few select democrats and republicans
  • 3-Clu: Our Algorithm
    • The search for 3-clusters is similar to the search for closed itemsets
    • How to formulate the search space?
      • The assumption that objects outnumber attributes may not hold
      • Several possible orderings of the search space
  • Algorithm
  • Algorithm
    • Define search space with primacy to objects
    • Only need to maintain one search tree
    • Mimic closed itemset algorithm with simultaneous pruning of search space
    • Prune with quality measure
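The slides do not reproduce the 3-Clu pseudocode, so the following is only a sketch of what an object-first search tree can look like. It substitutes the well-known Close-by-One enumeration (run over objects rather than items) for the authors' procedure, omits quality-based pruning entirely, and assumes the D2 side of each cluster is the set of D2 items shared by its objects. All names and the toy data are hypothetical.

```python
def enumerate_closed_object_sets(data):
    """Close-by-One over objects: yields every (objects, items) pair that
    forms a maximal rectangle of 1's in `data`, each exactly once."""
    order = sorted(data)
    all_items = frozenset().union(*data.values())

    def supporting(items):
        return frozenset(o for o in order if items <= data[o])

    found = []

    def expand(extent, intent, start):
        found.append((extent, intent))
        for j in range(start, len(order)):
            obj = order[j]
            if obj in extent:
                continue
            new_intent = intent & data[obj]   # items shrink as objects are added
            new_extent = supporting(new_intent)
            earlier = frozenset(order[:j])
            # Canonicity test: reject branches that pull in an earlier object,
            # because that closed set is reached elsewhere in the tree.
            if new_extent & earlier == extent & earlier:
                expand(new_extent, new_intent, j + 1)

    expand(supporting(all_items), all_items, 0)
    return found


def three_clusters(d1, d2):
    """Single search tree over d1; d2 items are attached at each node."""
    d2_items = frozenset().union(*d2.values())
    clusters = []
    for objects, items1 in enumerate_closed_object_sets(d1):
        items2 = d2_items
        for obj in objects:
            items2 &= d2[obj]
        clusters.append((items1, frozenset(items2), objects))
    return clusters


# Hypothetical toy data.
D1 = {1: set("ABC"), 2: set("A"), 3: set("AB"), 4: set("AD")}
D2 = {1: set("WYZ"), 2: set("XY"), 3: set("W"), 4: set("X")}

for items1, items2, objects in three_clusters(D1, D2):
    print(sorted(objects), sorted(items1), sorted(items2))
```

Because the tree is ordered by objects, only one tree is walked and both datasets are consulted at each node, which is the property the slide highlights. The first node emitted has an empty object set and can be filtered out by the caller.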
  • Algorithm
  • Algorithm
    • Cluster quality measure is neither monotone nor anti-monotone in the search space
    • Pruning is still possible
    Is C2 of higher quality?
  • Algorithm
  • Algorithm
    • Pruning rule is very optimistic
    • Can be adjusted with some a-priori information
    • Example β = 0.5
    • x = 2.73, so we cannot prune
      • This assumes w will stay at 15 for 3 more levels
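One way to read the optimistic rule is as a branch-and-bound test: pretend the width stays fixed while the height keeps growing, and prune only if even that best case cannot reach the quality to beat. The sketch below encodes that reading. It assumes that w denotes the current width, that the quality function is non-decreasing in both width and height, and that the caller supplies it, since the transcript does not include the authors' formula. The `area` function is a stand-in used only to make the example runnable.

```python
def can_prune(quality, width, height, levels_left, quality_to_beat):
    """Return True if no descendant of this node can reach `quality_to_beat`.

    Optimistic assumption: width stays at its current value while height
    grows by one object per remaining level. Real descendants can only be
    narrower, so this bound never rejects a node that could still win.
    """
    optimistic = quality(width, height + levels_left)
    return optimistic < quality_to_beat


def area(width, height):
    # Stand-in quality measure, NOT the measure from the presentation.
    return width * height


print(can_prune(area, 15, 2, 3, 100))  # prints True  (best case 15 * 5 = 75)
print(can_prune(area, 15, 2, 3, 60))   # prints False (75 could still beat 60)
```

Tightening `levels_left` with a-priori information, for example a known cap on cluster height, makes the bound less optimistic and lets more branches be cut.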
  • Algorithm Analysis
    • Computational cost: O(|O| * i * N)
      • Only as expensive as enumerating bi-clusters in single dataset
    • Communication cost: O(N)
    • Correctness guaranteed by FCA theory
  • Experimental Results
    • Performance tests
    • Randomly split benchmark datasets CHESS and CONNECT
    • Genetic dataset: Genes, GO terms, Phenotypes
    • Compared to LCM and CHARM
  • Results: Chess, Connect, GO-Pheno
  • Experimental Results
    • Test validity of 3-clusters
    • Randomly partitioned Mushrooms dataset by attributes
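A random vertical partition of this kind can be sketched as follows. The even split, the fixed seed, and the function name are assumptions; the slides do not say how the attributes were divided between the two parts.

```python
import random


def vertical_split(data, seed=0):
    """Randomly assign each attribute to one of two datasets.

    `data` maps object id -> set of attributes. Both returned datasets keep
    every object, so they share the same object ids (vertical partitioning).
    """
    items = sorted(frozenset().union(*data.values()))
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(items)
    half = len(items) // 2
    left, right = frozenset(items[:half]), frozenset(items[half:])
    d1 = {obj: frozenset(row) & left for obj, row in data.items()}
    d2 = {obj: frozenset(row) & right for obj, row in data.items()}
    return d1, d2


# Hypothetical toy rows standing in for a benchmark dataset.
rows = {1: set("ABCWYZ"), 2: set("AXY"), 3: set("ABW"), 4: set("ABDX")}
d1, d2 = vertical_split(rows, seed=42)
```

Each object keeps all of its attributes across the two parts, so clusters found across `d1` and `d2` can be checked against clusters mined from the unsplit data.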
  • Conclusion
    • Novel concept of 3-clusters in vertically partitioned data
    • Introduced quality measure framework for 3-clusters
    • Presented efficient algorithm based on closed itemset mining algorithms, with adaptations:
      • Defined search space to enable simultaneous pruning
      • Incorporated novel pruning method based on cluster quality measure