Representing Features    Most important information of a feature f is how f clusters objects into     groups    f is rep...
Similarity between Tuples   Categorical feature f   Defined as the probability of t1 and t2 having the same value     ...
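The truncated definition above, read together with the speaker note contrasting cosine with sim_f, suggests the categorical tuple similarity is the inner product of the two tuples' value distributions. A minimal sketch (the example vectors are the ones from the note):

```python
# sim_f(t1, t2): probability that t1 and t2 take the same value on feature f,
# computed as the inner product of their value distributions over f's values.

def sim_f(v1, v2):
    return sum(p * q for p, q in zip(v1, v2))

# The note's examples: cosine rates both pairs as identical (cos = 1), but the
# inner product separates a uniform mix from a concentrated distribution.
print(round(sim_f((0.33, 0.33, 0.33), (0.33, 0.33, 0.33)), 2))  # 0.33
print(sim_f((0.0, 1.0, 0.0), (0.0, 1.0, 0.0)))                  # 1.0
```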
Similarity Between Features                                                                                               ...
Computing Feature Similarity                 ObjectsFeature f                     Feature g                     Similarity...
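A brute-force reading of "similarity between features": two features are similar if they induce similar pairwise tuple similarities. The sketch below computes the direct O(N²) double sum; the speaker notes remark that computing this inner product of two N×N matrices directly is too expensive, which is why CrossClus converts it into another form. The helper names are illustrative, not CrossClus's API.

```python
# Pairwise tuple similarities induced by one feature (inner products of the
# tuples' value distributions), then the inner product of two such matrices.

def tuple_sims(vectors):
    n = len(vectors)
    return [[sum(p * q for p, q in zip(vectors[i], vectors[j]))
             for j in range(n)] for i in range(n)]

def feature_inner_product(f_vecs, g_vecs):
    sf, sg = tuple_sims(f_vecs), tuple_sims(g_vecs)
    n = len(f_vecs)
    return sum(sf[i][j] * sg[i][j] for i in range(n) for j in range(n))

v = [(1.0, 0.0), (0.0, 1.0)]
print(feature_inner_product(v, v))  # 2.0
```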
Similarity between Categorical and Numerical Features   V_h · V_f = 2 ΣΣ sim_h(t_i, t_j) · sim_f(t_i, t_j ...
Similarity between Numerical Features    Similarity between numerical features h and g       Suppose objects are ordered...
Roadmap   Overview   Feature Pertinence   Searching for Features   Clustering   Experimental Results
Searching for Pertinent Features     Different features convey different aspects of information           Research area  ...
Heuristic Search for Pertinent Features                              Work-In              Professor          Open-course  ...
Roadmap   Overview   Feature Pertinence   Searching for Features   Clustering   Experimental Results
Clustering with Multi-Relational Feature          Given a set of L pertinent features f1, …, fL, similarity           bet...
Roadmap   Overview   Feature Pertinence   Searching for Features   Clustering   Experimental Results
Experiments: Compare CrossClus with      Baseline: Only use the user specified feature      PROCLUS [Aggarwal, et al. 99...
Clustering Accuracy     To verify that CrossClus captures user’s clustering goal, we define      “accuracy” of clustering...
Measure of Clustering Accuracy      Accuracy              Measured by manually labeled data                   We manual...
CS Dept Dataset (figure: clustering accuracy on the CS Dept dataset)
DBLP Dataset (figure: clustering accuracy on the DBLP dataset)
Scalability w.r.t. Data Size and # of Relations (figures)
CrossClus: Summary              User guidance, even in a very simple form,               plays an important role in multi...
Multirelational Data Mining          Classification over multiple-relations in databases          Clustering over multi-...
Link-Based Clustering: Motivation              Authors         Proceedings                    Conferences               To...
Link-Based Similarities     Two objects are similar if they are linked with same or      similar objects                 ...
Observation 1: Hierarchical Structures   Hierarchical structures often exist naturally among    objects (e.g., taxonomy o...
Observation 2: Distribution of Similarity (figure: distribution of pairwise similarity values among objects)
Our Data Structure: SimTree   Each leaf node                                                     Each non-leaf nodereprese...
Similarity Defined by SimTree           Similarity between two           sibling nodes n1 and n2                          ...
Overview of LinkClus      Initialize a SimTree for objects of each type      Repeat              For each SimTree, upda...
Initialization of SimTrees      Initializing a SimTree         Repeatedly find groups of tightly related nodes, which   ...
(continued)          Finding tight groups               Frequent pattern mining                              Reduced to  ...
Updating Similarities Between Nodes     The initial similarities can seldom capture the relationships      between object...
Aggregation-Based Similarity Computation                                                       0.2                        ...
Computing Similarity with Aggregation                 Average similarity        a:                     b:(0.95,2)         ...
Adjusting SimTree Structures                               n1                        n2                  n3               ...
Complexity  For two types of objects, N in each, and M linkages between them.                                             ...
Empirical Study      Generating clusters using a SimTree         Suppose K clusters are to be generated         Find a ...
Experiment Setup     DBLP dataset: 4170 most productive authors, and 154 well-known      conferences with most proceeding...
Accuracy (figure: clustering accuracy of LinkClus compared with the other approaches)
(continued) (figure: additional accuracy and efficiency results)
Email Dataset      F. Nielsen. Email dataset. http://www.imm.dtu.dk/∼rem/       data/Email-1431.zip      370 emails on c...
Scalability (1)                   Tested on synthetic datasets, with randomly generated                    clusters      ...
Scalability (2)                 Scalability w.r.t. number of objects & clusters                    Each cluster has fixe...
Scalability (3)                       Scalability w.r.t. number of linkages from each object        10000                ...
Multirelational Data Mining          Classification over multiple-relations in databases          Clustering over multi-...
People/Objects Do Share Names      Why distinguish objects with identical names?      Different objects may share the...
(1)                                                                    (2)   Wei Wang, Jiong Yang,      VLDB      1997    ...
Challenges of Object Distinction      Related to duplicate detection, but              Textual similarity cannot be used...
Overview of DISTINCT      Measure similarity between references              Linkages between references                ...
Similarity 1: Link-based Similarity    Indicate the overall strength of connections between two     references    We use...
Example of Random Walk               Publish                                                  Authors           1.0 vldb/w...
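The random-walk example above is truncated here, so the following is a generic, hedged sketch of the idea: each edge carries the probability of stepping to that neighbor, and the probability of a walk is the product of its step probabilities. The node names and edge weights below are illustrative, not the slide's exact graph.

```python
# Random-walk path probability on a tiny publication-style graph.
# edges[a][b] = probability of stepping from node a to node b.

edges = {  # illustrative numbers only
    "wang97": {"vldb": 1.0},
    "vldb":   {"wang97": 0.5, "wang99": 0.5},
}

def walk_prob(path):
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= edges[a][b]  # multiply the step probabilities along the path
    return prob

print(walk_prob(["wang97", "vldb", "wang99"]))  # 0.5
```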
Similarity 2: Neighborhood Similarity      Find the neighbor tuples of each reference              Neighbor tuples withi...
  • P(r) (or N(r)) is the number of positive (or negative) target tuples satisfying a rule r.
  • If a relation has multiple foreign keys but no primary key, it is a relation of relationship.
  • Number of relations/tuples and running time are in log scale.
  • Why not use traditional similarity measures such as cosine? For t1 = (0.33, 0.33, 0.33) and t2 = (0.33, 0.33, 0.33): are they really similar according to course area? cos(t1, t2) = 1, but sim_f(t1, t2) = 0.33. For t1 = (0.0, 1.0, 0.0) and t2 = (0.0, 1.0, 0.0), which really are similar according to course area: cos(t1, t2) = 1 and sim_f(t1, t2) = 1.
  • But, how to compute the similarity efficiently? Computing inner product of two N x N matrices is too expensive.
  • Very expensive to compute directly We convert it into another form
  • The dataset is collected in summer 2003, and thus does not contain information about many new professors.
  • In their paper Jeh and Widom give an efficient approach for approximating SimRank similarities, which only computes the similarity between two objects if they are linked with the same object. However, in this way the similarity between “sigmod” and “vldb” is never computed, neither is the similarity between “Mike” and “Cathy”, “sigmod03” and “vldb04”, etc. This will significantly affect the accuracy of clustering if no similarity is computed between those pairs of highly related objects.
  • We use this simple definition of tightness for efficiency concerns.
  • We use geometric average because they are often in different scales
  • min-sim is the minimum similarity for merging clusters. It controls the trade-off between false positives and false negatives. Please note that accuracy (Jaccard coefficient) is a tough measure. For two clusterings C1 and C2, if 80% of the equivalent pairs of C1 and 80% of those of C2 overlap, the accuracy is only 0.8/(0.8 + 0.2 + 0.2) = 0.667.

    1. Data Mining: Concepts and Techniques — Chapter 9 — 9.3. Multirelational Data Mining. Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2006 Jiawei Han and Micheline Kamber. All rights reserved. Acknowledgements: Xiaoxin Yin. 04/06/12 Data Mining: Principles and Algorithms
    2. Multirelational Data Mining  Classification over multiple relations in databases  Clustering over multi-relations by user guidance  LinkClus: Efficient clustering by exploring the power law distribution  DISTINCT: Distinguishing objects with identical names by link analysis  Mining across multiple heterogeneous data and information repositories  Summary
    3. Outline. Theme: “Knowledge is power, but knowledge is hidden in massive links”  Starting with PageRank and HITS  CrossMine: Classification of multi-relations by link analysis  CrossClus: Clustering over multi-relations by user guidance  More recent work and conclusions
    4. Traditional Data Mining  Work on single “flat” relations (figure: Patient, Doctor, and Contact relations flattened into one table)  Lose information of linkages and relationships  Cannot utilize information of database structures or schemas
    5. Multi-Relational Data Mining (MRDM)  Motivation  Most structured data are stored in relational databases  MRDM can utilize linkage and structural information  Knowledge discovery in multi-relational environments  Multi-relational rules  Multi-relational clustering  Multi-relational classification  Multi-relational linkage analysis  …
    6. Applications of MRDM  e-Commerce: discovering patterns involving customers, products, manufacturers, …  Bioinformatics/medical databases: discovering patterns involving genes, patients, diseases, …  Networking security: discovering patterns involving hosts, connections, services, …  Many other relational data sources  Example: Evidence Extraction and Link Discovery (EELD), a DARPA-funded project that emphasizes multi-relational and multi-database linkage analysis
    7. Importance of Multi-relational Classification (from EELD Program Description)  The objective of the EELD Program is to research, develop, demonstrate, and transition critical technology that will enable significant improvement in our ability to detect asymmetric threats …, e.g., a loosely organized terrorist group.  … Patterns of activity that, in isolation, are of limited significance but, when combined, are indicative of potential threats, will need to be learned.  Addressing these threats can only be accomplished by developing a new level of autonomic information surveillance and analysis to extract, discover, and link together sparse evidence from vast amounts of data sources, in different formats and with differing types and degrees of structure, to represent and evaluate the significance of the related evidence, and to learn patterns to guide the extraction, discovery, linkage and evaluation processes.
    8. MRDM Approaches  Inductive Logic Programming (ILP)  Find models that are coherent with background knowledge  Multi-relational Clustering Analysis  Clustering objects with multi-relational information  Probabilistic Relational Models  Model cross-relational probabilistic distributions  Efficient Multi-Relational Classification  The CrossMine approach [Yin et al., 2004]
    9. Inductive Logic Programming (ILP)  Find a hypothesis that is consistent with background knowledge (training data)  FOIL, Golem, Progol, TILDE, …  Background knowledge  Relations (predicates), tuples (ground facts)
       Training examples: Daughter(mary, ann) +, Daughter(eve, tom) +, Daughter(tom, ann) –, Daughter(eve, ann) –
       Background knowledge: Parent(ann, mary), Parent(ann, tom), Parent(tom, eve), Parent(tom, ian), Female(ann), Female(mary), Female(eve)
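The consistency check described above can be made concrete. A minimal sketch (relation names lowercased) that verifies the hypothesis daughter(X, Y) ← female(X), parent(Y, X) against the slide's ground facts and training examples:

```python
# Background knowledge as ground facts.
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}

# Training examples: (X, Y, label) for daughter(X, Y).
examples = [
    ("mary", "ann", True),   # Daughter(mary, ann) +
    ("eve", "tom", True),    # Daughter(eve, tom)  +
    ("tom", "ann", False),   # Daughter(tom, ann)  -
    ("eve", "ann", False),   # Daughter(eve, ann)  -
]

def rule_predicts(x, y):
    """daughter(X, Y) <- female(X), parent(Y, X)"""
    return x in female and (y, x) in parent

# The hypothesis is consistent: it covers both positives and no negatives.
for x, y, label in examples:
    assert rule_predicts(x, y) == label
```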
    10. Inductive Logic Programming (ILP)  Hypothesis  The hypothesis is usually a set of rules, which can predict certain attributes in certain relations  Daughter(X,Y) ← female(X), parent(Y,X)
    11. FOIL: First-Order Inductive Learner  Find a set of rules consistent with training data  E.g. female(X), parent(Y,X) → daughter(X,Y)  A top-down, sequential covering learner (figure: examples covered by Rule 1, Rule 2, and Rule 3 within all examples)  Build each rule by heuristics  Foil gain – a special type of information gain
    12. ILP Approaches  Top-down approaches (e.g. FOIL): while (enough examples left) { generate a rule; remove examples satisfying this rule }  Bottom-up approaches (e.g. Golem): use each example as a rule; generalize rules by merging rules  Decision tree approaches (e.g. TILDE)
    13. ILP – Pros and Cons  Advantages  Expressive and powerful  Rules are understandable  Disadvantages  Inefficient for databases with complex schemas  Not appropriate for continuous attributes
    14. Automatically Classifying Objects Using Multiple Relations  Why not convert multiple relational data into a single table by joins?  Relational databases are designed by domain experts via semantic modeling (e.g., E-R modeling)  Indiscriminative joins may lose some essential information  One universal relation may hurt efficiency, scalability, and semantics preservation  Our approach to multi-relational classification:  Automatically classifying objects using multiple relations
    15. An Example: Loan Applications  Apply for loan → ask the backend database → approve or not?
    16. The Backend Database. The target relation is Loan; each tuple has a class label indicating whether the loan is paid on time.
        Loan: loan-id, account-id, date, amount, duration, payment
        Account: account-id, district-id, frequency, date
        District: district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96
        Card: card-id, disp-id, type, issue-date
        Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
        Disposition: disp-id, account-id, client-id, type
        Order: order-id, account-id, bank-to, account-to, amount, type
        Client: client-id, birth-date, gender, district-id
        How to make decisions to loan applications?
    17. Roadmap  Motivation  Rule-based Classification  Tuple ID Propagation  Rule Generation  Negative Tuple Sampling  Performance Study
    18. Rule-based Classification  Applicant 1: ever bought a house, lives in Chicago → Approve!  Applicant 2: just applied for a credit card → Reject  …
    19. Rule Generation  Search for good predicates across multiple relations
        Loan Applications (Loan ID, Account ID, Amount, Duration, Decision): (1, 124, 1000, 12, Yes), (2, 124, 4000, 12, Yes), (3, 108, 10000, 24, No), (4, 45, 12000, 36, No)
        Accounts (Account ID, Frequency, Open date, District ID): (128, monthly, 02/27/96, 61820), (108, weekly, 09/23/95, 61820), (45, monthly, 12/09/94, 61801), (67, weekly, 01/01/95, 61822)
        Other relations: Orders, Districts, …
    20. Previous Approaches  Inductive Logic Programming (ILP)  To build a rule  Repeatedly find the best predicate  To evaluate a predicate on relation R, first join the target relation with R  Not scalable because  Huge search space (numerous candidate predicates)  Not efficient to evaluate each predicate  To evaluate a predicate Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, ‘monthly’, ?), first join the loan relation with the account relation  CrossMine is more scalable and more than one hundred times faster on datasets of reasonable size
    21. CrossMine: An Efficient and Accurate Multi-relational Classifier  Tuple-ID propagation: an efficient and flexible method for virtually joining relations  Confine the rule search process in promising directions  Look-one-ahead: a more powerful search strategy  Negative tuple sampling: improve efficiency while maintaining accuracy
    22. Roadmap  Motivation  Rule-based Classification  Tuple ID Propagation  Rule Generation  Negative Tuple Sampling  Performance Study
    23. Tuple ID Propagation
        Loan (Loan ID, Account ID, Amount, Duration, Decision): (1, 124, 1000, 12, Yes), (2, 124, 4000, 12, Yes), (3, 108, 10000, 24, No), (4, 45, 12000, 36, No)
        Account (Account ID, Frequency, Open date → Propagated IDs; Labels): (124, monthly, 02/27/93 → 1, 2; 2+, 0–), (108, weekly, 09/23/97 → 3; 0+, 1–), (45, monthly, 12/09/96 → 4; 0+, 1–), (67, weekly, 01/01/97 → null; 0+, 0–)
        Possible predicates: Frequency = ‘monthly’: 2+, 1–; Open date < 01/01/95: 2+, 0–
         Propagate tuple IDs of the target relation to non-target relations  Virtually join relations to avoid the high cost of physical joins
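The propagation step above can be sketched directly from the slide's two tables: instead of physically joining Loan with Account, only the IDs and class labels of the target (Loan) tuples travel along the foreign key.

```python
# Tuple-ID propagation from the target relation (Loan) to Account,
# using the tables shown on the slide.

loans = [  # (loan_id, account_id, decision)
    (1, 124, "Yes"), (2, 124, "Yes"), (3, 108, "No"), (4, 45, "No"),
]
accounts = [124, 108, 45, 67]  # account_id only; other attributes omitted

propagated = {a: {"ids": [], "pos": 0, "neg": 0} for a in accounts}
for loan_id, acct, decision in loans:
    entry = propagated[acct]
    entry["ids"].append(loan_id)            # propagate the tuple ID
    entry["pos" if decision == "Yes" else "neg"] += 1   # and its label

print(propagated[124])  # {'ids': [1, 2], 'pos': 2, 'neg': 0}
print(propagated[67])   # {'ids': [], 'pos': 0, 'neg': 0}
```

With these counts in place, any predicate on Account (e.g. Frequency = 'monthly') can be scored without ever materializing the join.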
    24. Tuple ID Propagation (cont.)  Efficient  Only propagate the tuple IDs  Time and space usage is low  Flexible  Can propagate IDs among non-target relations  Many sets of IDs can be kept on one relation, which are propagated from different join paths (figure: IDs flow from the target relation through R1, R2, R3 along different join paths)
    25. Roadmap  Motivation  Rule-based Classification  Tuple ID Propagation  Rule Generation  Negative Tuple Sampling  Performance Study
    26. Overall Procedure  Sequential covering algorithm: while (enough target tuples left) { generate a rule; remove positive target tuples satisfying this rule } (figure: positive examples covered by Rules 1, 2, and 3)
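The sequential-covering loop above can be sketched as follows; generate_rule and the min_left threshold are placeholders for CrossMine's actual rule search and stopping condition.

```python
def sequential_covering(positives, negatives, generate_rule, min_left=1):
    """While enough target tuples are left: generate a rule, then remove
    the positive tuples it covers."""
    rules, remaining = [], set(positives)
    while len(remaining) >= min_left:
        rule = generate_rule(remaining, negatives)
        if rule is None or not any(rule(t) for t in remaining):
            break
        rules.append(rule)
        remaining -= {t for t in remaining if rule(t)}  # remove covered positives
    return rules

# Toy run: two hypothetical "rules" that cover even, then odd, tuple IDs.
candidates = [lambda t: t % 2 == 0, lambda t: t % 2 == 1]
rules = sequential_covering(range(1, 11), set(),
                            lambda rem, neg: candidates.pop(0) if candidates else None)
print(len(rules))  # 2
```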
    27. Rule Generation  To generate a rule: while (true) { find the best predicate p; if foil-gain(p) > threshold then add p to the current rule else break } (figure: the rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, separating positive from negative examples)
    28. Evaluating Predicates  All predicates in a relation can be evaluated based on propagated IDs  Use foil gain to evaluate predicates  Suppose the current rule is r. For a predicate p, foil-gain(p) = P(r+p) × [ log( P(r+p) / (P(r+p) + N(r+p)) ) − log( P(r) / (P(r) + N(r)) ) ]  Categorical attributes  Compute foil gain directly  Numerical attributes  Discretize with every possible value
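The foil-gain formula translates directly into code. In the sketch below, the example counts come from the tuple-ID-propagation slide: the predicate Frequency = 'monthly' covers 2 positive and 1 negative of the 2+/2– target tuples. Natural log is an assumption here; the slide only says "log", and the base merely rescales the gain.

```python
import math

def foil_gain(p_r, n_r, p_rp, n_rp):
    """Gain of extending rule r (covering p_r positives, n_r negatives)
    with predicate p, giving rule r+p (covering p_rp, n_rp)."""
    if p_rp == 0:
        return 0.0
    before = math.log(p_r / (p_r + n_r))
    after = math.log(p_rp / (p_rp + n_rp))
    return p_rp * (after - before)

# Frequency = 'monthly': 2+, 1- out of the 2+, 2- propagated target tuples.
print(round(foil_gain(2, 2, 2, 1), 3))  # 0.575
```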
    29. Rule Generation  Start from the target relation  Only the target relation is active  Repeat  Search in all active relations  Search in all relations joinable to active relations  Add the best predicate to the current rule  Set the involved relation to active  Until  The best predicate does not have enough gain  Current rule is too long
    30. Rule Generation: Example (figure: the backend database schema from slide 16; the first predicate is found near the target relation Loan, the second predicate in a relation joinable to an active relation; the range of search expands as each best predicate is added to the rule)
    31. Look-one-ahead in Rule Generation  Two types of relations: entity and relationship  Often cannot find useful predicates on relations of relationship (figure: no good predicate on the relationship relation next to the target relation)  Solution of CrossMine:  When propagating IDs to a relation of relationship, propagate one more step to the next relation of entity.
    32. Roadmap  Motivation  Rule-based Classification  Tuple ID Propagation  Rule Generation  Negative Tuple Sampling  Performance Study
    33. Negative Tuple Sampling  A rule covers some positive examples  Positive examples are removed after being covered  After generating many rules, there are far fewer positive examples than negative ones (figure: scattered + and − examples, with few + remaining)
    34. Negative Tuple Sampling (cont.)  When there are many more negative examples than positive ones  Cannot build good rules (low support)  Still time consuming (large number of negative examples)  Sample the negative examples  Improve efficiency without affecting rule quality (figure: the negatives after sampling)
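One way to realize the sampling step above: keep all remaining positives and cap the negatives at a fixed multiple of them. The max_ratio parameter is illustrative, not CrossMine's exact setting.

```python
import random

def sample_negatives(positives, negatives, max_ratio=10, seed=0):
    """Keep at most max_ratio negatives per remaining positive."""
    limit = max_ratio * len(positives)
    if len(negatives) <= limit:
        return negatives          # already balanced enough; keep them all
    rng = random.Random(seed)     # fixed seed for reproducibility
    return rng.sample(negatives, limit)

negs = sample_negatives([1, 2, 3], list(range(1000)), max_ratio=10)
print(len(negs))  # 30
```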
    35. Roadmap  Motivation  Rule-based Classification  Tuple ID Propagation  Rule Generation  Negative Tuple Sampling  Performance Study
    36. Synthetic datasets:  Scalability w.r.t. number of relations  Scalability w.r.t. number of tuples (figures)
    37. Real Dataset  PKDD Cup 99 dataset – Loan Application (accuracy; time per fold): FOIL 74.0%, 3338 sec; TILDE 81.3%, 2429 sec; CrossMine 90.7%, 15.3 sec  Mutagenesis dataset (4 relations): FOIL 79.7%, 1.65 sec; TILDE 89.4%, 25.6 sec; CrossMine 87.7%, 0.83 sec
    38. Multi-Relational Classification: Summary  Classification across multiple relations  Interesting pieces of information often lie across multiple relations  It is desirable to mine across multiple interconnected relations  New methodology in CrossMine (for classification model building)  ID (and class label) propagation leads to efficiency and effectiveness (by preserving semantics) in CrossMine  Rule generation and negative tuple sampling lead to further improved performance  Our performance study shows orders-of-magnitude speedup and high accuracy compared with the relational mining approaches  Future work: classification in heterogeneous relational databases
    39. 39. Multirelational Data Mining  Classification over multiple-relations in databases  Clustering over multi-relations by user-guidance  LinkClus: Efficient clustering by exploring the power law distribution  Distinct: Distinguishing objects with identical names by link analysis  Mining across multiple heterogeneous data and information repositories  Summary04/06/12 Data Mining: Principles and Algorithms 39
    40. 40. Multi-Relational and Multi-DB Mining  Classification over multiple-relations in databases  Clustering over multi-relations by User-Guidance  Mining across multi-relational databases  Mining across multiple heterogeneous data and information repositories  Summary04/06/12 Data Mining: Principles and Algorithms 40
    41. 41. Motivation 1: Multi-Relational Clustering Work-In Professor Open-course Course person name course course-id group office semester name position instructor area Publication Publish Advise title Group author professor year name title student conf area degree Student Register name student Target of office clustering course position semester unit grade  Traditional clustering works on a single table  Most data is semantically linked with multiple relations  Thus we need information in multiple relations04/06/12 Data Mining: Principles and Algorithms 41
    42. 42. Motivation 2: User-Guided Clustering Work-In Professor Open-course Course person name course course-id group office semester name position instructor area Publish Publication Group Advise author title professor name title year student area conf degree Register User hint student Student course name semester Target of office unit clustering position grade  User usually has a goal of clustering, e.g., clustering students by research area  User specifies his clustering goal to CrossClus04/06/12 Data Mining: Principles and Algorithms 42
    43. 43. Comparing with Classification User hint  User-specified feature (in the form of attribute) is used as a hint, not class labels  The attribute may contain too many or too few distinct values  E.g., a user may want to cluster students into 20 clusters instead of 3  Additional features need to be All tuples for clustering included in cluster analysis04/06/12 Data Mining: Principles and Algorithms 43
44. Comparing with Semi-Supervised Clustering
    - Semi-supervised clustering [Wagstaff et al.'01; Xing et al.'02]: the user provides a training set consisting of "similar" and "dissimilar" pairs of objects
    - User-guided clustering: the user specifies an attribute as a hint, and more relevant features are found for clustering
    [Diagram contrasting the two: pairwise constraints among all tuples vs. an attribute hint over all tuples]
45. Semi-Supervised Clustering (cont.)
    - Much information (in multiple relations) is needed to judge whether two tuples are similar
    - A user may not be able to provide a good training set
    - It is much easier for a user to specify an attribute as a hint, such as a student's research area
    [Example: tuples to be compared — (Tom Smith, SC1211, TA) vs. (Jane Chang, BI205, RA) — alongside a user hint attribute]
46. CrossClus: An Overview
    - Uses a new type of multi-relational feature for clustering
    - Measures similarity between features by how they cluster objects into groups
    - Uses a heuristic method to search for pertinent features
    - Uses a k-medoids-based algorithm for clustering
47. Roadmap
    - Overview
    - Feature Pertinence
    - Searching for Features
    - Clustering
    - Experimental Results
48. Multi-Relational Features
    A multi-relational feature is defined by:
    - A join path, e.g., Student → Register → OpenCourse → Course
    - An attribute, e.g., Course.area
    - (For a numerical feature) an aggregation operator, e.g., sum or average
    Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null] — the areas of the courses of each student:

        Tuple | Areas of courses (DB, AI, TH) | Values of feature f (DB, AI, TH)
        t1    | 5, 5, 0                       | 0.5, 0.5, 0
        t2    | 0, 3, 7                       | 0,   0.3, 0.7
        t3    | 1, 5, 4                       | 0.1, 0.5, 0.4
        t4    | 5, 0, 5                       | 0.5, 0,   0.5
        t5    | 3, 3, 4                       | 0.3, 0.3, 0.4

    Numerical feature, e.g., the average grade of each student:
    h = [Student → Register, Register.grade, average], e.g., h(t1) = 3.5
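A minimal sketch of how the categorical feature values in the table above arise: the per-area course counts reached through the join path are normalized into proportion vectors (the helper names are ours, not CrossClus API):

```python
# Sketch: turn per-area course counts (from the join path
# Student -> Register -> OpenCourse -> Course) into the proportion
# vectors used as categorical feature values. Data is from the slide.

def feature_values(counts):
    """Normalize per-area counts into a proportion vector."""
    total = sum(counts)
    return [c / total for c in counts]

course_counts = {            # areas: DB, AI, TH
    "t1": [5, 5, 0],
    "t2": [0, 3, 7],
    "t3": [1, 5, 4],
    "t4": [5, 0, 5],
    "t5": [3, 3, 4],
}

f = {t: feature_values(c) for t, c in course_counts.items()}
print(f["t1"])  # [0.5, 0.5, 0.0]
```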
49. Representing Features
    - The most important information about a feature f is how f clusters objects into groups
    - f is represented by the similarities between every pair of objects as indicated by f
    [3-D plot of the similarity vector V_f: the similarity between each pair of tuples indicated by f; the horizontal axes are the tuple indices and the vertical axis is the similarity. V_f can be considered a vector of N × N dimensions.]
50. Similarity between Tuples
    - Categorical feature f: defined as the probability of t1 and t2 having the same value — e.g., when each of them selects another course, the probability that they select courses of the same area:

        sim_f(t1, t2) = Σ_{k=1}^{L} f(t1).p_k · f(t2).p_k

      e.g., sim_f(t1, t2) = 0.5×0.3 + 0.5×0.3 = 0.3
    - Numerical feature h:

        sim_h(t1, t2) = 1 − |h(t1) − h(t2)| / σ_h   if |h(t1) − h(t2)| < σ_h, and 0 otherwise
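The two similarity measures can be sketched as follows (the function names are hypothetical; the categorical call reproduces the slide's 0.5×0.3 + 0.5×0.3 = 0.3 arithmetic):

```python
# Hypothetical helpers for the two tuple-similarity measures above.

def sim_categorical(p1, p2):
    """Probability two tuples take the same value: dot product of
    their feature-value proportion vectors."""
    return sum(a * b for a, b in zip(p1, p2))

def sim_numerical(h1, h2, sigma):
    """1 - |h1 - h2| / sigma when the gap is below sigma, else 0."""
    gap = abs(h1 - h2)
    return 1.0 - gap / sigma if gap < sigma else 0.0

# The slide's arithmetic: 0.5*0.3 + 0.5*0.3 = 0.3
print(sim_categorical([0.5, 0.5, 0.0], [0.3, 0.3, 0.4]))
print(sim_numerical(3.5, 3.2, 1.0))  # grades within sigma of each other
```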
51. Similarity between Features
    Values of features f (course) and g (group):

        Tuple | f: DB, AI, TH  | g: Info sys, Cog sci, Theory
        t1    | 0.5, 0.5, 0    | 1,   0,   0
        t2    | 0,   0.3, 0.7  | 0,   0,   1
        t3    | 0.1, 0.5, 0.4  | 0,   0.5, 0.5
        t4    | 0.5, 0,   0.5  | 0.5, 0,   0.5
        t5    | 0.3, 0.3, 0.4  | 0.5, 0.5, 0

    Similarity between two features: the cosine similarity of their similarity vectors:

        sim(f, g) = (V_f · V_g) / (|V_f| |V_g|)

    [3-D plots of V_f and V_g]
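The cosine similarity between two features can be sketched directly from the table above: build each feature's N × N pairwise-similarity vector, then take the cosine (helper names are ours, not CrossClus API):

```python
# Sketch: feature-to-feature similarity as the cosine of the two
# N x N pairwise-similarity vectors. f and g are the slide's tables.

import math

f = [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7], [0.1, 0.5, 0.4],
     [0.5, 0.0, 0.5], [0.3, 0.3, 0.4]]
g = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]

def sim_vector(feat):
    """Flattened N x N vector of pairwise tuple similarities."""
    n = len(feat)
    return [sum(a * b for a, b in zip(feat[i], feat[j]))
            for i in range(n) for j in range(n)]

def feature_sim(f1, f2):
    """Cosine similarity of the two similarity vectors."""
    v1, v2 = sim_vector(f1), sim_vector(f2)
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    return dot / norm

print(round(feature_sim(f, g), 3))
```

A feature is maximally similar to itself, so `feature_sim(f, f)` is 1 up to rounding.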
52. Computing Feature Similarity
    Object similarities are hard to compute (N² pairs), but feature-value similarities are easy:

        V_f · V_g = Σ_{i=1}^{N} Σ_{j=1}^{N} sim_f(t_i, t_j) · sim_g(t_i, t_j)
                  = Σ_{k=1}^{l} Σ_{q=1}^{m} sim(f_k, g_q)²

    where the similarity between feature values f_k and g_q w.r.t. the objects is

        sim(f_k, g_q) = Σ_{i=1}^{N} f(t_i).p_k · g(t_i).p_q

    The similarity between each pair of feature values (e.g., DB–Info sys, AI–Cog sci, TH–Theory) can be computed in one scan of the data.
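A quick numerical check of the identity above (reusing the slide's f and g tables; the variable names are ours): the O(N²) object-pair sum and the one-scan feature-value sum agree.

```python
# Verify V_f . V_g computed over object pairs (hard way) equals the
# sum of squared feature-value similarities (easy way, one data scan).

f = [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7], [0.1, 0.5, 0.4],
     [0.5, 0.0, 0.5], [0.3, 0.3, 0.4]]
g = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
N, l, m = len(f), len(f[0]), len(g[0])

# Hard: sum over all N^2 object pairs.
slow = sum(sum(f[i][k] * f[j][k] for k in range(l)) *
           sum(g[i][q] * g[j][q] for q in range(m))
           for i in range(N) for j in range(N))

# Easy: sim(f_k, g_q) = sum_i f(ti).p_k * g(ti).p_q, accumulated in one
# scan; then V_f . V_g = sum over (k, q) of sim(f_k, g_q)^2.
sim_kq = [[sum(f[i][k] * g[i][q] for i in range(N)) for q in range(m)]
          for k in range(l)]
fast = sum(s * s for row in sim_kq for s in row)

print(abs(slow - fast) < 1e-12)  # True
```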
53. Similarity between Categorical and Numerical Features

        V_h · V_f = 2 Σ_{i=1}^{N} Σ_{j<i} sim_h(t_i, t_j) · sim_f(t_i, t_j)
                  = 2 Σ_{i=1}^{N} Σ_{k=1}^{l} f(t_i).p_k (1 − h(t_i)) ( Σ_{j<i} f(t_j).p_k )
                    + 2 Σ_{i=1}^{N} Σ_{k=1}^{l} f(t_i).p_k ( Σ_{j<i} h(t_j) · f(t_j).p_k )

    With objects ordered by h, the leading factors depend only on t_i, while the parenthesized sums depend on all t_j with j < i and can be maintained incrementally during a single scan.
54. Similarity between Numerical Features
    - To compute the similarity V_h · V_g between numerical features h and g, suppose the objects are ordered according to h
    - Scan the objects in order of h; when scanning each object t*, maintain the set of objects t with 0 < h(t*) − h(t) < σ_h in a binary tree sorted by g
    - Update V_h · V_g using all t with 0 < h(t*) − h(t) < σ_h and |g(t) − g(t*)| < σ_g
    [Diagram: a search tree containing the objects t with 0 < h(t*) − h(t) < σ_h, sorted by g(t)]
56. Searching for Pertinent Features
    - Different features convey different aspects of information:
        Research area: research group area, conferences of papers, advisor
        Academic performance: GPA, GRE score, number of papers
        Demographic info: permanent address, nationality
    - Features conveying the same aspect of information usually cluster objects in similar ways, e.g., research group areas vs. conferences of publications
    - Given the user-specified feature, find pertinent features by computing feature similarity
57. Heuristic Search for Pertinent Features
    Overall procedure:
    1. Start from the user-specified feature
    2. Search in the neighborhood of existing pertinent features
    3. Expand the search range gradually
    [Schema diagram with the user hint on Student, the target of clustering]
    - Tuple ID propagation [Yin et al.'04] is used to create multi-relational features: the IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple
59. Clustering with Multi-Relational Features
    Given a set of L pertinent features f_1, …, f_L, the similarity between two objects is

        sim(t1, t2) = Σ_{i=1}^{L} sim_{f_i}(t1, t2) · f_i.weight

    - The weight of a feature is determined during feature search by its similarity with the other pertinent features
    - For clustering, we use CLARANS, a scalable k-medoids algorithm [Ng & Han'94]
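The weighted combination can be sketched as follows; the feature names and weights here are illustrative, not taken from the paper:

```python
# Sketch: object similarity as a weighted sum of per-feature
# similarities (categorical features, dot-product similarity).

def combined_sim(t1_feats, t2_feats, weights):
    """sim(t1, t2) = sum_i sim_{f_i}(t1, t2) * f_i.weight"""
    total = 0.0
    for name, w in weights.items():
        p1, p2 = t1_feats[name], t2_feats[name]
        total += w * sum(a * b for a, b in zip(p1, p2))
    return total

weights = {"course_area": 1.0, "group_area": 0.8}  # found in feature search
t1 = {"course_area": [0.5, 0.5, 0.0], "group_area": [1.0, 0.0, 0.0]}
t2 = {"course_area": [0.3, 0.3, 0.4], "group_area": [0.5, 0.5, 0.0]}
print(round(combined_sim(t1, t2, weights), 2))  # 0.3*1.0 + 0.5*0.8 = 0.7
```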
61. Experiments: Compare CrossClus with
    - Baseline: use only the user-specified feature
    - PROCLUS [Aggarwal et al.'99]: a state-of-the-art subspace clustering algorithm
        - Uses a subset of features for each cluster
        - We convert the relational database into a single table by propositionalization
        - The user-specified feature is forced to be used in every cluster
    - RDBC [Kirsten & Wrobel'00]: a representative ILP clustering algorithm
        - Uses neighbor information of objects for clustering
        - The user-specified feature is forced to be used
62. Clustering Accuracy
    - To verify that CrossClus captures the user's clustering goal, we define the "accuracy" of a clustering
    - Given a clustering task, manually find all features that contain information directly related to it — the standard feature set. E.g., for clustering students by research area, the standard feature set is: research group, group areas, advisors, conferences of publications, course areas
    - The accuracy of a clustering result C is how similar it is to the clustering C̄ generated by the standard feature set:

        deg(C ⊂ C̄) = ( Σ_{i=1}^{n} max_{1≤j≤n̄} |c_i ∩ c̄_j| ) / ( Σ_{i=1}^{n} |c_i| )
        sim(C, C̄) = ( deg(C ⊂ C̄) + deg(C̄ ⊂ C) ) / 2
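The two formulas can be sketched as follows, with clusterings represented as lists of sets (the toy clusterings are ours):

```python
# Sketch of the accuracy measure: deg(C in C') is the fraction of
# objects covered by each cluster's best match in C'; sim symmetrizes.

def deg(C, Cp):
    """Sum of best-match overlaps, normalized by total cluster size."""
    covered = sum(max(len(ci & cj) for cj in Cp) for ci in C)
    return covered / sum(len(ci) for ci in C)

def clustering_sim(C, Cp):
    """Symmetric average of the two coverage degrees."""
    return (deg(C, Cp) + deg(Cp, C)) / 2

C1 = [{1, 2, 3}, {4, 5}]
C2 = [{1, 2}, {3, 4, 5}]
print(clustering_sim(C1, C1))  # identical clusterings score 1.0
print(clustering_sim(C1, C2))  # 0.8
```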
63. Measure of Clustering Accuracy
    - Accuracy is measured against manually labeled data: we manually assign tuples to clusters according to their properties (e.g., professors in different research areas)
    - Accuracy of a clustering: the percentage of pairs of tuples in the same cluster that share a common label
    - This measure favors many small clusters, so we let each approach generate the same number of clusters
64. CS Dept Dataset
    [Bar chart: clustering accuracy on the CS Dept dataset for CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC, with the hints Group, Course, and Group+Course]
    Ground-truth research areas:
    - Theory: J. Erickson, S. Har-Peled, L. Pitt, E. Ramos, D. Roth, M. Viswanathan
    - Graphics: J. Hart, M. Garland, Y. Yu
    - Database: K. Chang, A. Doan, J. Han, M. Winslett, C. Zhai
    - Numerical computing: M. Heath, T. Kerkhoven, E. de Sturler
    - Networking & QoS: R. Kravets, M. Caccamo, J. Hou, L. Sha
    - Artificial intelligence: G. Dejong, M. Harandi, J. Ponce, L. Rendell
    - Architecture: D. Padua, J. Torrellas, C. Zilles, S. Adve, M. Snir, D. Reed, V. Adve
    - Operating systems: D. Mickunas, R. Campbell, Y. Zhou
65. DBLP Dataset
    [Bar chart: clustering accuracy on the DBLP dataset for CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC, with the hints Conference, Coauthor, Word, and their combinations]
66. Scalability w.r.t. Data Size and Number of Relations
    [Charts: running time vs. data size and vs. number of relations]
67. CrossClus: Summary
    - User guidance, even in a very simple form, plays an important role in multi-relational clustering
    - CrossClus finds pertinent features by computing the similarities between features
68. Multirelational Data Mining
    - Classification over multiple relations in databases
    - Clustering over multiple relations by user guidance
    - LinkClus: efficient clustering by exploring the power-law distribution
    - DISTINCT: distinguishing objects with identical names by link analysis
    - Mining across multiple heterogeneous data and information repositories
    - Summary
69. Link-Based Clustering: Motivation
    [Tripartite link diagram — Authors: Tom, Mike, Cathy, John, Mary; Proceedings: sigmod03–05, vldb03–05, aaai04–05; Conferences: sigmod, vldb, aaai]
    Questions:
    - Q1: How do we cluster each type of object?
    - Q2: How do we define similarity between objects of each type?
70. Link-Based Similarities
    - Two objects are similar if they are linked with the same or similar objects
    - SimRank [Jeh & Widom, 2002]: the similarity between two objects x and y is defined as the average similarity between the objects linked with x and those linked with y
    - But it is expensive to compute: for a dataset of N objects and M links, it takes O(N²) space and O(M²) time to compute all similarities
    [Link diagram: Tom and Mary, and Tom and Mike, linked through sigmod03–05; Cathy and John linked through vldb03–05]
71. Observation 1: Hierarchical Structures
    Hierarchical structures often exist naturally among objects (e.g., a taxonomy of animals):
    - A hierarchical structure of products in Walmart: all → grocery, electronics, apparel; electronics → TV, DVD, camera
    - Relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004)
72. Observation 2: Distribution of Similarity
    [Histogram: distribution of SimRank similarity values among DBLP authors]
    A power-law distribution exists in the similarities:
    - 56% of the similarity entries are in [0.005, 0.015]
    - Only 1.4% of the similarity entries are larger than 0.1
    - Our goal: design a data structure that stores the significant similarities and compresses the insignificant ones
73. Our Data Structure: SimTree
    - Each leaf node represents an object
    - Each non-leaf node represents a group of similar lower-level nodes
    - Similarities between sibling nodes are stored
    [Example tree: consumer electronics → digital cameras (Canon A40 digital camera, Sony V3 digital camera) and TVs; a sibling subtree for apparel]
74. Similarity Defined by SimTree
    [Tree diagram: siblings n1, n2, n3; their children n4, n5, n6; leaves n7, n8, n9; stored similarities between sibling nodes (e.g., s(n1,n2) = 0.2, s(n4,n5) = 0.3) and child-to-parent edge values (e.g., s(n7,n4) = 0.9)]
    - Path-based node similarity: sim_p(n7, n8) = s(n7, n4) × s(n4, n5) × s(n5, n8)
    - The similarity between two nodes is the average similarity between the objects linked with them in other SimTrees
    - Adjustment ratio for node x = (average similarity between x and all other nodes) / (average similarity between x's parent and all other nodes)
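Path-based similarity can be sketched as a product of stored similarities along the path; the edge values below are illustrative, since the slide's exact tree is only partially recoverable:

```python
# Sketch: sim_p(n7, n8) = s(n7, n4) * s(n4, n5) * s(n5, n8), multiplying
# the stored similarities along the path. All values are illustrative.

s = {
    ("n7", "n4"): 0.9,   # leaf-to-parent edge value
    ("n4", "n5"): 0.3,   # stored similarity between sibling nodes
    ("n5", "n8"): 0.8,   # parent-to-leaf edge value
}

def sim_p(path):
    """Product of stored similarities along a node path."""
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= s.get((a, b), s.get((b, a), 0.0))
    return prob

print(round(sim_p(["n7", "n4", "n5", "n8"]), 3))  # 0.9*0.3*0.8 = 0.216
```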
75. Overview of LinkClus
    - Initialize a SimTree for the objects of each type
    - Repeat:
        - For each SimTree, update the similarities between its nodes using the similarities in other SimTrees (the similarity between two nodes x and y is the average similarity between the objects linked with them)
        - Adjust the structure of each SimTree: assign each node to the parent node it is most similar to
76. Initialization of SimTrees
    - Repeatedly find groups of tightly related nodes, which are merged into a higher-level node
    - Tightness of a group of nodes: for a group {n1, …, nk}, tightness is defined as the number of leaf nodes in other SimTrees that are connected to all of n1, …, nk
    [Example: with leaf nodes 1–5 in another SimTree, the tightness of {n1, n2} is 3]
77. Initialization of SimTrees (continued)
    - Finding tight groups is reduced to frequent pattern mining: the tightness of a group of nodes is the support of a frequent pattern
    - Example transactions (one per leaf node in the other SimTree): 1: {n1}; 2: {n1, n2}; 3: {n2}; 4: {n1, n2}; 5: {n1, n2}; 6: {n2, n3, n4}; 7: {n4}; 8: {n3, n4}; 9: {n3, n4} — yielding the groups g1 = {n1, n2} and g2 = {n3, n4}
    - Procedure for initializing a tree: start from the leaf nodes (level 0); at each level l, find non-overlapping groups of similar nodes with frequent pattern mining
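The tightness computation maps directly onto support counting; the transactions below are the slide's example:

```python
# Sketch: each leaf in the other SimTree is a transaction listing the
# nodes it connects to; the tightness of a group is its support.

transactions = {
    1: {"n1"}, 2: {"n1", "n2"}, 3: {"n2"}, 4: {"n1", "n2"},
    5: {"n1", "n2"}, 6: {"n2", "n3", "n4"}, 7: {"n4"},
    8: {"n3", "n4"}, 9: {"n3", "n4"},
}

def tightness(group):
    """Number of leaf transactions containing every node in the group."""
    return sum(1 for items in transactions.values() if group <= items)

print(tightness({"n1", "n2"}))  # 3 -> merged into group g1
print(tightness({"n3", "n4"}))  # 3 -> merged into group g2
print(tightness({"n1", "n3"}))  # 0
```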
78. Updating Similarities between Nodes
    - The initial similarities can seldom capture the relationships between objects, so the similarities are updated iteratively
    - The similarity between two nodes is the average similarity between the objects linked with them
    [Example with two SimTrees ST1 and ST2: sim(na, nb) is the average similarity between the nodes 11 and 14 linked with na and nb; computing it pairwise takes O(3×2) time]
79. Aggregation-Based Similarity Computation
    [Example: leaves n10, n11, n12 (linked with a) under n4, with s(n10,n4) = 0.9, s(n11,n4) = 1.0, s(n12,n4) = 0.8; leaves n13, n14 (linked with b) under n5, with s(n13,n5) = 0.9, s(n14,n5) = 1.0; s(n4,n5) = 0.2]
    For each node n_k ∈ {n10, n11, n12} and n_l ∈ {n13, n14}, the path-based similarity is sim_p(n_k, n_l) = s(n_k, n4) · s(n4, n5) · s(n5, n_l). Thus

        sim(na, nb) = ( Σ_{k=10}^{12} s(n_k, n4) / 3 ) · s(n4, n5) · ( Σ_{l=13}^{14} s(n_l, n5) / 2 ) = 0.171

    which takes O(3+2) time. After aggregation, the quadratic-time computation is reduced to linear time.
80. Computing Similarity with Aggregation
    - Aggregated (average similarity, total weight): a: (0.9, 3); b: (0.95, 2)
    - sim(na, nb) can be computed from the aggregated similarities:
        sim(na, nb) = avg_sim(na, n4) × s(n4, n5) × avg_sim(nb, n5) = 0.9 × 0.2 × 0.95 = 0.171
    - To compute sim(na, nb):
        1. Find all pairs of sibling nodes ni and nj such that na is linked with ni and nb with nj
        2. Calculate the similarity (and weight) between na and nb w.r.t. ni and nj
        3. Calculate the weighted average similarity between na and nb over all such pairs
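The example can be checked numerically: averaging each side once gives the same answer as the 3 × 2 pairwise path products.

```python
# Reproduce the slide's aggregation example: one multiplication of
# per-side averages replaces the 3 x 2 pairwise path products.

s_to_n4 = [0.9, 1.0, 0.8]   # s(n10,n4), s(n11,n4), s(n12,n4)
s_to_n5 = [0.9, 1.0]        # s(n13,n5), s(n14,n5)
s_n4_n5 = 0.2

# Pairwise (quadratic) computation: average of all path similarities.
pairwise = sum(a * s_n4_n5 * b for a in s_to_n4 for b in s_to_n5)
pairwise /= len(s_to_n4) * len(s_to_n5)

# Aggregated (linear) computation.
avg_a = sum(s_to_n4) / len(s_to_n4)   # 0.9
avg_b = sum(s_to_n5) / len(s_to_n5)   # 0.95
aggregated = avg_a * s_n4_n5 * avg_b

print(round(aggregated, 3))  # 0.171
```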
81. Adjusting SimTree Structures
    [Diagram: node n7 is moved from being a child of n4 to being a child of n5, to which it is more similar]
    - After the similarities change, the tree structure may also need to change
    - If a node is more similar to its parent's sibling, move it to be a child of that sibling
    - Try to move each node to the parent's sibling it is most similar to, under the constraint that each parent node can have at most c children
82. Complexity
    For two types of objects, N of each, and M linkages between them:

        Step                      | Time          | Space
        Updating similarities     | O(M (log N)²) | O(M + N)
        Adjusting tree structures | O(N)          | O(N)
        LinkClus (total)          | O(M (log N)²) | O(M + N)
        SimRank                   | O(M²)         | O(N²)
83. Empirical Study
    - Generating clusters from a SimTree: suppose K clusters are to be generated; find the level in the SimTree whose number of nodes is closest to K, then merge the most similar nodes or divide the largest nodes on that level to get exactly K clusters
    - Accuracy: measured against manually labeled data, as the percentage of pairs of objects in the same cluster that share a common label
    - Efficiency and scalability: measured w.r.t. the number of objects, clusters, and linkages
84. Experiment Setup
    - DBLP dataset: the 4170 most productive authors and 154 well-known conferences with the most proceedings
        - Manually labeled the research areas of the 400 most productive authors according to their home pages (or publications)
        - Manually labeled the areas of the 154 conferences according to their calls for papers
    - Approaches compared:
        - SimRank (Jeh & Widom, KDD 2002): computes pairwise similarities
        - SimRank with FingerPrints (F-SimRank; Fogaras & Rácz, WWW 2005): pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity
        - ReCom (Wang et al., SIGIR 2003): iteratively clusters objects using the cluster labels of linked objects
85. Accuracy
    [Line charts: accuracy vs. iteration on conferences and authors for LinkClus, SimRank, ReCom, and F-SimRank]

        Approach  | Accr-Author | Accr-Conf | Average time (sec)
        LinkClus  | 0.957       | 0.723     | 76.7
        SimRank   | 0.958       | 0.760     | 1020
        ReCom     | 0.907       | 0.457     | 43.1
        F-SimRank | 0.908       | 0.583     | 83.6
86. Accuracy (continued)
    [Charts: accuracy vs. running time for LinkClus, SimRank, ReCom, F-SimRank, and P-SimRank]
    - LinkClus is almost as accurate as SimRank (the most accurate) and is much more efficient
87. Email Dataset
    - F. Nielsen. Email dataset. http://www.imm.dtu.dk/∼rem/data/Email-1431.zip
    - 370 emails on conferences, 272 on jobs, and 789 spam emails

        Approach  | Accuracy | Total time (sec)
        LinkClus  | 0.8026   | 1579.6
        SimRank   | 0.7965   | 39160
        ReCom     | 0.5711   | 74.6
        F-SimRank | 0.3688   | 479.7
        CLARANS   | 0.4768   | 8.55
88. Scalability (1)
    - Tested on synthetic datasets with randomly generated clusters
    - Scalability w.r.t. the number of objects, with the number of clusters fixed at 40
    [Charts: running time and accuracy vs. objects per relation for LinkClus, SimRank, ReCom, and F-SimRank, with O(N), O(N (log N)²), and O(N²) reference curves]
89. Scalability (2)
    - Scalability w.r.t. the number of objects and clusters, with each cluster of fixed size (100 objects)
    [Charts: running time and accuracy vs. objects per relation, with O(N), O(N (log N)²), and O(N²) reference curves]
90. Scalability (3)
    - Scalability w.r.t. the number of linkages from each object
    [Charts: running time and accuracy vs. selectivity, with O(S) and O(S²) reference curves]
91. Multirelational Data Mining
    - Classification over multiple relations in databases
    - Clustering over multiple relations by user guidance
    - LinkClus: efficient clustering by exploring the power-law distribution
    - DISTINCT: distinguishing objects with identical names by link analysis
    - Mining across multiple heterogeneous data and information repositories
    - Summary
92. People/Objects Do Share Names
    Why distinguish objects with identical names? Different objects may share the same name:
    - In AllMusic.com, 72 songs and 3 albums are named "Forgotten" or "The Forgotten"
    - In DBLP, 141 papers are written by at least 14 different authors named "Wei Wang"
    - How can we distinguish the authors of those 141 papers?
93. Example: Papers by Four Different "Wei Wang"s
    (1) Wei Wang at UNC:
        - Wei Wang, Jiong Yang, Richard Muntz. VLDB 1997
        - Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu. SIGMOD 2002
        - Jiong Yang, Hwanjo Yu, Wei Wang, Jiawei Han. CSB 2003
        - Jiong Yang, Jinze Liu, Wei Wang. KDD 2004
        - Jinze Liu, Wei Wang. ICDM 2004
    (2) Wei Wang at UNSW, Australia:
        - Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Yu. VLDB 2004
        - Hongjun Lu, Yidong Yuan, Wei Wang, Xuemin Lin. ICDE 2005
        - Wei Wang, Xuemin Lin. ADMA 2005
    (3) Wei Wang at Fudan Univ., China:
        - Wei Wang, Jian Pei, Jiawei Han. CIKM 2002
        - Yongtai Zhu, Wei Wang, Jian Pei, Baile Shi, Chen Wang. KDD 2004
        - Haixun Wang, Wei Wang, Baile Shi, Peng Wang. ICDM 2005
    (4) Wei Wang at SUNY Buffalo:
        - Aidong Zhang, Yuqing Song, Wei Wang. WWW 2003
    [The diagram also links in papers by shared coauthors, e.g., Jian Pei, Jiawei Han, Hongjun Lu, et al. (ICDM 2001) and Jian Pei, Daxin Jiang, Aidong Zhang (ICDE 2005)]
94. Challenges of Object Distinction
    - Related to duplicate detection, but textual similarity cannot be used
    - Different references appear in different contexts (e.g., different papers), and thus seldom share common attributes
    - Each reference is associated with limited information
    - We therefore need to carefully design an approach that uses all the information we have
95. Overview of DISTINCT
    - Measure the similarity between references using:
        - Linkages between references: as shown by the self-loop property, references to the same object are more likely to be connected
        - The neighbor tuples of each reference, which can indicate the similarity between their contexts
    - Cluster the references according to their similarities
96. Similarity 1: Link-Based Similarity
    - Indicates the overall strength of the connections between two references
    - We use the random walk probability between the two tuples containing the references
    - Random walk probabilities along different join paths are handled separately, because different join paths have different semantic meanings
    - Only join paths of length at most 2L are considered (L is the number of steps of propagating probabilities)
97. Example of Random Walk
    [Diagram: a random walk starting from the reference "Wei Wang" in paper vldb/wangym97]
    - Publish: vldb/wangym97 links to the authors Wei Wang, Jiong Yang, and Richard Muntz (edge probabilities from the slide: 1.0 into the paper's tuples, 0.5 to each coauthor's tuple)
    - Publications: vldb/wangym97 — "STING: A Statistical Information Grid Approach to Spatial Data Mining" (probability 1.0)
    - Proceedings: vldb/vldb97 — Very Large Data Bases, 1997, Athens, Greece (probability 1.0)
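A hypothetical sketch of the propagation: start from one reference's tuple with probability 1.0 and split probability evenly over joinable tuples at each step. The tuple identifiers below are invented for illustration.

```python
# Toy random walk over join links; the node names are invented.

links = {
    "ref:WeiWang@wangym97": ["publish:wangym97/WeiWang"],
    "publish:wangym97/WeiWang": ["paper:wangym97"],
    "paper:wangym97": ["publish:wangym97/JiongYang",
                       "publish:wangym97/RichardMuntz"],
}

def walk(start, steps):
    """Split probability evenly over out-links for `steps` hops."""
    probs = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, p in probs.items():
            outs = links.get(node, [])
            for o in outs:
                nxt[o] = nxt.get(o, 0.0) + p / len(outs)
        probs = nxt
    return probs

# After 3 hops, each coauthor's Publish tuple holds probability 0.5.
print(walk("ref:WeiWang@wangym97", 3))
```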
98. Similarity 2: Neighborhood Similarity
    - Find the neighbor tuples of each reference: the tuples within L joins
    - Weights of neighbor tuples: different neighbor tuples have different strengths of connection to a reference, so each neighbor tuple is assigned a weight — the probability of walking from the reference to that tuple
    - Similarity: the set resemblance between the two sets of neighbor tuples
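Set resemblance over weighted neighbor sets can be sketched as a weighted Jaccard measure (our formulation; DISTINCT's exact weighting may differ). The neighbor weights below are illustrative walk probabilities.

```python
# Sketch: neighborhood similarity as weighted set resemblance.

def set_resemblance(nb1, nb2):
    """Weighted Jaccard: sum of min weights over sum of max weights."""
    keys = set(nb1) | set(nb2)
    inter = sum(min(nb1.get(k, 0.0), nb2.get(k, 0.0)) for k in keys)
    union = sum(max(nb1.get(k, 0.0), nb2.get(k, 0.0)) for k in keys)
    return inter / union if union else 0.0

ref1 = {"JiongYang": 0.5, "RichardMuntz": 0.5}
ref2 = {"JiongYang": 0.4, "JiaweiHan": 0.6}
print(set_resemblance(ref1, ref2))  # 0.4 / 1.6 = 0.25
```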
