- 1. Data Mining: Concepts and Techniques — Chapter 9 — 9.3. Multirelational Data Mining. Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2006 Jiawei Han and Micheline Kamber. All rights reserved. Acknowledgements: Xiaoxin Yin. (Slide footer: 04/06/12, Data Mining: Principles and Algorithms)
- 2. Multirelational Data Mining. Classification over multiple relations in databases; Clustering over multiple relations by user guidance; LinkClus: efficient clustering by exploring the power-law distribution; Distinct: distinguishing objects with identical names by link analysis; Mining across multiple heterogeneous data and information repositories; Summary
- 3. Outline. Theme: "Knowledge is power, but knowledge is hidden in massive links." Starting with PageRank and HITS; CrossMine: classification over multiple relations by link analysis; CrossClus: clustering over multiple relations by user guidance; More recent work and conclusions
- 4. Traditional Data Mining. Works on single "flat" relations (figure: Contact, Doctor, and Patient tables flattened into one). Flattening loses information about linkages and relationships, and cannot utilize information in database structures or schemas.
- 5. Multi-Relational Data Mining (MRDM). Motivation: most structured data are stored in relational databases, and MRDM can utilize linkage and structural information. Knowledge discovery in multi-relational environments: multi-relational rules, multi-relational clustering, multi-relational classification, multi-relational linkage analysis, …
- 6. Applications of MRDM. e-Commerce: discovering patterns involving customers, products, manufacturers, … Bioinformatics/medical databases: discovering patterns involving genes, patients, diseases, … Networking security: discovering patterns involving hosts, connections, services, … Many other relational data sources. Example: Evidence Extraction and Link Discovery (EELD), a DARPA-funded project that emphasizes multi-relational and multi-database linkage analysis.
- 7. Importance of Multi-relational Classification (from the EELD Program Description): "The objective of the EELD Program is to research, develop, demonstrate, and transition critical technology that will enable significant improvement in our ability to detect asymmetric threats …, e.g., a loosely organized terrorist group. … Patterns of activity that, in isolation, are of limited significance but, when combined, are indicative of potential threats, will need to be learned. Addressing these threats can only be accomplished by developing a new level of autonomic information surveillance and analysis to extract, discover, and link together sparse evidence from vast amounts of data sources, in different formats and with differing types and degrees of structure, to represent and evaluate the significance of the related evidence, and to learn patterns to guide the extraction, discovery, linkage and evaluation processes."
- 8. MRDM Approaches. Inductive Logic Programming (ILP): find models that are coherent with background knowledge. Multi-relational clustering analysis: clustering objects with multi-relational information. Probabilistic relational models: model cross-relational probability distributions. Efficient multi-relational classification: the CrossMine approach [Yin et al., 2004].
- 9. Inductive Logic Programming (ILP). Find a hypothesis that is consistent with the background knowledge (training data). Systems: FOIL, Golem, Progol, TILDE, … Background knowledge consists of relations (predicates) and tuples (ground facts). Example:
    Background knowledge: Parent(ann, mary), Parent(ann, tom), Parent(tom, eve), Parent(tom, ian); Female(ann), Female(mary), Female(eve)
    Training examples: Daughter(mary, ann) +, Daughter(eve, tom) +, Daughter(tom, ann) −, Daughter(eve, ann) −
- 10. Inductive Logic Programming (ILP). Hypothesis: usually a set of rules that can predict certain attributes in certain relations, e.g., Daughter(X,Y) ← Female(X), Parent(Y,X)
- 11. FOIL: First-Order Inductive Learner. Finds a set of rules consistent with the training data, e.g., female(X), parent(Y,X) → daughter(X,Y). A top-down, sequential covering learner (figure: the sets of examples covered by Rules 1, 2, and 3 within all examples). Each rule is built by heuristics, using foil gain, a special type of information gain.
- 12. ILP Approaches. Top-down approaches (e.g., FOIL): while (enough examples left) { generate a rule; remove examples satisfying this rule }. Bottom-up approaches (e.g., Golem): use each example as a rule, then generalize by merging rules. Decision tree approaches (e.g., TILDE).
- 13. ILP: Pros and Cons. Advantages: expressive and powerful; rules are understandable. Disadvantages: inefficient for databases with complex schemas; not appropriate for continuous attributes.
- 14. Automatically Classifying Objects Using Multiple Relations. Why not convert multi-relational data into a single table by joins? Relational databases are designed by domain experts via semantic modeling (e.g., E-R modeling); indiscriminate joins may lose essential information; and one universal relation hurts efficiency, scalability, and semantics preservation. Our approach to multi-relational classification: automatically classifying objects using multiple relations.
- 15. An Example: Loan Applications. (figure: an applicant applies for a loan; the bank asks the backend database to decide whether to approve.)
- 16. The Backend Database. The target relation is Loan; each tuple has a class label indicating whether the loan is paid on time. The schema:
    Loan (loan-id, account-id, date, amount, duration, payment)
    Account (account-id, district-id, frequency, date)
    District (district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)
    Card (card-id, disp-id, type, issue-date)
    Transaction (trans-id, account-id, date, type, operation, amount, balance, symbol)
    Disposition (disp-id, account-id, client-id, type)
    Order (order-id, account-id, bank-to, account-to, amount, type)
    Client (client-id, birth-date, gender, district-id)
How to make decisions on loan applications?
- 17. Roadmap. Motivation; Rule-based Classification; Tuple ID Propagation; Rule Generation; Negative Tuple Sampling; Performance Study
- 18. Rule-based Classification. (figure: an applicant who ever bought a house and lives in Chicago → Approve; an applicant who just applied for a credit card → Reject; …)
- 19. Rule Generation. Search for good predicates across multiple relations.
    Loan Applications (target relation):
        Loan ID  Account ID  Amount  Duration  Decision
        1        124         1000    12        Yes
        2        124         4000    12        Yes
        3        108         10000   24        No
        4        45          12000   36        No
    Accounts:
        Account ID  Frequency  Open date  District ID
        128         monthly    02/27/96   61820
        108         weekly     09/23/95   61820
        45          monthly    12/09/94   61801
        67          weekly     01/01/95   61822
    Other relations: Orders, Districts, …
- 20. Previous Approaches. Inductive Logic Programming (ILP): to build a rule, repeatedly find the best predicate; to evaluate a predicate on relation R, first join the target relation with R. This is not scalable: the search space is huge (numerous candidate predicates), and evaluating each predicate is inefficient. For example, to evaluate the predicate Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, 'monthly', ?), the loan relation must first be joined with the account relation. CrossMine is more scalable and more than one hundred times faster on datasets of reasonable size.
- 21. CrossMine: An Efficient and Accurate Multi-relational Classifier. Tuple-ID propagation: an efficient and flexible method for virtually joining relations, confining the rule search process to promising directions. Look-one-ahead: a more powerful search strategy. Negative tuple sampling: improves efficiency while maintaining accuracy.
- 22. Roadmap. Motivation; Rule-based Classification; Tuple ID Propagation; Rule Generation; Negative Tuple Sampling; Performance Study
- 23. Tuple ID Propagation. Propagate the tuple IDs of the target relation to non-target relations; this virtually joins relations and avoids the high cost of physical joins.
    Loan Applications (target relation):
        Loan ID  Account ID  Amount  Duration  Decision
        1        124         1000    12        Yes
        2        124         4000    12        Yes
        3        108         10000   24        No
        4        45          12000   36        No
    Accounts with propagated IDs:
        Account ID  Frequency  Open date  Propagated IDs  Labels
        124         monthly    02/27/93   1, 2            2+, 0−
        108         weekly     09/23/97   3               0+, 1−
        45          monthly    12/09/96   4               0+, 1−
        67          weekly     01/01/97   null            0+, 0−
    Possible predicates: Frequency = 'monthly': 2+, 1−; Open date < 01/01/95: 2+, 0−
- 24. Tuple ID Propagation (cont.). Efficient: only the tuple IDs are propagated, so time and space usage is low. Flexible: IDs can be propagated among non-target relations, and many sets of IDs, propagated along different join paths, can be kept on one relation. (figure: IDs flow from the target relation through R1, R2, and R3.)
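The propagation step above can be sketched in a few lines. This is a toy illustration with in-memory dictionaries standing in for the Loan and Account relations; the names `loans`, `accounts`, and `propagate_ids` are hypothetical, not from CrossMine:

```python
# Target relation Loan: loan_id -> (account_id, class_label)
loans = {1: (124, '+'), 2: (124, '+'), 3: (108, '-'), 4: (45, '-')}

# Non-target relation Account: account_id -> frequency attribute
accounts = {124: 'monthly', 108: 'weekly', 45: 'monthly', 67: 'weekly'}

def propagate_ids(loans, accounts):
    """Attach to each account the IDs (and labels) of the joinable loan tuples."""
    prop = {acc_id: [] for acc_id in accounts}
    for loan_id, (acc_id, label) in loans.items():
        if acc_id in prop:
            prop[acc_id].append((loan_id, label))
    return prop

prop = propagate_ids(loans, accounts)
# Account 124 now carries loan IDs 1 and 2 with labels 2+, 0-;
# account 67 joins with no loan, matching the "null" row on the slide.
```

With the IDs in place, any predicate on Account can be evaluated by counting the attached labels instead of materializing the join.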
- 25. Roadmap. Motivation; Rule-based Classification; Tuple ID Propagation; Rule Generation; Negative Tuple Sampling; Performance Study
- 26. Overall Procedure. Sequential covering algorithm: while (enough target tuples left) { generate a rule; remove positive target tuples satisfying this rule }. (figure: positive examples covered by Rules 1, 2, and 3.)
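The sequential-covering loop above can be sketched as follows. This is a minimal, hypothetical skeleton: a "rule" is just a predicate function, and `learn_rule` stands in for the rule-generation step described on the next slides:

```python
def sequential_covering(positives, negatives, learn_rule, min_left=1):
    """Repeatedly learn a rule and remove the positives it covers."""
    rules = []
    remaining = list(positives)
    while len(remaining) >= min_left:
        rule = learn_rule(remaining, negatives)
        if rule is None:
            break
        covered = [x for x in remaining if rule(x)]
        if not covered:          # no progress: stop to avoid looping forever
            break
        rules.append(rule)
        remaining = [x for x in remaining if not rule(x)]
    return rules, remaining

# Toy usage: a trivial learner whose rule covers exactly one positive example.
positives, negatives = [1, 2, 3, 4], [9, 10]
def learn_one(pos, neg):
    v = pos[0]
    return lambda x, v=v: x == v

rules, left = sequential_covering(positives, negatives, learn_one)
# every positive example ends up covered by some rule
```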
- 27. Rule Generation. To generate a rule: while (true) { find the best predicate p; if foil-gain(p) > threshold then add p to the current rule, else break }. (figure: the rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, progressively separating positive from negative examples.)
- 28. Evaluating Predicates. All predicates in a relation can be evaluated based on the propagated IDs, using foil gain. Suppose the current rule is r; for a predicate p,
    foil-gain(p) = P(r+p) × [ log( P(r+p) / (P(r+p) + N(r+p)) ) − log( P(r) / (P(r) + N(r)) ) ]
where P(·) and N(·) count the positive and negative examples satisfying a rule, and r+p is r with p appended. Categorical attributes: compute foil gain directly. Numerical attributes: discretize with every possible value.
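The foil-gain formula is direct to implement from the four counts. A minimal sketch (the function name is mine), applied to the two candidate predicates from the tuple-ID-propagation slide, where the empty rule covers 2 positives and 2 negatives:

```python
import math

def foil_gain(p_r, n_r, p_rp, n_rp):
    """FOIL gain of appending predicate p to rule r.
    p_r/n_r: positives/negatives satisfying r;
    p_rp/n_rp: positives/negatives satisfying r plus p."""
    if p_rp == 0:
        return 0.0
    return p_rp * (math.log2(p_rp / (p_rp + n_rp)) -
                   math.log2(p_r / (p_r + n_r)))

g_monthly = foil_gain(2, 2, 2, 1)   # Frequency = 'monthly': covers 2+, 1-
g_opendate = foil_gain(2, 2, 2, 0)  # Open date < 01/01/95: covers 2+, 0-
# the date predicate wins, as it keeps both positives and drops all negatives
```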
- 29. Rule Generation. Start from the target relation; only the target relation is active. Repeat: search in all active relations and in all relations joinable to active relations; add the best predicate to the current rule; set the involved relation to active. Until: the best predicate does not have enough gain, or the current rule is too long.
- 30. Rule Generation: Example. (figure: the loan database schema; the target relation is Loan. The first predicate is found in Account and added to the rule, making Account active; the second predicate is then searched for within the expanded range of joinable relations, and the best predicate is added to the rule.)
- 31. Look-one-ahead in Rule Generation. There are two types of relations: entity and relationship. Useful predicates often cannot be found on relationship relations. (figure: no good predicate exists on the relationship relation between the target relation and an entity relation.) Solution in CrossMine: when propagating IDs to a relationship relation, propagate one more step to the next entity relation.
- 32. Roadmap. Motivation; Rule-based Classification; Tuple ID Propagation; Rule Generation; Negative Tuple Sampling; Performance Study
- 33. Negative Tuple Sampling. A rule covers some positive examples, and positive examples are removed once covered. After generating many rules, far fewer positive examples remain than negative ones. (figure: a scatter of examples in which negatives greatly outnumber positives.)
- 34. Negative Tuple Sampling (cont.). When there are many more negative examples than positive ones, good rules cannot be built (low support) and rule generation remains time-consuming (large number of negative examples). Therefore, sample the negative examples; this improves efficiency without affecting rule quality. (figure: the negatives are downsampled while all positives are kept.)
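A minimal sketch of the sampling step, assuming simple uniform random sampling with a cap proportional to the number of remaining positives (the ratio and the function name are illustrative assumptions, not CrossMine's exact policy):

```python
import random

def sample_negatives(positives, negatives, ratio=10, seed=0):
    """Keep all positives; cap negatives at ratio * |positives| by random sampling."""
    cap = ratio * len(positives)
    if len(negatives) <= cap:
        return list(negatives)       # already small enough: keep them all
    rng = random.Random(seed)        # seeded for reproducibility
    return rng.sample(negatives, cap)

pos = [1, 2, 3]
negs = list(range(100, 1100))        # 1000 negatives
kept = sample_negatives(pos, negs)   # at most 10 * 3 = 30 negatives survive
```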
- 35. Roadmap. Motivation; Rule-based Classification; Tuple ID Propagation; Rule Generation; Negative Tuple Sampling; Performance Study
- 36. Synthetic Datasets. (figures: scalability w.r.t. the number of relations; scalability w.r.t. the number of tuples.)
- 37. Real Datasets.
    PKDD Cup 99 dataset, Loan Application (accuracy, time per fold):
        FOIL       74.0%   3338 sec
        TILDE      81.3%   2429 sec
        CrossMine  90.7%   15.3 sec
    Mutagenesis dataset, 4 relations (accuracy, time per fold):
        FOIL       79.7%   1.65 sec
        TILDE      89.4%   25.6 sec
        CrossMine  87.7%   0.83 sec
- 38. Multi-Relational Classification: Summary. Interesting pieces of information often lie across multiple relations, so it is desirable to mine across multiple interconnected relations. New methodology in CrossMine (for classification model building): ID (and class label) propagation leads to efficiency and effectiveness (by preserving semantics); rule generation and negative tuple sampling lead to further performance improvements. Our performance study shows orders-of-magnitude speedups and high accuracy compared with relational mining approaches. Future work: classification in heterogeneous relational databases.
- 39. Multirelational Data Mining. Classification over multiple relations in databases; Clustering over multiple relations by user guidance; LinkClus: efficient clustering by exploring the power-law distribution; Distinct: distinguishing objects with identical names by link analysis; Mining across multiple heterogeneous data and information repositories; Summary
- 40. Multi-Relational and Multi-DB Mining. Classification over multiple relations in databases; Clustering over multiple relations by user guidance; Mining across multi-relational databases; Mining across multiple heterogeneous data and information repositories; Summary
- 41. Motivation 1: Multi-Relational Clustering. Traditional clustering works on a single table, but most data is semantically linked across multiple relations, so we need information in multiple relations. (figure: a CS department schema with relations Professor (name, office, position), Work-In (person, group), Group (name, area), Open-course (course, semester, instructor), Course (course-id, name, area), Publish (author, title, year), Publication (title, year, conf), Advise (professor, student, degree), Register (student, course, semester, unit, grade), and Student (name, office, position); Student is the target of clustering.)
- 42. Motivation 2: User-Guided Clustering. The user usually has a clustering goal, e.g., clustering students by research area, and specifies this goal to CrossClus. (figure: the same schema, with a user hint attached to one attribute and Student as the target of clustering.)
- 43. Comparing with Classification. The user-specified feature (in the form of an attribute) is used as a hint, not as class labels: the attribute may contain too many or too few distinct values (e.g., a user may want to cluster students into 20 clusters instead of 3), and additional features need to be included in the cluster analysis. (figure: the hint attribute covers only part of the data, while all tuples are clustered.)
- 44. Comparing with Semi-supervised Clustering. In semi-supervised clustering [Wagstaff et al.'01, Xing et al.'02], the user provides a training set consisting of "similar" and "dissimilar" pairs of objects. In user-guided clustering, the user specifies an attribute as a hint, and more relevant features are found for clustering. (figure: pairwise constraints vs. an attribute hint, each applied over all tuples for clustering.)
- 45. Semi-supervised Clustering. Much information (in multiple relations) is needed to judge whether two tuples are similar, so a user may not be able to provide a good training set. It is much easier for a user to specify an attribute as a hint, such as a student's research area. (figure: tuples to be compared, e.g., Tom Smith, SC1211, TA vs. Jane Chang, BI205, RA, with the user hint on one attribute.)
- 46. CrossClus: An Overview. Uses a new type of multi-relational feature for clustering; measures the similarity between features by how they cluster objects into groups; uses a heuristic method to search for pertinent features; uses a k-medoids-based algorithm for clustering.
- 47. Roadmap. Overview; Feature Pertinence; Searching for Features; Clustering; Experimental Results
- 48. Multi-relational Features. A multi-relational feature is defined by: a join path (e.g., Student → Register → OpenCourse → Course); an attribute (e.g., Course.area); and, for a numerical feature, an aggregation operator (e.g., sum or average). Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]: the areas of each student's courses.
    Areas of courses (counts)      Values of feature f
        DB  AI  TH                     DB   AI   TH
    t1  5   5   0                  t1  0.5  0.5  0
    t2  0   3   7                  t2  0    0.3  0.7
    t3  1   5   4                  t3  0.1  0.5  0.4
    t4  5   0   5                  t4  0.5  0    0.5
    t5  3   3   4                  t5  0.3  0.3  0.4
Numerical feature, e.g., the average grade of each student: h = [Student → Register, Register.grade, average], e.g., h(t1) = 3.5
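The left-to-right step in the table, turning per-tuple area counts into the feature's value vectors, is just row normalization. A small sketch (the helper name is mine):

```python
def categorical_feature_values(counts):
    """Normalize each tuple's area counts into proportions, as in feature f."""
    return {t: [c / sum(row) for c in row] for t, row in counts.items()}

# Counts of courses per area (DB, AI, TH) for two of the students above
counts = {'t1': [5, 5, 0], 't2': [0, 3, 7]}
vals = categorical_feature_values(counts)
# t1 -> [0.5, 0.5, 0.0] and t2 -> [0.0, 0.3, 0.7], matching the table
```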
- 49. Representing Features. The most important information about a feature f is how f clusters objects into groups, so f is represented by the similarities between every pair of objects indicated by f: the similarity vector V_f. (figure: a surface plot of the pairwise tuple similarities; the horizontal axes are tuple indices and the vertical axis is similarity.) This can be considered a vector of N × N dimensions.
- 50. Similarity between Tuples. Categorical feature f: defined as the probability of t1 and t2 having the same value, e.g., when each of them selects another course, the probability that they select courses of the same area:
    sim_f(t1, t2) = Σ_{k=1..L} f(t1).p_k · f(t2).p_k
e.g., sim_f(t1, t5) = 0.5 × 0.3 + 0.5 × 0.3 = 0.3. Numerical feature h:
    sim_h(t1, t2) = 1 − |h(t1) − h(t2)| / σ_h, if |h(t1) − h(t2)| < σ_h; 0 otherwise
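Both tuple-similarity definitions are one-liners; a minimal sketch using the feature values from the table above (function names are mine):

```python
def sim_categorical(p1, p2):
    """Probability two tuples 'select the same value': dot product of proportions."""
    return sum(a * b for a, b in zip(p1, p2))

def sim_numerical(h1, h2, sigma):
    """1 - |h1 - h2| / sigma when the gap is within sigma, else 0."""
    d = abs(h1 - h2)
    return 1 - d / sigma if d < sigma else 0.0

s_cat = sim_categorical([0.5, 0.5, 0.0], [0.3, 0.3, 0.4])  # t1 vs t5 -> 0.3
s_num = sim_numerical(3.5, 3.0, sigma=1.0)                 # grades -> 0.5
```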
- 51. Similarity between Features. Values of feature f (course area: DB, AI, TH) and feature g (research group: Info sys, Cog sci, Theory):
    Feature f: t1 (0.5, 0.5, 0), t2 (0, 0.3, 0.7), t3 (0.1, 0.5, 0.4), t4 (0.5, 0, 0.5), t5 (0.3, 0.3, 0.4)
    Feature g: t1 (1, 0, 0), t2 (0, 0, 1), t3 (0, 0.5, 0.5), t4 (0.5, 0, 0.5), t5 (0.5, 0.5, 0)
(figures: the corresponding similarity vectors V_f and V_g.) The similarity between two features is the cosine similarity of their similarity vectors:
    Sim(f, g) = (V_f · V_g) / (|V_f| |V_g|)
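A sketch of this feature-similarity measure. Note one simplification: the paper's V_f has all N × N entries, while this version uses only unordered pairs, which keeps the cosine's qualitative behavior but is not numerically identical; names are mine:

```python
import math
from itertools import combinations

def similarity_vector(values, sim):
    """V_f: similarities between every (unordered) pair of tuples under one feature."""
    ids = sorted(values)
    return [sim(values[a], values[b]) for a, b in combinations(ids, 2)]

def feature_similarity(vf, vg):
    """Cosine similarity of the two similarity vectors."""
    dot = sum(x * y for x, y in zip(vf, vg))
    norm = math.sqrt(sum(x * x for x in vf)) * math.sqrt(sum(y * y for y in vg))
    return dot / norm

dot_sim = lambda p, q: sum(a * b for a, b in zip(p, q))
f = {'t1': [0.5, 0.5, 0.0], 't2': [0.0, 0.3, 0.7], 't5': [0.3, 0.3, 0.4]}
g = {'t1': [1.0, 0.0, 0.0], 't2': [0.0, 0.0, 1.0], 't5': [0.5, 0.5, 0.0]}
s_fg = feature_similarity(similarity_vector(f, dot_sim),
                          similarity_vector(g, dot_sim))
```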
- 52. Computing Feature Similarity. Computing V_f · V_g directly from object similarities is hard, but it can be rewritten in terms of feature-value similarities, which are easy to compute:
    V_f · V_g = Σ_{i=1..N} Σ_{j=1..N} sim_f(t_i, t_j) · sim_g(t_i, t_j) = Σ_{k=1..l} Σ_{q=1..m} sim(f_k, g_q)²
where sim(f_k, g_q) = Σ_{i=1..N} f(t_i).p_k · g(t_i).p_q. The similarity between each pair of feature values (e.g., DB vs. Info sys) is computed with one scan over the data.
- 53. Similarity between Categorical and Numerical Features.
    V_h · V_f = 2 Σ_{i=1..N} Σ_{j<i} sim_h(t_i, t_j) · sim_f(t_i, t_j)
              = 2 Σ_{i=1..N} Σ_{k=1..l} f(t_i).p_k (1 − h(t_i)) Σ_{j<i} f(t_j).p_k + 2 Σ_{i=1..N} Σ_{k=1..l} f(t_i).p_k Σ_{j<i} h(t_j) · f(t_j).p_k
The leading factors depend only on t_i; the inner sums depend on all t_j with j < i and can be maintained incrementally when objects are scanned in order of h. (figure: objects ordered by h, with the parts depending on t_i separated from the parts depending on all t_j with j < i.)
- 54. Similarity between Numerical Features. To compute V_h · V_g for numerical features h and g: order objects according to h and scan them in that order. When scanning each object t*, maintain the set of objects t with 0 < h(t*) − h(t) < σ_h in a binary search tree sorted by g, and update V_h · V_g using all t with 0 < h(t*) − h(t) < σ_h and |g(t) − g(t*)| < σ_g. (figure: the scan over objects ordered by h, with the search tree holding the in-window objects sorted by g(t).)
- 55. Roadmap. Overview; Feature Pertinence; Searching for Features; Clustering; Experimental Results
- 56. Searching for Pertinent Features. Different features convey different aspects of information: research area (research group area, conferences of papers, advisor), academic performance (GPA, GRE score, number of papers), demographic info (permanent address, nationality). Features conveying the same aspect of information usually cluster objects in similar ways, e.g., research group areas vs. conferences of publications. Given the user-specified feature, pertinent features are found by computing feature similarity.
- 57. Heuristic Search for Pertinent Features. Overall procedure: 1. Start from the user-specified feature. 2. Search in the neighborhood of existing pertinent features. 3. Expand the search range gradually. (figure: the search spreads over the schema outward from the user hint toward the target of clustering, Student.) Tuple ID propagation [Yin et al.'04] is used to create multi-relational features: the IDs of target tuples can be propagated along any join path, from which we can find the tuples joinable with each target tuple.
- 58. Roadmap. Overview; Feature Pertinence; Searching for Features; Clustering; Experimental Results
- 59. Clustering with Multi-Relational Features. Given a set of L pertinent features f_1, …, f_L, the similarity between two objects is
    sim(t1, t2) = Σ_{i=1..L} sim_{f_i}(t1, t2) · f_i.weight
The weight of a feature is determined during feature search by its similarity with other pertinent features. For clustering, we use CLARANS, a scalable k-medoids algorithm [Ng & Han'94].
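The weighted combination above is a single sum over features; a minimal sketch with two hypothetical feature-similarity functions and made-up weights:

```python
def combined_similarity(t1, t2, weighted_features):
    """sim(t1, t2) = sum_i sim_{f_i}(t1, t2) * f_i.weight."""
    return sum(sim(t1, t2) * w for sim, w in weighted_features)

# Hypothetical features: course-area proportions and average grade
course_sim = lambda a, b: sum(x * y for x, y in zip(a['course'], b['course']))
grade_sim = lambda a, b: max(0.0, 1 - abs(a['grade'] - b['grade']) / 1.0)

t1 = {'course': [0.5, 0.5, 0.0], 'grade': 3.5}
t5 = {'course': [0.3, 0.3, 0.4], 'grade': 3.0}
s = combined_similarity(t1, t5, [(course_sim, 0.7), (grade_sim, 0.3)])
# 0.3 * 0.7 + 0.5 * 0.3 = 0.36
```

In CrossClus this pairwise similarity is what the k-medoids (CLARANS) step consumes.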
- 60. Roadmap. Overview; Feature Pertinence; Searching for Features; Clustering; Experimental Results
- 61. Experiments: Compare CrossClus with: Baseline: only use the user-specified feature. PROCLUS [Aggarwal et al.'99]: a state-of-the-art subspace clustering algorithm that uses a subset of features for each cluster; we convert the relational database to a single table by propositionalization, and the user-specified feature is forced to be used in every cluster. RDBC [Kirsten and Wrobel'00]: a representative ILP clustering algorithm that uses neighbor information of objects for clustering; the user-specified feature is forced to be used.
- 62. Clustering Accuracy. To verify that CrossClus captures the user's clustering goal, we define the "accuracy" of clustering. Given a clustering task, manually find all features that contain information directly related to the task (the standard feature set). E.g., for clustering students by research area, the standard feature set is: research group, group areas, advisors, conferences of publications, course areas. The accuracy of a clustering result C is its similarity to the clustering C′ generated by the standard feature set:
    deg(C ⊆ C′) = Σ_{i=1..n} max_{1≤j≤n′} |c_i ∩ c′_j| / Σ_{i=1..n} |c_i|
    sim(C, C′) = ( deg(C ⊆ C′) + deg(C′ ⊆ C) ) / 2
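The accuracy measure translates directly into set operations; a small sketch operating on clusterings given as lists of clusters (function names are mine):

```python
def deg(C, Cp):
    """deg(C subset-of C'): each cluster of C matched to its best-overlapping
    cluster of C', normalized by the total number of objects in C."""
    C = [set(c) for c in C]
    Cp = [set(c) for c in Cp]
    matched = sum(max(len(c & cp) for cp in Cp) for c in C)
    return matched / sum(len(c) for c in C)

def clustering_similarity(C, Cp):
    """Symmetrized accuracy: average of the two containment degrees."""
    return (deg(C, Cp) + deg(Cp, C)) / 2

same = clustering_similarity([[1, 2], [3]], [[1, 2], [3]])  # identical -> 1.0
```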
- 63. Measure of Clustering Accuracy. Accuracy is measured against manually labeled data: we manually assign tuples to clusters according to their properties (e.g., professors in different research areas). Accuracy of clustering: the percentage of pairs of tuples in the same cluster that share a common label. Because this measure favors many small clusters, we let each approach generate the same number of clusters.
- 64. CS Dept Dataset. (chart: clustering accuracy on the CS Dept dataset for CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC, using the Group, Course, and Group+Course feature sets.) Ground-truth clusters: (Theory) J. Erickson, S. Har-Peled, L. Pitt, E. Ramos, D. Roth, M. Viswanathan; (Graphics) J. Hart, M. Garland, Y. Yu; (Database) K. Chang, A. Doan, J. Han, M. Winslett, C. Zhai; (Numerical computing) M. Heath, T. Kerkhoven, E. de Sturler; (Networking & QoS) R. Kravets, M. Caccamo, J. Hou, L. Sha; (Artificial Intelligence) G. Dejong, M. Harandi, J. Ponce, L. Rendell; (Architecture) D. Padua, J. Torrellas, C. Zilles, S. Adve, M. Snir, D. Reed, V. Adve; (Operating Systems) D. Mickunas, R. Campbell, Y. Zhou
- 65. DBLP Dataset. (chart: clustering accuracy on DBLP for CrossClus K-Medoids, CrossClus K-Means, CrossClus Agglm, Baseline, PROCLUS, and RDBC, across feature combinations of conference, word, and coauthor.)
- 66. Scalability w.r.t. Data Size and Number of Relations. (figures: scalability charts.)
- 67. CrossClus: Summary. User guidance, even in a very simple form, plays an important role in multi-relational clustering. CrossClus finds pertinent features by computing similarities between features.
- 68. Multirelational Data Mining. Classification over multiple relations in databases; Clustering over multiple relations by user guidance; LinkClus: efficient clustering by exploring the power-law distribution; Distinct: distinguishing objects with identical names by link analysis; Mining across multiple heterogeneous data and information repositories; Summary
- 69. Link-Based Clustering: Motivation. (figure: authors Tom, Mike, Cathy, John, and Mary linked to proceedings sigmod03-05, vldb03-05, and aaai04-05, which are linked to the conferences sigmod, vldb, and aaai.) Questions: Q1: How do we cluster each type of objects? Q2: How do we define similarity between each type of objects?
- 70. Link-Based Similarities. Two objects are similar if they are linked with the same or similar objects. SimRank (Jeh & Widom, 2002): the similarity between two objects x and y is defined as the average similarity between the objects linked with x and those linked with y. (figure: Tom and Mary both link to sigmod03-05, as do Tom and Mike.) But it is expensive to compute: for a dataset of N objects and M links, it takes O(N^2) space and O(M^2) time to compute all similarities.
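A naive SimRank iteration can be sketched as below, treating links as an undirected neighbor map over all objects; this illustrates exactly the O(N^2) space and pairwise-sum cost the slide complains about (the decay constant C = 0.8 and the iteration count are illustrative choices):

```python
def simrank(links, C=0.8, iters=5):
    """Naive SimRank (Jeh & Widom 2002). links: node -> set of neighbors."""
    nodes = sorted(links)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif links[a] and links[b]:
                    total = sum(sim[(x, y)] for x in links[a] for y in links[b])
                    new[(a, b)] = C * total / (len(links[a]) * len(links[b]))
                else:
                    new[(a, b)] = 0.0
        sim = new
    return sim

# Tom and Mike are similar because both link to the same proceedings.
res = simrank({'Tom': {'sigmod03'}, 'Mike': {'sigmod03'},
               'sigmod03': {'Tom', 'Mike'}})
```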
- 71. Observation 1: Hierarchical Structures. Hierarchical structures often exist naturally among objects (e.g., a taxonomy of animals). (figures: a hierarchical structure of products in Walmart, from all products down through grocery / electronics / apparel to TV / DVD / camera; and the relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004).)
- 72. Observation 2: Distribution of Similarity. (chart: the distribution of SimRank similarities among DBLP authors.) A power-law distribution exists in similarities: 56% of similarity entries are in [0.005, 0.015], and only 1.4% of similarity entries are larger than 0.1. Our goal: design a data structure that stores the significant similarities and compresses the insignificant ones.
- 73. Our Data Structure: SimTree. Each leaf node represents an object; each non-leaf node represents a group of similar lower-level nodes; similarities between siblings are stored. (figure: a SimTree over digital cameras, TVs, consumer electronics, and apparel, with leaves such as the Canon A40 and Sony V3 digital cameras.)
- 74. Similarity Defined by SimTree. Path-based node similarity between leaves, e.g., sim_p(n7, n8) = s(n7, n4) × s(n4, n5) × s(n5, n8), using the stored sibling similarities along the path. The similarity between two nodes is the average similarity between the objects linked with them in other SimTrees. Adjustment ratio for a node x = (average similarity between x and all other nodes) / (average similarity between x's parent and all other nodes). (figure: a three-level SimTree with stored sibling similarities, including s(n1, n2) = 0.2.)
- 75. Overview of LinkClus Initialize a SimTree for objects of each type Repeat For each SimTree, update the similarities between its nodes using similarities in other SimTrees Similarity between two nodes x and y is the average similarity between objects linked with them Adjust the structure of each SimTree Assign each node to the parent node that it is most similar to04/06/12 Data Mining: Principles and Algorithms 75
- 76. Initialization of SimTrees Initializing a SimTree Repeatedly find groups of tightly related nodes, which are merged into a higher-level node Tightness of a group of nodes For a group of nodes {n , …, n }, its tightness is 1 k defined as the number of leaf nodes in other SimTrees that are connected to all of {n1, …, nk} Nodes Leaf nodes in another SimTree n1 1 2 The tightness of {n1, n2} is 3 3 n2 4 504/06/12 Data Mining: Principles and Algorithms 76
- 77. (continued) Finding tight groups is reduced to frequent pattern mining: the tightness of a group of nodes is the support of a frequent pattern. Example transactions (one per leaf node in the other SimTree): 1: {n1}; 2: {n1, n2}; 3: {n2}; 4: {n1, n2}; 5: {n1, n2}; 6: {n2, n3, n4}; 7: {n4}; 8: {n3, n4}; 9: {n3, n4}. Groups found: g1 = {n1, n2}, g2 = {n3, n4}. Procedure of initializing a tree: start from the leaf nodes (level 0); at each level l, find non-overlapping groups of similar nodes with frequent pattern mining.
- 78. Updating Similarities Between Nodes The initial similarities can seldom capture the relationships between objects, so similarities are updated iteratively: the similarity between two nodes is the average similarity between the objects linked with them. [Figure: two SimTrees ST1 (leaves a-z) and ST2 (leaves 10-24); sim(na, nb) is the average similarity between the ST2 leaves linked with na and nb, which takes O(3 x 2) pairwise computations in the example]
- 79. Aggregation-Based Similarity Computation [Figure: in ST2, siblings n4 and n5 with s(n4, n5) = 0.2; leaves n10, n11, n12 under n4 with similarities 0.9, 1.0, 0.8 and leaves n13, n14 under n5 with similarities 0.9, 1.0, linked with objects a and b in ST1] For each node nk in {n10, n11, n12} and nl in {n13, n14}, the path-based similarity is simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl). Averaging over all pairs therefore factorizes: sim(na, nb) = [Σ(k=10..12) s(nk, n4) / 3] · s(n4, n5) · [Σ(l=13..14) s(nl, n5) / 2] = 0.171, which takes O(3 + 2) time. After aggregation, the quadratic-time computation is reduced to linear time.
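The factorization can be verified numerically with the figure's values: averaging the path-based similarity over all 3 x 2 leaf pairs gives exactly the product of the two per-parent averages. A small sketch:

```python
# Aggregation-based similarity: the average over all leaf pairs factorizes
# into per-parent averages. Similarity values are taken from the figure.
s_to_n4 = [0.9, 1.0, 0.8]   # s(n10,n4), s(n11,n4), s(n12,n4)
s_to_n5 = [0.9, 1.0]        # s(n13,n5), s(n14,n5)
s45 = 0.2                   # stored sibling similarity s(n4, n5)

# Naive O(|A| * |B|) computation: average over all leaf pairs.
naive = sum(a * s45 * b for a in s_to_n4 for b in s_to_n5) \
        / (len(s_to_n4) * len(s_to_n5))

# Aggregated O(|A| + |B|) computation: product of the two averages.
aggregated = (sum(s_to_n4) / len(s_to_n4)) * s45 * (sum(s_to_n5) / len(s_to_n5))

print(round(naive, 3), round(aggregated, 3))  # 0.171 0.171
```

Both routes give 0.9 x 0.2 x 0.95 = 0.171, matching the slide.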
- 80. Computing Similarity with Aggregation The aggregated similarity records the average similarity and the total weight, e.g. a: (0.9, 3) and b: (0.95, 2). Then sim(na, nb) can be computed from the aggregated similarities: sim(na, nb) = avg_sim(na, n4) x s(n4, n5) x avg_sim(nb, n5) = 0.9 x 0.2 x 0.95 = 0.171. To compute sim(na, nb): (1) find all pairs of sibling nodes ni and nj such that na is linked with ni and nb with nj; (2) calculate the similarity (and weight) between na and nb w.r.t. ni and nj; (3) calculate the weighted average similarity between na and nb w.r.t. all such pairs.
- 81. Adjusting SimTree Structures After the similarities change, the tree structure also needs to change: if a node is more similar to its parent's sibling, move it to be a child of that sibling. Try to move each node to the parent's sibling it is most similar to, under the constraint that each parent node can have at most c children. [Figure: node n7 is moved from its parent n4 to n4's sibling n5, to which it has become more similar]
- 82. Complexity For two types of objects, N objects of each type and M linkages between them:
Updating similarities: time O(M(logN)^2), space O(M+N)
Adjusting tree structures: time O(N), space O(N)
LinkClus (total): time O(M(logN)^2), space O(M+N)
SimRank: time O(M^2), space O(N^2)
- 83. Empirical Study Generating clusters using a SimTree: suppose K clusters are to be generated; find the level in the SimTree whose number of nodes is closest to K, then merge the most similar nodes or divide the largest nodes on that level to get exactly K clusters. Accuracy is measured against manually labeled data: the accuracy of a clustering is the percentage of pairs of objects in the same cluster that share a common label. Efficiency and scalability are measured w.r.t. the number of objects, clusters, and linkages.
- 84. Experiment Setup DBLP dataset: the 4170 most productive authors, and 154 well-known conferences with the most proceedings. The research areas of the 400 most productive authors were manually labeled according to their home pages (or publications), and the areas of the 154 conferences according to their calls for papers. Approaches compared: SimRank (Jeh & Widom, KDD 2002), which computes pair-wise similarities; SimRank with FingerPrints (F-SimRank; Fogaras & Rácz, WWW 2005), which pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity; and ReCom (Wang et al., SIGIR 2003), which iteratively clusters objects using the cluster labels of linked objects.
- 85. Accuracy [Figure: accuracy vs. number of iterations (1-19) on conferences and on authors, comparing LinkClus, SimRank, ReCom, and F-SimRank]
Approach / Accr-Author / Accr-Conf / average time (sec):
LinkClus: 0.957 / 0.723 / 76.7
SimRank: 0.958 / 0.760 / 1020
ReCom: 0.907 / 0.457 / 43.1
F-SimRank: 0.908 / 0.583 / 83.6
- 86. (continued) [Figure: accuracy vs. running time (10-100000 sec, log scale) for LinkClus, SimRank, ReCom, F-SimRank, and P-SimRank on both datasets] LinkClus is almost as accurate as SimRank (the most accurate method), and is much more efficient.
- 87. Email Dataset F. Nielsen. Email dataset. http://www.imm.dtu.dk/∼rem/data/Email-1431.zip — 370 emails on conferences, 272 on jobs, and 789 spam emails.
Approach / Accuracy / Total time (sec):
LinkClus: 0.8026 / 1579.6
SimRank: 0.7965 / 39160
ReCom: 0.5711 / 74.6
F-SimRank: 0.3688 / 479.7
CLARANS: 0.4768 / 8.55
- 88. Scalability (1) Tested on synthetic datasets with randomly generated clusters. Scalability w.r.t. the number of objects, with the number of clusters fixed at 40. [Figure: running time (log scale) and accuracy vs. number of objects per relation (1000-5000) for LinkClus, SimRank, ReCom, and F-SimRank, with reference curves O(N), O(N(logN)^2), and O(N^2)]
- 89. Scalability (2) Scalability w.r.t. the number of objects and clusters, with each cluster of fixed size (100 objects). [Figure: running time (log scale) and accuracy vs. number of objects per relation (500-20000) for LinkClus, SimRank, ReCom, and F-SimRank, with reference curves O(N), O(N(logN)^2), and O(N^2)]
- 90. Scalability (3) Scalability w.r.t. the number of linkages from each object. [Figure: running time (log scale) and accuracy vs. selectivity (5-25) for LinkClus, SimRank, ReCom, and F-SimRank, with reference curves O(S) and O(S^2)]
- 91. Multirelational Data Mining Classification over multiple relations in databases Clustering over multiple relations by user guidance LinkClus: Efficient clustering by exploring the power law distribution DISTINCT: Distinguishing objects with identical names by link analysis Mining across multiple heterogeneous data and information repositories Summary
- 92. People/Objects Do Share Names Why distinguish objects with identical names? Different objects may share the same name: in AllMusic.com, 72 songs and 3 albums are named "Forgotten" or "The Forgotten"; in DBLP, 141 papers are written by at least 14 different authors named "Wei Wang". How can the authors of those 141 papers be distinguished?
- 93. [Figure: publication records of four distinct authors who all share the name "Wei Wang", shown as four linked groups of papers] (1) Wei Wang at UNC: Wei Wang, Jiong Yang, Richard Muntz, VLDB 1997; Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu, SIGMOD 2002; Jiong Yang, Hwanjo Yu, Wei Wang, Jiawei Han, CSB 2003; Jiong Yang, Jinze Liu, Wei Wang, KDD 2004; Jinze Liu, Wei Wang, ICDM 2004. (2) Wei Wang at UNSW, Australia: Wei Wang, Haifeng Jiang, Hongjun Lu, Jeffrey Yu, VLDB 2004; Hongjun Lu, Yidong Yuan, Wei Wang, Xuemin Lin, ICDE 2005; Wei Wang, Xuemin Lin, ADMA 2005. (3) Wei Wang at Fudan Univ., China: Wei Wang, Jian Pei, Jiawei Han, CIKM 2002; Haixun Wang, Wei Wang, Baile Shi, Peng Wang, ICDM 2005; Yongtai Zhu, Wei Wang, Jian Pei, Baile Shi, Chen Wang, KDD 2004. (4) Wei Wang at SUNY Buffalo: Aidong Zhang, Yuqing Song, Wei Wang, WWW 2003. Also shown as linked papers: Jian Pei, Jiawei Han, Hongjun Lu, et al., ICDM 2001; Jian Pei, Daxin Jiang, ICDE 2005.
- 94. Challenges of Object Distinction The problem is related to duplicate detection, but textual similarity cannot be used: different references appear in different contexts (e.g., different papers) and thus seldom share common attributes, and each reference is associated with only limited information. We need to carefully design an approach that uses all the information we have.
- 95. Overview of DISTINCT Measure the similarity between references using two sources of evidence: (a) linkages between references — as shown by the self-loop property, references to the same object are more likely to be connected; and (b) the neighbor tuples of each reference, which can indicate the similarity between their contexts. Then cluster the references: group references according to their similarities.
- 96. Similarity 1: Link-based Similarity Link-based similarity indicates the overall strength of the connections between two references. We use the random walk probability between the two tuples containing the references. Random walk probabilities along different join paths are handled separately, because different join paths have different semantic meanings. Only join paths of length at most 2L are considered (L is the number of steps of propagating probabilities).
- 97. Example of Random Walk [Figure: join graph around the paper vldb/wangym97, "STING: A Statistical Information Grid Approach to Spatial Data Mining". Publish tuples link it to authors Wei Wang, Jiong Yang, and Richard Muntz with random walk probability 0.5 each from the author side; Publications links it with probability 1.0 to the proceedings vldb/vldb97, Very Large Data Bases 1997, Athens, Greece]
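Such a walk can be sketched over a toy join graph. The graph below is a simplified, partly hypothetical rendering of the figure, and probability is split uniformly over out-neighbors rather than using DISTINCT's exact per-join-path weights:

```python
# Propagating random-walk probability from a reference through a toy join
# graph (paper -> authors / proceedings). Uniform transitions are an
# illustrative assumption, not DISTINCT's actual edge weights.
graph = {
    "vldb/wangym97": ["Wei Wang", "Jiong Yang", "Richard Muntz", "vldb/vldb97"],
    "vldb/vldb97": ["vldb/wangym97"],
}

def walk_probabilities(start, steps):
    """Distribute probability mass uniformly over out-neighbors at each step."""
    probs = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, p in probs.items():
            neighbors = graph.get(node, [])
            for nb in neighbors:
                nxt[nb] = nxt.get(nb, 0.0) + p / len(neighbors)
        probs = nxt
    return probs

print(walk_probabilities("vldb/wangym97", 1))
# each of the four neighbors receives probability 0.25
```

Running separate walks per join path (author path, proceedings path, etc.) and keeping their probabilities apart recovers the per-path treatment the previous slide describes.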
- 98. Similarity 2: Neighborhood Similarity Find the neighbor tuples of each reference: the tuples within L joins. Different neighbor tuples have different connections to a reference, so each neighbor tuple is assigned a weight: the probability of walking from the reference to that tuple. The similarity is then the set resemblance between the two weighted sets of neighbor tuples.
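One common way to realize weighted set resemblance is a weighted Jaccard coefficient; the sketch below uses that form, with hypothetical neighbor tuples and walk weights:

```python
# Neighborhood similarity as weighted set resemblance (weighted Jaccard):
# sum of min weights over sum of max weights. Tuples t1..t4 and their
# random-walk weights are hypothetical.
nbr_a = {"t1": 0.5, "t2": 0.3, "t3": 0.2}   # neighbor tuple -> walk weight
nbr_b = {"t2": 0.4, "t3": 0.1, "t4": 0.5}

def set_resemblance(a, b):
    """Weighted resemblance of two weighted neighbor-tuple sets."""
    keys = set(a) | set(b)
    inter = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return inter / union

print(round(set_resemblance(nbr_a, nbr_b), 3))
```

References whose neighborhoods overlap heavily (shared co-authors, venues, etc.) score close to 1; disjoint neighborhoods score 0.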
