Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han

In today’s interconnected world, social and informational entities are linked to one another, forming gigantic, integrated social and information networks. When the data objects are structured into multiple types, these networks become semi-structured heterogeneous social and information networks. Most real-world applications that handle big data, including social media and social networks, medical information systems, online e-commerce systems, and database systems, can be structured into typed, heterogeneous social and information networks. For example, in a medical care network, objects of multiple types (patients, doctors, diseases, medications) and links such as visits, diagnoses, and treatments are intertwined, providing rich information and forming a heterogeneous information network. Effective analysis of large-scale heterogeneous social and information networks poses an interesting but critical challenge.

In this talk, we present a set of data mining scenarios in heterogeneous social and information networks and show that mining typed, heterogeneous networks is a new and promising frontier in data mining research. However, such mining raises serious challenges for scalable computation. We identify a set of problems in scalable computation and call for serious study of them. These include how to efficiently compute (1) meta path-based similarity search, (2) rank-based clustering, (3) rank-based classification, (4) meta path-based link/relationship prediction, and (5) topical hierarchies from heterogeneous information networks. We introduce some recent efforts, discuss the trade-offs between query-independent pre-computation and query-dependent online computation, and point out some promising research directions.

Published in: Technology, Education

    1. Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks
       Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign
       Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao
       Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing
       August 12, 2013
    2. Outline
       • Why Is Mining Heterogeneous Social and Info Networks Promising?
       • Homogeneous vs. Heterogeneous Social and Info. Networks
       • On the Power of Mining Structured, Heterogeneous Social and Info. Networks
       • Challenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information Networks
       • PathSim: Online, Query-Based Similarity Search
       • PathPredict: Query-Based Prediction Using Meta-Path
       • Efficient Hidden Network Discovery: A Scalability Challenge
       • Conclusions
    3. Where There Is Information, There Are Networks!
       • Social Networking Websites
       • Biological Network: Protein Interaction
       • Research Collaboration Network
       • Product Recommendation Network via Emails
    4. The Real World: Heterogeneous Networks
       • Multiple object types and/or multiple link types
       • DBLP Bibliographic Network: Venue, Paper, Author
       • The IMDB Movie Network: Actor, Movie, Director, Studio
       • The Facebook Network
       • Homogeneous networks are information-loss projections of heterogeneous networks!
       • Directly mining information-richer heterogeneous networks
    5. Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
       • DBLP: A computer science bibliographic database (>2 M papers, >0.7 M authors, >10 K venues), …
       • [Figure: a sample publication record in DBLP]
       • Power of het. network modeling: Treat Author, Venue, Term, and Paper all as first-class citizens!
    7. On the Power of Mining Structured, Heterogeneous Networks
       • Links carry a lot of hidden information in structured, heterogeneous social and information networks
       • Effectiveness of mining
         • Clustering in heterogeneous networks: rank-based clustering (RankClus [EDBT’09] and NetClus [KDD’09]) and user-guided, meta-path-based clustering [KDD’12]
         • Knowledge propagation through heterogeneous links (GNetMine [ECMLPKDD’10]) and rank-based classification (RankClass [KDD’11])
         • Meta-path-based similarity search (PathSim [VLDB’11])
         • Meta-path-based prediction in heterogeneous networks (PathPredict [ASONAM’11])
    8. RankClus: Integrated Clustering and Ranking
       • Highly ranked objects are more important (i.e., more heavily weighted) in a cluster than weakly ranked ones
       • Ranking makes more sense within one cluster than across multiple clusters
       • Ranking, as a feature of the cluster, is conditional on a specific cluster
       • Clustering and ranking mutually enhance each other at each iteration
       • RankClus [EDBT’09]: An efficient, EM-like algorithm
       • [Figure labels: Sub-Network, Ranking, Clustering]
    9. … with Star Network Schema [KDD’09]
       • Beyond bi-typed information networks: A Star Network Schema
       • Split a network into different layers, each represented by a net-cluster
    10. NetClus: Database System Cluster
        • Term ranking: database 0.0995511; databases 0.0708818; system 0.0678563; data 0.0214893; query 0.0133316; systems 0.0110413; queries 0.0090603; management 0.00850744; object 0.00837766; relational 0.0081175; processing 0.00745875; based 0.00736599; distributed 0.0068367; xml 0.00664958; oriented 0.00589557; design 0.00527672; web 0.00509167; information 0.0050518; model 0.00499396; efficient 0.00465707
        • Author ranking: Surajit Chaudhuri 0.00678065; Michael Stonebraker 0.00616469; Michael J. Carey 0.00545769; C. Mohan 0.00528346; David J. DeWitt 0.00491615; Hector Garcia-Molina 0.00453497; H. V. Jagadish 0.00434289; David B. Lomet 0.00397865; Raghu Ramakrishnan 0.0039278; Philip A. Bernstein 0.00376314; Joseph M. Hellerstein 0.00372064; Jeffrey F. Naughton 0.00363698; Yannis E. Ioannidis 0.00359853; Jennifer Widom 0.00351929; Per-Ake Larson 0.00334911; Rakesh Agrawal 0.00328274; Dan Suciu 0.00309047; Michael J. Franklin 0.00304099; Umeshwar Dayal 0.00290143; Abraham Silberschatz 0.00278185
        • Venue ranking: VLDB 0.318495; SIGMOD Conf. 0.313903; ICDE 0.188746; PODS 0.107943; EDBT 0.0436849
        • Go one level deeper: Authors in the XML, XQuery cluster
    11. Rank-Based Clustering for Others
        • RankCompete: Organize your photo album automatically!
        • Rank treatments for AIDS from MEDLINE
    12. Classification in Heterogeneous Networks
        • GNetMine [ECMLPKDD’10]: Knowledge propagation across heterogeneous links
        • RankClass [KDD’11]: Integration of ranking and classification in heterogeneous network analysis
          • Highly ranked objects play a greater role in classification
          • An object can only be ranked high in some focused classes
          • Class membership and ranking are statistical distributions
          • Let ranking and classification mutually enhance each other!
          • Output: Classification results + ranking list of objects within each class
    13. Experiments with Very Small Training Set
        • DBLP: four-field data set (DB, DM, AI, IR) forming a heterog. info. network
        • Rank objects within each class (with extremely limited label information)
        • Obtain high classification accuracy and excellent rankings within each class

                                   Database   Data Mining      AI          IR
        Top-5 ranked conferences   VLDB       KDD              IJCAI       SIGIR
                                   SIGMOD     SDM              AAAI        ECIR
                                   ICDE       ICDM             ICML        CIKM
                                   PODS       PKDD             CVPR        WWW
                                   EDBT       PAKDD            ECML        WSDM
        Top-5 ranked terms         data       mining           learning    retrieval
                                   database   data             knowledge   information
                                   query      clustering       reasoning   web
                                   system     classification   logic       search
                                   xml        frequent         cognition   text
    14. Similarity Search: Find Similar Objects in Networks
        • Who are most similar to Christos Faloutsos?
        • Meta-path: Meta-level description of a path between two objects
          • Author-Paper-Author (APA): Christos’s students or close collaborators
          • Author-Paper-Venue-Paper-Author (APVPA): Similar reputation at similar venues
        • Different meta-paths lead to very different results!
        • Different meta-paths carry rather different semantics
        • [Figure: Schema of the DBLP Network]
    15. Which Similarity Measure Is Better?
        • Anhai Doan: CS, Wisconsin; database area; PhD: 2002
        • Meta-path: Author-Paper-Venue-Paper-Author (APVPA)
        • PathSim [VLDB’11]:
          • Jignesh Patel: CS, Wisconsin; database area; PhD: 1998
          • Amol Deshpande: CS, Maryland; database area; PhD: 2004
          • Jun Yang: CS, Duke; database area; PhD: 2001
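
To make the comparison concrete, here is a minimal sketch of PathSim under the APVPA meta-path, using the definition from the PathSim paper [VLDB’11]: s(x, y) = 2·M[x][y] / (M[x][x] + M[y][y]), where M is the commuting matrix of the meta-path. The author names and per-venue paper counts are hypothetical toy data, not taken from DBLP.

```python
def dot(u, v):
    """Inner product of two equal-length count vectors."""
    return sum(a * b for a, b in zip(u, v))


def pathsim(av, x, y):
    """PathSim of authors x and y under APVPA.

    av maps each author to a vector of paper counts per venue, so the
    commuting matrix entry M[x][y] equals dot(av[x], av[y]).
    """
    denom = dot(av[x], av[x]) + dot(av[y], av[y])
    return 2 * dot(av[x], av[y]) / denom if denom else 0.0


# Hypothetical toy data: papers per venue (two venues) for three authors.
AV = {
    "ann": [2, 1],    # few papers
    "bob": [50, 20],  # very prolific in the same venues
    "cat": [2, 0],    # few papers, venue profile close to ann's
}

# Raw APVPA path counts from ann to each other author.
path_count = {name: dot(AV["ann"], AV[name]) for name in ("bob", "cat")}
# PathSim scores from ann to each other author.
similarity = {name: pathsim(AV, "ann", name) for name in ("bob", "cat")}

print(path_count)
print(similarity)
```

On this toy data the raw path count ranks the prolific author first, while PathSim ranks the author with comparable visibility first, which is the "peer" behavior the slide illustrates.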
    16. PathPredict: Meta-Path Based New Co-author Relationship Prediction in DBLP [ASONAM’11]
        • Co-authorship prediction: Whether two authors are going to collaborate for the first time
        • Co-authorship encoded in meta-path: Author-Paper-Author (A-P-A)
        • Topological features encoded in meta-paths
        • [Table: Meta-paths between authors under length 4; columns: Meta-Path, Semantic Meaning; rows not preserved]
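
A minimal sketch of how meta-path features for candidate author pairs could be extracted, assuming plain path counts as the feature measure. The two meta-paths used (A-P-A-P-A and A-P-V-P-A) are illustrative picks, since the slide's table rows are not preserved, and the tiny network is hypothetical.

```python
from itertools import combinations


def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Transpose a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical toy network: 4 authors, 4 papers, 2 venues.
AP = [
    [1, 1, 0, 0],  # author 0 wrote papers 0 and 1
    [0, 1, 1, 0],  # author 1 wrote papers 1 and 2
    [0, 0, 1, 1],  # author 2 wrote papers 2 and 3
    [0, 0, 0, 1],  # author 3 wrote paper 3
]
PV = [[1, 0], [1, 0], [0, 1], [0, 1]]  # papers 0-1 in venue 0, papers 2-3 in venue 1

APA = matmul(AP, transpose(AP))    # co-authorship counts (A-P-A)
APAPA = matmul(APA, APA)           # path counts through an intermediate author
AV = matmul(AP, PV)                # papers per venue for each author
APVPA = matmul(AV, transpose(AV))  # path counts through a shared venue

# Candidate pairs are authors who have not co-authored yet.
features = {
    (x, y): {"A-P-A-P-A": APAPA[x][y], "A-P-V-P-A": APVPA[x][y]}
    for x, y in combinations(range(len(AP)), 2)
    if APA[x][y] == 0
}

for pair, feats in features.items():
    print(pair, feats)
```

Feature vectors like these, paired with labels for whether the two authors later collaborate, are the kind of input a logistic regression model would be trained on.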
    17. The Success of PathPredict: Exploring Meta-Paths
        • Explain the prediction power of each meta-path
          • Wald test for logistic regression
        • Higher prediction accuracy than using the projected homogeneous network
          • 11% higher in prediction accuracy
        • Citation prediction
          • The selected meta-paths could be rather different
        • Co-author prediction for Jian Pei: Only 42 among 4809 candidates are true first-time co-authors! (Features collected in [1996, 2002]; test period in [2003, 2009])
    19. Challenges on BigMine
        • Scalable mining of massive information networks: Necessity
          • Many such networks are gigantic: News, PubMed, …
          • DBLP is a small one: 2M papers and 0.8M authors, …
          • Meta-path: Potentially long chains of matrix multiplication over such networks
            • APVPA: AP × PV × VP × PA
          • Comparative analysis of multiple meta-paths is costly
        • Scalable mining of massive information networks: Possibility
          • Many functions do not need to compute eigenvalues
          • Top-k computation may save computation cost substantially
          • Precomputation may save online computation substantially
          • Clustering-based precomputation:
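
The chain AP × PV × VP × PA can be sketched on a toy network as follows. The matrices are hypothetical, and dense Python lists stand in for the sparse matrices a real system would need. The second computation relies on the fact that APVPA is a symmetric meta-path, so its commuting matrix equals AV × AVᵀ and only the half-path AV has to be materialized.

```python
def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Transpose a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical toy network: 3 authors, 4 papers, 2 venues.
AP = [
    [1, 1, 0, 0],  # author 0 wrote papers 0 and 1
    [0, 1, 1, 0],  # author 1 wrote papers 1 and 2
    [0, 0, 0, 1],  # author 2 wrote paper 3
]
PV = [[1, 0], [1, 0], [0, 1], [0, 1]]  # papers 0-1 in venue 0, papers 2-3 in venue 1

# Full chain, as written on the slide: AP x PV x VP x PA.
VP = transpose(PV)
PA = transpose(AP)
naive = matmul(matmul(matmul(AP, PV), VP), PA)

# Symmetric shortcut: materialize the half-path AV once, then take AV x AV^T.
AV = matmul(AP, PV)
M = matmul(AV, transpose(AV))

print(M)
```

Both computations give the same author-by-author path-count matrix; the shortcut replaces three multiplications with two and avoids the intermediate author-by-paper product.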
    20. Computing Eigenvalues: When Do We Need It?
        • Computations needed
          • Clustering (RankClus), classification (RankClass), similarity search (PathSim), prediction (PathPredict)
          • A small number of iterative processing steps (e.g., EM-style)
          • Meta-path-based prediction: Selection from a set of “parallel” meta-paths
    21. Long Meta-Paths May Not Carry the Right Semantics
        • Repeat the meta-path 2, 4, and infinite times for a conference similarity query
    22. Top-K Computation Is What We Need
        • Similarity search: “Who are similar to Christos?”
          • There is no need/interest to calculate and rank the remaining 0.8M authors
          • Only top-k (e.g., top-100) authors are needed in practice
          • Lots of optimizations can be explored for top-k computation
        • Precomputation vs. online computation
          • Precomputation of long meta-paths will save costly online multi-matrix multiplication
        • Clustering-based precomputation
          • Example: top-k similar authors
          • Precomputation by clustering: only computing rather similar author groups
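
As a small illustration of the top-k idea, the sketch below selects the k most similar authors with a bounded heap instead of sorting every candidate. It still scores every author, so it only removes the full ranking step; skipping the scoring itself needs pruning. The author names and venue-count vectors are hypothetical, and PathSim under APVPA is assumed as the similarity measure.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length count vectors."""
    return sum(a * b for a, b in zip(u, v))


def pathsim(u, v):
    """PathSim of two authors given their papers-per-venue vectors (APVPA)."""
    denom = dot(u, u) + dot(v, v)
    return 2 * dot(u, v) / denom if denom else 0.0


def top_k_similar(av, query, k):
    """Return the k authors most similar to `query` as (score, name) pairs.

    heapq.nlargest keeps only k candidates at a time, so the full
    candidate list is never sorted.
    """
    scored = (
        (pathsim(av[query], vec), name)
        for name, vec in av.items()
        if name != query
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy data: papers per venue (two venues) for six authors.
AV = {
    "q": [4, 2],
    "a": [4, 1],
    "b": [1, 4],
    "c": [40, 20],
    "d": [3, 2],
    "e": [0, 1],
}

top2 = top_k_similar(AV, "q", 2)
print(top2)
```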
    23. Co-Clustering-Based Pruning Algorithm
        • General idea:
          • Store commuting matrices for short path schemas and compute top-k queries online
        • Framework
          • Generate co-clusters for materialized commuting matrices, for feature objects and target objects
          • Derive upper bounds for the similarity between an object and a target cluster, and between an object and an object
          • Safely prune target clusters and objects if the upper-bound similarity is lower than the current threshold
          • Dynamically update the top-k threshold
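
The pruning framework above can be sketched in a heavily simplified form. This is not the co-clustering procedure from the PathSim paper: the clusters here are assigned by hand, and the cluster-level upper bound is one derived directly from the PathSim formula under the assumption that all counts are nonnegative (the numerator is bounded using the cluster's element-wise maximum vector, the denominator using its smallest self-path count). All names and data are hypothetical.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length count vectors."""
    return sum(a * b for a, b in zip(u, v))


def pathsim(u, v):
    """PathSim of two objects given their nonnegative feature-count vectors."""
    denom = dot(u, u) + dot(v, v)
    return 2 * dot(u, v) / denom if denom else 0.0


def cluster_summary(vectors):
    """Precomputable summary: element-wise max vector and smallest self-path count."""
    cmax = [max(col) for col in zip(*vectors)]
    min_self = min(dot(v, v) for v in vectors)
    return cmax, min_self


def upper_bound(q, summary):
    """Upper bound on pathsim(q, y) for every y in the summarized cluster."""
    cmax, min_self = summary
    denom = dot(q, q) + min_self
    return 2 * dot(q, cmax) / denom if denom else 0.0


def topk_pruned(q, clusters, k):
    """Return (top-k (score, name) pairs, number of clusters skipped)."""
    summaries = {cid: cluster_summary(list(m.values())) for cid, m in clusters.items()}
    bounds = {cid: upper_bound(q, s) for cid, s in summaries.items()}
    heap = []  # min-heap; heap[0][0] is the current top-k threshold
    skipped = 0
    # Visit the most promising clusters first so the threshold rises quickly.
    for cid in sorted(clusters, key=bounds.get, reverse=True):
        if len(heap) == k and bounds[cid] < heap[0][0]:
            skipped += 1  # no member can enter the top-k
            continue
        for name, vec in clusters[cid].items():
            score = pathsim(q, vec)
            if len(heap) < k:
                heapq.heappush(heap, (score, name))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, name))  # threshold update
    return sorted(heap, reverse=True), skipped


# Hypothetical query vector and hand-made target clusters.
q = [4, 2]
clusters = {
    "near": {"a": [4, 1], "d": [3, 2]},
    "far": {"b": [0, 1], "e": [1, 0]},
    "big": {"c": [40, 20], "f": [30, 25]},
}

result, skipped = topk_pruned(q, clusters, 2)
print(result, skipped)
```

On this toy input only the first cluster is scored exhaustively; the other two are skipped because their bounds fall below the threshold set by the first. The bound is only useful when clusters group objects with similar vectors, which is what the clustering step is meant to provide.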
    24. Similarity Search: Experiments on Efficiency
        • Searching for top-20 objects vs. 1001st-1020th objects: PathSim-pruning is more efficient than PathSim-baseline
          • The denser the corresponding commuting matrix, the more PathSim-pruning can improve
          • The more neighbors a query has, the more PathSim-pruning can improve
        • Then compare the efficiency under different values of k (k = 5, 10, 20) for PathSim-pruning using query set 1
          • A smaller k has stronger pruning power, and thus needs less execution time
    25. PathPredict: Exploring Big Data Space
        • Scalable computation in really huge heterogeneous networks?
          • Sampling may lead to a similar judgment on the importance of each meta-path
          • Query-dependent prediction can be “selective” and thus may not need as many resources
          • Precomputation and clustering may further enhance its efficiency
    26. Mining Query-Relevant “Hidden” Networks
        • Query-relevant hidden networks
          • What is the hidden network closely relevant to “SVM”?
          • The network should be a weighted network consisting of papers, terms, authors, and venues
          • Is “kernel machine” closely relevant to “SVM”? How could we know it?
        • It takes substantial computation to derive such a “weighted/ranked” hidden heterogeneous network
          • Due to the diversity of queries (e.g., SVM + Cloud + SIGMOD), it is impossible to precompute every possible combination
        • How can we compute such a hidden network efficiently on the fly?
          • An interesting open problem
    27. Conclusions
        • Heterogeneous social & information networks are ubiquitous
          • Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
          • Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
        • Surprisingly rich knowledge can be mined from structured heterogeneous info. networks
          • Clustering, ranking, classification, path prediction, …
        • Knowledge is power, but knowledge is hidden in massive but “relatively structured” nodes and links!
        • Challenge to BigMine: How to mine massive, heterogeneous information networks efficiently
          • Some progress/tricks on scalability and efficiency
          • Many open problems and much more to be explored!
    28. From Data Mining to Mining Info. Networks
        • Han, Kamber and Pei, Data Mining, 3rd ed. 2011
        • Yu, Han and Faloutsos (eds.), Link Mining, 2010
        • Sun and Han, Mining Heterogeneous Information Networks, 2012
    29. References
        • M. Ji, J. Han, and M. Danilevsky, “Ranking-Based Classification of Heterogeneous Information Networks”, KDD’11
        • Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012
        • Y. Sun, J. Han, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT’09
        • Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD’09
        • Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB’11
        • Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM’11
        • Y. Sun, J. Han, C. C. Aggarwal, and N. Chawla, “When Will It Happen? Relationship Prediction in Heterogeneous Information Networks”, WSDM’12
        • F. Tao, et al., “EventCube: Multi-Dimensional Search and Mining of Structured and Text Data” (system demo), KDD’13
        • C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks”, KDD’10
        • C. Wang, M. Danilevsky, et al., “A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy”, KDD’13
