A survey of heterogeneous information network analysis


A Survey of Heterogeneous Information Network Analysis
Chuan Shi, Member, IEEE,
Yitong Li, Jiawei Zhang, Yizhou Sun, Member, IEEE,
and Philip S. Yu, Fellow, IEEE

Published in: Data & Analytics

  1. 1. A Survey of Heterogeneous Information Network Analysis Chuan Shi, Member, IEEE, Yitong Li, Jiawei Zhang, Yizhou Sun, Member, IEEE, and Philip S. Yu, Fellow, IEEE IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015
  2. 2. Introduction
  3. 3. Introduction An information network consists of interacting components that together form an interconnected network. Information network analysis has been a hot research topic in data mining and information retrieval over the past decades. Most information network analyses rest on a basic assumption: there is only one type of object and one type of link -> Homogeneous information network
  4. 4. Introduction However, most real systems consist of a large number of interacting, multi-typed components, which can be modeled as a heterogeneous information network (HIN). Compared with a homogeneous information network, an HIN can fuse more information and carries richer semantics, so it opens a new direction in data mining. The paper presents a survey of heterogeneous information networks and their analysis.
  5. 5. Basic concepts and Definitions
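  6. 6. Basic definitions(1/4) •Def 1. Information network A directed graph $G = (V, E)$ with an object type mapping function $\varphi: V \to \mathcal{A}$ and a link type mapping function $\psi: E \to \mathcal{R}$. Each object $v \in V$ belongs to one object type in the object type set $\mathcal{A}$: $\varphi(v) \in \mathcal{A}$. Each link $e \in E$ belongs to one relation type in the link type set $\mathcal{R}$: $\psi(e) \in \mathcal{R}$.
  7. 7. Basic definitions(2/4) •Def 2. Hetero/Homogeneous information network An information network is a heterogeneous information network if the number of object types $|\mathcal{A}| > 1$ or the number of relation types $|\mathcal{R}| > 1$. Otherwise, it is a homogeneous information network.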
  8. 8. Basic definitions(2/4)
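  9. 9. Basic definitions(3/4) •Def 3. Network schema A meta template for an information network G=(V,E), denoted $T_G = (\mathcal{A}, \mathcal{R})$: a directed graph over the object type set $\mathcal{A}$ whose edges are relations from the link type set $\mathcal{R}$. The network schema of a heterogeneous information network specifies type constraints on the sets of objects and on the relationships among the objects. ※ Network instance: an information network that follows a network schema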
  10. 10. Basic definitions(3/4)
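
Definitions 1 to 3 can be illustrated with a minimal Python sketch. The class name `InformationNetwork`, its methods, and the toy bibliographic objects are assumptions made for illustration and do not come from the slides or the paper. As a simplification, the sketch stores at most one link per ordered pair of objects and derives the schema from the type triples that actually occur in the instance.

```python
from dataclasses import dataclass, field


@dataclass
class InformationNetwork:
    """Directed graph G = (V, E) with type mappings for objects and links."""
    object_type: dict = field(default_factory=dict)  # phi: object -> object type
    link_type: dict = field(default_factory=dict)    # psi: (source, target) -> link type

    def add_object(self, obj, obj_type):
        self.object_type[obj] = obj_type

    def add_link(self, source, target, rel_type):
        # Both endpoints must already be typed objects of the network.
        if source not in self.object_type or target not in self.object_type:
            raise KeyError("both endpoints must be added before linking")
        self.link_type[(source, target)] = rel_type

    def is_heterogeneous(self):
        # Def 2: heterogeneous if |A| > 1 or |R| > 1.
        object_types = set(self.object_type.values())
        relation_types = set(self.link_type.values())
        return len(object_types) > 1 or len(relation_types) > 1

    def schema(self):
        # Def 3 (sketch): the (source type, relation, target type) triples
        # that occur in this network instance.
        return {
            (self.object_type[s], r, self.object_type[t])
            for (s, t), r in self.link_type.items()
        }


# Toy bibliographic network: three object types, two relation types.
biblio = InformationNetwork()
biblio.add_object("a1", "Author")
biblio.add_object("p1", "Paper")
biblio.add_object("v1", "Venue")
biblio.add_link("a1", "p1", "writes")
biblio.add_link("p1", "v1", "published_in")

# Toy co-author network: one object type, one relation type.
coauthor = InformationNetwork()
coauthor.add_object("a1", "Author")
coauthor.add_object("a2", "Author")
coauthor.add_link("a1", "a2", "coauthor")
```

Here `biblio.is_heterogeneous()` is true because more than one object type is present, while `coauthor` is homogeneous.
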
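  11. 11. Basic definitions(4/4) •Def 4. Meta path A meta path $P$ is a path defined on a schema, denoted in the form $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between objects $A_1, A_2, \ldots, A_{l+1}$, where $\circ$ denotes the composition operator on relations.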
  12. 12. Basic definitions(4/4)
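
The composition of relations along a meta path can be sketched by counting path instances. The example below is a hedged illustration, not the paper's implementation: the helper names `compose` and `inverse`, the relation names, and the toy data are all assumptions. Each relation is stored as a nested dictionary `{source: {target: path_count}}`, and composing two relations multiplies and sums the counts, which is the same computation as a sparse matrix product.

```python
def compose(left, right):
    """Compose two relations given as {source: {target: path_count}}."""
    result = {}
    for source, mids in left.items():
        for mid, n_left in mids.items():
            for target, n_right in right.get(mid, {}).items():
                row = result.setdefault(source, {})
                row[target] = row.get(target, 0) + n_left * n_right
    return result


def inverse(relation):
    """Reverse the direction of a relation, keeping the path counts."""
    result = {}
    for source, targets in relation.items():
        for target, n in targets.items():
            result.setdefault(target, {})[source] = n
    return result


# Hypothetical toy data: Author -writes-> Paper -published_in-> Venue.
writes = {"alice": {"p1": 1, "p2": 1}, "bob": {"p2": 1}, "carol": {"p3": 1}}
published_in = {"p1": {"KDD": 1}, "p2": {"KDD": 1}, "p3": {"VLDB": 1}}

# Meta path APA: Author -> Paper -> Author (co-authorship).
apa = compose(writes, inverse(writes))
# Meta path APV: Author -> Paper -> Venue.
apv = compose(writes, published_in)
# Meta path APVPA: authors who publish in the same venue.
apvpa = compose(apv, inverse(apv))
```

In this toy data, `alice` and `bob` are connected under APA through the shared paper `p2`, while `carol` is connected to neither of them under APA or APVPA.
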
  13. 13. Comparisons with related concepts • HIN ⊃ Homogeneous network • HIN ⊃ Multi-relational network • HIN ⊃ Multi-dimensional/mode network • HIN ⊃ Composite network • HIN ≒ Complex network
  14. 14. Example datasets Three types of data from which an HIN can be constructed 1. Structured data a. database tables organized with the entity-relation model b. ex) bibliographic data 2. Semi-structured data a. XML format data b. object -> attribute -> object c. relation -> connections among attributes 3. Unstructured data a. Any data that has recognizable entities and extractable relations
  15. 15. Example datasets Widely used HIN examples 1. Multi-relational network with single typed object a. Object type = 1 b. Relation type >1 c. ex) Facebook, Twitter
  16. 16. Example datasets Widely used HIN examples 2. Bipartite network a. Object type = 2 b. Relation type > 1 c. ex) User-item, Document-word d. k-partite graph can be constructed
  17. 17. Example datasets Widely used HIN examples 3. Star-schema network a. HIN that uses the target object as a hub node b. ex) Bibliographic information network, Movie, Patent data
  18. 18. Example datasets Widely used HIN examples 4. Multiple-hub network a. Bioinformatics data
  19. 19. Example datasets Multiple HINs
  20. 20. Why Heterogeneous Information Network Analysis •It is a new development of data mining Big data analysis is an emerging and important task Many different types of objects are interconnected HIN can be an effective tool to deal with complex big data. •It is an effective tool to fuse more information We can fuse information across multiple social network platforms •It contains rich semantics Different-typed objects and links coexist and they carry different meanings, e.g. the meta paths APA (Author-Paper-Author), APVPA (Author-Paper-Venue-Paper-Author), APV (Author-Paper-Venue), etc.
  21. 21. Research Developments
  22. 22. Research Developments
  23. 23. Similarity measure ❏ Goal: consider both the structural similarity of two objects and the meta path connecting them (e.g. APA, APVPA, etc.) ❏ Path-based similarity measure ❏ The relevance of different-typed objects ❏ meta-path-based relevance search + user preference: different similarities according to meta paths (different semantic meanings), e.g. image-tag-image (based on common tags), image-tag-image-group-image-tag-image (further measured by shared groups) Sun, Yizhou, et al. "Pathsim: Meta path-based top-k similarity search in heterogeneous information networks." VLDB’11
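
As a worked illustration of a path-based similarity measure, the sketch below computes PathSim for the symmetric meta path APA. PathSim scores two objects x and y as twice the number of path instances between x and y, divided by the sum of the path instances from x to itself and from y to itself. The function names, the toy authorship data, and the choice to store each author's papers as a set are assumptions made for this example only.

```python
def commuting_counts(author_papers):
    """Count APA path instances between every ordered pair of authors.

    With one 'writes' link per (author, paper) pair, the number of APA
    path instances between x and y equals the number of shared papers.
    """
    counts = {}
    for x, papers_x in author_papers.items():
        for y, papers_y in author_papers.items():
            counts[(x, y)] = len(papers_x & papers_y)
    return counts


def pathsim(counts, x, y):
    """PathSim under a symmetric meta path: 2*M[x,y] / (M[x,x] + M[y,y])."""
    denom = counts[(x, x)] + counts[(y, y)]
    return 2 * counts[(x, y)] / denom if denom else 0.0


def top_k(counts, query, candidates, k):
    """Return the k candidates most similar to query, best first."""
    scored = [(pathsim(counts, query, c), c) for c in candidates if c != query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


# Hypothetical toy authorship data.
author_papers = {
    "alice": {"p1", "p2"},
    "bob": {"p2"},
    "carol": {"p3"},
    "dave": {"p1", "p2"},
}
counts = commuting_counts(author_papers)
ranking = top_k(counts, "alice", author_papers, k=2)
```

The normalization by self-path counts favors peers with a similar publication profile: `dave`, who shares both papers with `alice`, ranks above `bob`, who shares only one.
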
  24. 24. Clustering ❏ Clustering based on networked data ❏ traditional methods are based on a homogeneous network (e.g. normalized cuts, modularity) ❏ in an HIN, clustering needs to consider the multiple types of objects that co-exist in the network
  25. 25. Clustering ❏ Integrate the attribute information ❏ based on the network structure, connections in the network and the vertex attributes ❏ Integrate the text information ❏ topic mining - a unified topic model with HIN ❏ multiple objects clustering Boden, B., et al. "Density-Based Subspace Clustering in Heterogeneous Networks." Machine Learning and Knowledge Discovery in Databases (2014). Deng, Hongbo, et al. "Probabilistic topic models with biased propagation on heterogeneous information networks." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 2011.
  26. 26. Clustering ❏ Integrate with mining tasks ❏ semi-supervised learning - path selection according to user guidance (labeled information) ❏ ranking-based clustering on HIN - mutual promotion of clustering and ranking ❏ Outlier detection ❏ detect association-based clique outliers in HIN ❏ find subnetwork outliers according to different queries and semantics ❏ a meta-path based outlier mining in HIN
  27. 27. Classification ❏ Classification in HIN ❏ classify multiple types of objects simultaneously ❏ the label of objects is decided by the effects of different-typed objects along different typed links ❏ Multi-label classification ❏ use multiple types of relationships mined from linkage structure of HIN ❏ Meta paths for feature generation ❏ Ranking-based classification ❏ mutually enhance classification and ranking knowledge propagation
  28. 28. Link prediction ❏ Challenges ❏ The links to be predicted are of different types ❏ Dependencies existing among multiple types of links ➔ collectively predict multiple types of links ➔ utilize meta paths ❏ Others ❏ Link prediction across multiple aligned heterogeneous networks ❏ Dynamic link prediction different link relations
  29. 29. Ranking ❏ Challenges ❏ treating all objects equally will mix different types of objects together ❏ different results under different meta paths(different semantic meanings) ❏ Meta-path based ranking ❏ simultaneously evaluate the importance of multiple types of objects and meta paths
  30. 30. Recommendation ❏ Meta path ❏ explore the semantics and extract relations among objects ❏ Can effectively fuse all kinds of information ❏ utilize different contexts ❏ use interest groups ❏ unified framework of multiple HIN features
  31. 31. Information fusion ❏ Across multiple aligned HINs ❏ via the shared common information entities ❏ A more comprehensive and consistent knowledge shared in different HINs using their structures, properties, and activities ❏ Information can reach more users and achieve broader influence ❏ Transferring knowledge between aligned networks ❏ e.g. overcome cold start problem in recommendation system
  32. 32. Advanced topics
  33. 33. More complex network construction ❏ Easy to construct HIN with well-defined schema ❏ From real data? ❏ objects and links can be noisy or not reliable ❏ duplicated names ❏ missing relations ❏ ... ❏ high-quality HINs by cleaning ❏ integrated with information extraction, NLP, and other techniques
  34. 34. More powerful mining methods ❏ Network structure: Bipartite, Star-schema, Multiple-hub, Weighted, Dynamic, Multiple-network, Schema-rich
  35. 35. More powerful mining methods ❏ Semantic mining ❏ node/link semantics ❏ different-typed nodes/links have different semantics ❏ meta-path ❏ different similarities under different meta paths ❏ constrained meta-path ❏ constraint on node ❏ constraint on link APC APA APA|P.L = “Data Mining” APA|P.L = “Information Retrieval” …. weighted meta-path
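
A constrained meta path such as APA|P.L = “Data Mining” can be read as the APA relation restricted to papers whose label satisfies the constraint. The following sketch illustrates that reading under stated assumptions: the function `constrained_apa`, the `paper_label` mapping, and the toy data are hypothetical and are not taken from the paper.

```python
def constrained_apa(author_papers, paper_label, label=None):
    """Count APA path instances between authors.

    If label is given, only papers carrying that label are allowed as the
    middle object, which models the node constraint APA|P.L = label.
    """
    counts = {}
    for x, papers_x in author_papers.items():
        for y, papers_y in author_papers.items():
            shared = papers_x & papers_y
            if label is not None:
                shared = {p for p in shared if paper_label.get(p) == label}
            counts[(x, y)] = len(shared)
    return counts


# Hypothetical toy data: each paper has one research-area label.
author_papers = {"alice": {"p1", "p2"}, "bob": {"p1", "p2", "p3"}}
paper_label = {
    "p1": "Data Mining",
    "p2": "Information Retrieval",
    "p3": "Information Retrieval",
}

apa_all = constrained_apa(author_papers, paper_label)
apa_dm = constrained_apa(author_papers, paper_label, label="Data Mining")
apa_ir = constrained_apa(author_papers, paper_label, label="Information Retrieval")
```

The same pair of authors is connected by two path instances under the plain APA path but by only one under each constrained path, which is how a constraint narrows the semantics of a meta path.
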
  36. 36. More powerful mining methods
  37. 37. Bigger networked data ❏ HIN can flexibly and effectively integrate varied objects and heterogeneous information ❏ However, real HINs pose many practical technical challenges ❏ huge size, dynamic change, memory capacity, etc. ❏ Instead of the whole network, hidden but small networks can be mined ❏ Quick/parallel computation strategies have been considered recently
  38. 38. Conclusion
  39. 39. Conclusion ❏ Research on HIN has surged in recent years because of the rich structural and semantic information it captures. ❏ The survey covers recent and future developments of different data mining tasks on HIN. ❏ It offers an understanding of the fundamental issues and a good starting point for working in this field.
  40. 40. Thank you ! Q & A