This document discusses mining heterogeneous information networks. It begins by defining heterogeneous information networks as information networks containing multiple object and link types. It then discusses how heterogeneous networks are richer than homogeneous networks derived from them by projection. Several examples of heterogeneous networks are given, such as bibliographic, social media, and healthcare networks. The document outlines principles for mining heterogeneous networks, including using meta-paths to explore network structures and relationships. It introduces methods for ranking, clustering, and classifying nodes in heterogeneous networks, such as the RankClus and NetClus algorithms, which integrate ranking and clustering.
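As a minimal sketch of the meta-path idea mentioned above, the following Python follows the Author-Paper-Author meta-path over a toy bibliographic network. The data, the `writes` relation and the function name `coauthor_counts` are hypothetical illustrations and are not taken from the summarized document.

```python
from collections import defaultdict

# Hypothetical toy network with two object types (authors, papers)
# and one link type (author writes paper).
writes = [("alice", "p1"), ("bob", "p1"), ("bob", "p2"), ("carol", "p2")]

def coauthor_counts(writes):
    """Count path instances of the meta-path Author-Paper-Author."""
    # Group authors by the paper they wrote.
    authors_of = defaultdict(set)
    for author, paper in writes:
        authors_of[paper].add(author)
    # Every ordered pair of distinct authors on one paper is one path instance.
    counts = defaultdict(int)
    for authors in authors_of.values():
        for a in authors:
            for b in authors:
                if a != b:
                    counts[(a, b)] += 1
    return dict(counts)

apa = coauthor_counts(writes)
```

The result is the homogeneous co-author network obtained by projection: it records how many papers two authors share but no longer which papers, which illustrates why the heterogeneous network carries more information.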
Tutorial at OAI5 (cern.ch/oai5). Abstract: This tutorial will provide a practical overview of current practices in modelling complex or compound digital objects. It will examine some of the key scenarios around creating complex objects and will explore a number of approaches to packaging and transport. Taking research papers, or scholarly works, as an example, the tutorial will explore the different ways in which these, and their descriptive metadata, can be treated as complex objects. Relevant application profiles and metadata formats will be introduced and compared, such as Dublin Core, in particular the DCMI Abstract Model, and MODS, alongside content packaging standards, such as METS, MPEG-21 DIDL and IMS CP. Finally, we will consider some future issues and activities that are seeking to address these. The tutorial will be of interest to librarians and technical staff with an interest in metadata or complex objects, their creation, management and re-use.
JeromeDL is a social semantic digital library that allows users to easily publish and access resources online through metadata tagging and community sharing features. It integrates information from different metadata sources, provides interoperability between systems, and delivers more robust search interfaces powered by semantics. Resources are accessible by machines through rich metadata and the system involves the community in sharing knowledge through social features like comments, bookmarks, and user profiles.
Presented at the Northern Ohio Technical Services Librarians' meeting, November 22, 2013. Describes why libraries should move toward a linked data future to enable their resources to be discoverable on the open web, and includes lessons learned from developing the eXtensible Catalog at the University of Rochester.
Mining and Supporting Community Structures in Sensor Network Research - Marko Rodriguez
The document discusses mining and supporting community structures in sensor network research. It summarizes a study that analyzed the co-authorship network of researchers at the Center for Embedded Network Sensing (CENS) to determine if structural communities detected in the network are independent of socio-academic communities like academic department or affiliation. The study found that structural communities correspond more to department and affiliation, while academic position and country of origin are independent of structural communities.
Simple Knowledge Organization System (SKOS) in the Context of Semantic Web De... - gardensofmeaning
The document discusses the development and use of SKOS (Simple Knowledge Organization System) for representing knowledge organization systems like thesauri and classification schemes as structured data on the semantic web. It describes how LCSH (Library of Congress Subject Headings) has been converted to RDF using SKOS and published as linked open data. It suggests further steps like linking LCSH to other metadata and developing RDF representations of additional bibliographic schemas.
UNIT V TEXT AND OPINION MINING
Text mining in social networks - Opinion extraction - Sentiment classification and clustering - Temporal sentiment analysis - Irony detection in opinion mining - Wish analysis - Product review mining - Review classification - Tracking sentiments towards topics over time
This presentation was provided by Tim McGeary of Duke University during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
The document discusses various challenges in social network analysis including collecting and extracting network data at scale from sources such as the web, validating automated data extraction methods, and developing algorithms and software that can analyze large and complex network datasets. It also outlines different network analysis methods, visualization and simulation techniques, and recommendations for how tools can better support networking, referrals, and workflows across multiple data sources and programs. Scaling methods and algorithms to very large network sizes and developing standards to integrate diverse data and tools are highlighted as key challenges.
This document provides an overview of hypertext and intertext in reading and writing. It defines hypertext as non-linear text that uses links to allow readers to navigate between related pieces of information and create their own understanding. Hypertext is made possible by technologies like the World Wide Web and allows for multimedia integration. Intertext refers to the relationships between texts and how a text's meaning depends on its context. Reading and writing involves understanding intertextual connections and how authors develop arguments using evidence from other sources.
Slides to accompany Dr Louise Cooke's workshop session "An introduction to social network analysis" presented at DREaM Event 2.
For more information about the event, please visit http://lisresearch.org/dream-project/dream-event-2-workshop-tuesday-25-october-2011/
The document is a term paper submitted by Saurabh Singh to Cherry Khosla on types of multimedia tools used for information retrieval. It includes an acknowledgment thanking those who supported and guided the project. The paper contains an introduction, history of information retrieval, types of multimedia information retrieval systems, types of retrieval tools including image, text and audio analysis, limitations of multimedia information retrieval, new challenges, and conclusion. It provides references for further reading.
This document discusses dataset profiling and the LinkedUp data catalog. It describes how LinkedUp profiles 34 educational datasets, including information on their schemas, accessibility, and topic coverage. It also explains the benefits of dataset profiling, such as enabling federated querying and exploratory search over multiple datasets. Finally, it outlines techniques for profiling linked data and applications of the profiles through tools like Cite4Me and the LinkedUp data catalog.
The whitepaper addresses challenges faced by data-driven organizations, medical research and health care. It summarizes how context-enabled semantic enrichment can transform traditional methods of searching for optimal data. 3RDi offers advanced content enrichment with Named Entity Recognition, semantic similarity, content classification and content summarization, with the aim of getting the right data at the right time to medical researchers and health care practitioners.
The document discusses knowledge organization systems (KOS) and how the Simple Knowledge Organization System (SKOS) bridges KOS and the Semantic Web. It provides examples of KOS like taxonomies and thesauruses and explains how they are used differently than ontologies. SKOS is defined as an RDF vocabulary for representing KOS online in a machine-readable way and became a W3C standard in 2009.
This document provides an overview of an information retrieval course. The course will cover topics related to information retrieval models, techniques, and systems. Students will complete exams, assignments, and a major project to build a search engine using both text-based and semantic retrieval techniques. The document defines key concepts in information retrieval and discusses different types of information retrieval systems and techniques.
A Survey on Decision Support Systems in Social Media - Editor IJCATR
Web 3.0 is the upcoming phase in web evolution. Web 3.0 will be about “feeding you the information that you want, when you want it”, i.e. personalization of the web. The basic principle of Web 3.0 is linking, integrating and analyzing data from various data sources into new information streams by means of semantic technology. Web 3.0 thus comprises two platforms: semantic technologies and a social computing environment. A recommender system is a subclass of decision support system. Recommendations in the social web are used to personalize the web [20]. A Social Tagging System is one type of social media. In this paper we present a survey of various recommendations in Social Tagging Systems (STSs), such as tag, item, user and unified recommendations, along with the semantic web, and discuss the major thrust areas of research in each category.
Fox-Keynote-Now and Now of Data Publishing-nfdp13 - DataDryad
The document summarizes Peter Fox's presentation at the Now and Now for Data conference in Oxford, UK on May 22, 2013. Fox discusses different metaphors for making data publicly available, including data publication, ecosystems, and frameworks for conversations about data. He examines pros and cons of different approaches like data centers, publishers, and linked data. The presentation considers how to improve data sharing and what roles different stakeholders like producers and consumers play.
This document summarizes Pasquale Lops' presentation on semantics-aware content-based recommender systems. It discusses how content-based recommender systems traditionally rely on keyword profiles but have limitations due to issues like multi-word concepts, synonymy, and polysemy. The presentation proposes leveraging semantic text analytics techniques like word sense disambiguation, explicit semantic analysis using Wikipedia, and linked open data to move beyond keyword profiles and address issues like overspecialization and lack of serendipity in recommendations.
Research on ontology based information retrieval techniques - Kausar Mukadam
The document summarizes and compares three novel ontology-based information retrieval techniques. It discusses a technique for retrieving information in the domain of Traditional Chinese Medicine that uses an ontology to represent concepts and measures concept similarity to sort search results. It also describes a framework for semantic indexing and querying that uses an ontology and entity-attribute-value model to improve scalability, usability, and retrieval performance for transport systems. Additionally, it outlines a semantic extension retrieval model that uses ontology annotation and semantic extension of queries to address limitations of keyword-based search. The techniques are evaluated based on precision and recall measures to analyze their effectiveness compared to traditional methods.
The document discusses analyzing the Web of Data (WoD) as a complex network at multiple scales. At the graph scale, the WoD contains over 100 nodes (datasets) connected by 350 edges. At the triple scale, a network of over 600,000 nodes and 800,000 edges was analyzed. Network analysis found the WoD exhibits properties like short average path lengths, power law degree distributions, and a few highly central nodes like DBpedia. Ongoing challenges include implicit links, multi-relations, and dynamics as data is continuously added.
Graph mining analyzes structured data like social networks and the web through graph search algorithms. It aims to find frequent subgraphs using Apriori-based or pattern growth approaches. Social networks exhibit characteristics like densification and heavy-tailed degree distributions. Link mining analyzes heterogeneous, multi-relational social network data through tasks like link prediction and group detection, facing challenges of logical vs statistical dependencies and collective classification. Multi-relational data mining searches for patterns across multiple database tables, including multi-relational clustering that utilizes information across relations.
Avoiding Anonymous Users in Multiple Social Media Networks (SMN) - paperpublications3
Abstract: The main aim of this project is to secure user login and data sharing among social networks like Gmail and Facebook, and to find anonymous users of these networks. If the original user is not available on the network but their friends or an anonymous user know their login details, it is possible to misuse their chats. In this project we have to overcome anonymous users using the network without the original user's knowledge. An unauthorized user using the login to chat, share images or videos, etc. is the problem to be overcome in this project. To that end, the user first registers their details with one secured question and answer. Because the anonymous user can delete their chat or data, the secured question is used to recover the unauthorized user's chat history or sharing details along with their IP address or MAC address. So in this project they have found a way to prevent anonymous users from misusing the original user's login details.
This document discusses using semantics and semantic triples to represent relationships between concepts and entities. Semantic triples use subjects, predicates, and objects to make assertions about relationships. These triples can be represented in various formats like RDF/XML, Turtle, and TriX. Semantic schemas like RDFS and OWL can define additional rules and relationships. SPARQL is then used to query the semantic graph. Combining semantics with search in MarkLogic allows documents to be flexibly joined and queried based on both their content and semantic relationships.
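The subject-predicate-object model described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions only: the triples, the identifiers and the `match` helper are hypothetical, and the pattern matching shown is neither SPARQL nor any MarkLogic API.

```python
# Each triple is (subject, predicate, object); all names are made up.
triples = {
    ("doc1", "hasAuthor", "alice"),
    ("doc2", "hasAuthor", "bob"),
    ("alice", "worksFor", "acme"),
}

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# All authorship assertions, regardless of subject and object.
authored = match(triples, p="hasAuthor")
```

A real triple store would index the triples and evaluate patterns joined over shared variables; the linear scan here only shows the shape of the data and of a single-pattern query.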
The Semantic Web in Digital Libraries: A Literature Review - sstose
The document discusses the potential for semantic web technologies to improve digital libraries by enabling machines to understand relationships between terms, not just keyword mappings. It reviews literature on using ontologies and RDF to integrate metadata from different libraries and allow queries about authorship, synonyms, and other relationships rather than just searching for keywords. Adapting existing library metadata standards like Dublin Core to semantic web frameworks could improve discovery, collaboration, and interoperability for digital collections. However, significant organizational changes may be required within libraries to fully realize this vision.
Searching for patterns in crowdsourced information - Silvia Puglisi
This document introduces crowdsourcing and discusses discovering patterns in crowdsourced data. It discusses defining the context of volunteered information on the internet in order to understand relationships between data. A network model is proposed where different types of context define nodes and relationships between context determine edges. Properties of small world networks are discussed including how they could be used to model relationships between crowdsourced data and evaluate data quality. Finally, applications to search ranking, privacy and security are briefly mentioned.
ICPSR - Complex Systems Models in the Social Sciences - Lecture 4 - Professor... - Daniel Katz
This document provides an overview of complex systems models in social sciences, focusing on network analysis and community detection methods. It discusses key concepts like directed vs undirected networks, weighted vs unweighted edges, and overlapping vs non-overlapping communities. It also notes important considerations like network resolution, computational complexity, and how community detection results depend on the specific context and questions being examined. A variety of examples are provided, including social networks defined by friendships or voting coalitions.
An empirical examination of the structure of scholarship in the Society for the Psychological Study of Social Issues (SPSSI) grounded in network analyses of shared citations (bibliographic couplings)
A survey of heterogeneous information network analysis - SOYEON KIM
A Survey of Heterogeneous Information Network Analysis
Chuan Shi, Member, IEEE,
Yitong Li, Jiawei Zhang, Yizhou Sun, Member, IEEE,
and Philip S. Yu, Fellow, IEEE
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015
What Are Links in Linked Open Data? A Characterization and Evaluation of Link...Armin Haller
Linked Open Data promises to provide guiding principles to publish interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable, and reusable datasets. In this talk I argue that while as such, Linked Data may be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality issues even when knowledge graphs are published as Linked Data. In this talk I will first define the boundaries of what constitutes a single coherent knowledge graph within Linked Data, i.e., present a principled notion of what a dataset is and what links within and between datasets are. I will also define different link types for data in Linked datasets and present the results of our empirical analysis of linkage among the datasets of the Linked Open Data cloud. Recent results from our analysis of Wikidata, which has not been part of the Linked Open Data Cloud, will also be presented.
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisXanat V. Meza
Disclaimer: All original texts and images belong to their rightful owners.
Chapter 8 of the book "Bibliometrics and Citation Analysis" by Nicola de Bellis.
● Quantum Algorithm of Imperfect KB Self-organization Pt I: Smart Control-Information-Thermodynamic Bounds
● A New Approach of Intelligent Data Retrieval Paradigm
● Connected and Autonomous Vehicles (CAVs) Challenges with Nonmotorized Amenities Environments
● Quantum Algorithm of Imperfect KB Self-organization. Pt II: Robotic Control with Remote Knowledge Base Exchange
● Learning in AI Processor
Linked Data and Users in Library - Does the library communicate efficiently?Hansung University
1. This document discusses the principles of linked data and how it can be applied in libraries to better communicate with changing user needs.
2. Tim Berners-Lee's four principles of linked data are outlined, which involve using URIs to identify objects and linking objects to other URIs to discover more data. Library data like bibliographic records are well-suited for this approach.
3. User needs are changing rapidly due to advancing technologies and users want information delivered through social media. Libraries must improve communication through approaches like linked data to remain relevant to users.
What is Linked Data, and What Does It Mean for Libraries?Emily Nimsakont
1) The document discusses the concept of Linked Data and how it differs from the traditional web by making relationships between data explicit through the use of URIs and RDF.
2) Linked Data could benefit libraries by allowing library data to be more openly connected and standardized, dissolving the boundaries between individual bibliographic records.
3) There are already some library Linked Data projects underway like the Library of Congress Authorities and Vocabularies. Linked Data may change library workflows and the role of catalogers.
Semantic Web Technologies: Changing Bibliographic Descriptions?Stuart Weibel
Keynote presentation at the North Atlantic Health Science Library meeting, October 26, 2009.
An introduction to semantic web technologies and their relationship to libraries and bibliographic data.
Stuart Weibel, Senior Research Scientist, OCLC Research
This poster provides referencing services to linking bibliographical papers and citations with existing Linked Open Data. It aims to convert current bibliographical data in various digital library databases into semantic bibliographical data to enable research profiling and intelligent knowledge discovery
Iaetsd similarity search in information networks usingIaetsd Iaetsd
The document proposes a novel meta-path based similarity measure called PathSim to find similar peer objects in heterogeneous information networks. PathSim captures peer similarity by measuring how strongly connected two objects are as well as how comparable their visibility is in the network. An efficient algorithm is also introduced to support online top-k queries for meta-path based similarity search using partial materialization and co-clustering based pruning. Experimental results on bibliographic networks extracted from DBLP and Flickr demonstrate the effectiveness of PathSim and efficiency of the proposed algorithms.
Entity linking with a knowledge base issues techniques and solutionsCloudTechnologies
The document discusses entity linking, which is the task of linking entity mentions in text to corresponding entries in a knowledge base. It presents challenges like name variations and ambiguity. The paper then surveys main approaches to entity linking and discusses applications like knowledge base population and question answering. It also covers evaluation of entity linking systems and directions for future work.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Laurent Alquier
The document discusses using a collaborative linked data framework to explore a data landscape. It describes how the framework helps scientists access and integrate disparate data sources to answer translational research questions. Key components of the framework include a semantic wiki for cataloging data sources, linking data concepts, querying across sources, and visualizing relationships between sources. The goal is to provide scientists with flexible tools to discover and leverage relevant data without needing expertise in data management.
This document summarizes a presentation on using bibliographic couplings to analyze the structure of a large public university. The presentation analyzed co-citation networks between university scholars and the papers they cited to identify overlapping intellectual communities across disciplines. It identified key individuals who bridge communities and act as knowledge conduits. The analysis found that the network of engaged scholars bears little resemblance to the existing academic units, suggesting opportunities to restructure the university to better support interdisciplinary work.
This document provides an overview of social network analysis, including key concepts, analytic techniques, and examples of classic studies. It discusses the basic components of social networks like actors, ties, and relationships. It also describes different types of networks and measures used in social network analysis, such as degree centrality and betweenness centrality. Finally, it highlights some influential early social network analysis studies and resources for further information.
Network theory examines how relationships between individuals (nodes) connected by interactions (edges) impact organizations. Key concepts include centrality, which measures how well-connected a node is; cliques, which are densely connected subgroups; and structural holes, which are brokers between separate groups. Applying network theory can provide insights into how relationships influence information sharing, collaboration, and social capital within organizations.
Presentación del Dr. Getaneh Alemu (Solent University, Reino Unido), en el II Congreso de Información, Comunicación e Investigación (CICI 2018) “Metadatos y Organización de la Información”. Facultad de Filosofía y Letras de la Universidad Autónoma de Chihuahua, México. Evento organizado por el Cuerpo Académico 'Estudios de la Información' y el Grupo Disciplinar ‘Información, Lenguaje, Comunicación y Desarrollo Sostenible’. 29 de octubre de 2018.
Machine Status Prediction for Dynamic and Heterogenous Cloud Environmentjins0618
The widespread utilization of cloud computing services
has brought in the emergence of cloud service reliability
as an important issue for both cloud providers and users. To
enhance cloud service reliability and reduce the subsequent losses, the future status of virtual machines should be monitored in real time and predicted before they crash. However, most existing methods ignore the following two characteristics of actual cloud
environment, and will result in bad performance of status prediction:
1. cloud environment is dynamically changing; 2. cloud
environment consists of many heterogeneous physical and virtual
machines. In this paper, we investigate the predictive power of
collected data from cloud environment, and propose a simple yet
general machine learning model StaP to predict multiple machine
status. We introduce the motivation, the model development
and optimization of the proposed StaP. The experimental results
validated the effectiveness of the proposed StaP.
Latent Interest and Topic Mining on User-item Bipartite Networksjins0618
Latent Factor Model (LFM) is extensively used in
dealing with user-item bipartite networks in service recommendation systems. To alleviate the limitations of LFM, this papers presents a novel unsupervised learning model, Latent Interest and Topic Mining model (LITM), to automatically
mine the latent user interests and item topics from user-item
bipartite networks. In particular, we introduce the motivation
and objectives of this bipartite network based approach, and
detail the model development and optimization process of the
proposed LITM. This work not only provides an efficient method for latent user interest and item topic mining, but also highlights a new way to improve the accuracy of service recommendation. Experimental studies are performed and the results validate the LITM’s efficiency in model training, and its ability to provide better service recommendation performance based on user-item bipartite networks are demonstrated.
Web Service QoS Prediction Approach in Mobile Internet Environmentsjins0618
Existing many Web service QoS prediction
approaches are very accurate in Internet environments,
however they cannot provide accurate prediction values in
Mobile Internet environments since QoS values of Web
services have great volatility. In this paper, we propose an
accurate Web service QoS prediction approach by weakening
the volatility of QoS data from Web services in Mobile Internet
environments. This approach contains three process, i.e., QoS
preprocessing, user similarity computing, and QoS predicting.
We have implemented our proposed approach with experiment
based on real world and synthetic datasets. The results show
that our approach outperforms other approaches in Mobile
Internet environments.
This document outlines a course on mining heterogeneous information networks. The course will cover phrase mining and topic modeling from large text corpora, entity extraction and typing through relational graph construction and propagation, as well as mining and constructing heterogeneous information networks.
Christian jensen advanced routing in spatial networks using big datajins0618
Advanced Routing in Spatial Networks Using Big Data discusses using big data and advanced routing techniques for transportation networks. It covers modeling transportation networks using big data from sensors to assign time-varying weights representing factors like travel time and emissions. It then discusses routing algorithms that find optimal routes considering these weights, including algorithms for stochastic and uncertain weights. The document provides an overview of using big data to improve transportation network modeling and routing.
This document discusses influenceability estimation in social networks. It describes the independent cascade model of influence diffusion, where each node has an independent probability of influencing its neighbors. The problem is to estimate the expected number of nodes reachable from a given seed node. The document presents the naive Monte Carlo (NMC) approach, which samples possible graphs and averages the number of reachable nodes over the samples. While NMC provides an unbiased estimator, it has high variance. The document aims to reduce the variance to improve estimation accuracy.
Calton pu experimental methods on performance in cloud and accuracy in big da...jins0618
Experimental Methods on Performance in Clouds provides an overview of cloud computing and benchmarks for measuring cloud performance. It discusses:
1) The evolution of cloud computing from early data centers to modern cloud platforms like Amazon Web Services.
2) Examples of cloud workloads and benchmarks used to test performance, including RUBiS for e-commerce and RUBBoS for bulletin boards.
3) The challenges of modeling and measuring cloud performance at scale due to the large number of variable configurations, and the need for automation through frameworks like Expertus.
This document discusses challenges and opportunities in parallel graph processing for big data. It describes how graphs are ubiquitous but processing large graphs at scale is difficult due to their huge size, complex correlations between data entities, and skewed distributions. Current computation models have problems with ghost vertices, too much interaction between partitions, and lack of support for iterative graph algorithms. New frameworks are needed to handle these graphs in a scalable way with low memory usage and balanced computation and communication.
This document discusses challenges in processing large graphs and introduces an approach called GraphLego. It describes how GraphLego models large graphs as 3D cubes partitioned into slices, strips and dices to balance parallel computation. GraphLego optimizes access locality by minimizing disk access and compressing partitions. It also uses regression-based learning to optimize partitioning parameters and runtime. The document evaluates GraphLego on real-world graphs, finding it outperforms existing single-machine graph processing systems in execution efficiency and partitioning decisions.
Wang ke mining revenue-maximizing bundling configurationjins0618
This document presents algorithms for mining revenue-maximizing bundling configurations from consumer preference data. It discusses how willingness to pay for items can be estimated from online ratings data. The bundle configuration problem of grouping items into bundles to maximize total revenue is formulated and shown to be NP-hard for bundles of size 3 or more. Heuristic algorithms based on graph matching and greedy approaches are proposed to solve the problem approximately. The algorithms are evaluated on a real dataset of Amazon book ratings, demonstrating increased revenue from bundling over selling items individually.
Wang ke classification by cut clearance under thresholdjins0618
This document proposes a new classification method called CUT (Classification Under Threshold) that partitions data into groups that are either "cleared" or "not cleared" based on a user-specified threshold. The goal is to maximize the number of future cases that can be cleared without intervention. It was tested on problems like predicting transformers with carcinogenic PCBs. Experimental results found CUT outperformed other methods by clearing more non-hazardous cases while keeping errors under the threshold.
The document summarizes an entity extraction and typing framework proposed by the author. The framework constructs a heterogeneous graph connecting entity mentions, surface names, and relation phrases extracted from documents. It then performs joint type propagation and relation phrase clustering on the graph to infer types for entity mentions. Evaluation on news, tweets and reviews shows the framework outperforms existing methods in recognizing new types and domains without extensive feature engineering or human supervision. It obtains improvements by modeling each mention individually and addressing data sparsity through relation phrase clustering.
Strategy 3 for topical phrase mining first performs phrase mining on a corpus to extract candidate phrases, then applies topic modeling with the phrases as constraints. This approach generates coherent topics where words within a phrase share a topic label. Strategy 3 outputs high-quality topics and phrases faster than Strategies 1 and 2, and the topics and phrases have better coherence, quality, and intrusion scores. However, Strategy 3's topic inference relies on the accuracy of the initial phrase mining.
This document outlines a course on mining heterogeneous information networks. The course will cover phrase mining and topic modeling from large text corpora, entity extraction and typing through relational graph construction and propagation, as well as mining and constructing heterogeneous information networks.
The document summarizes a talk on analyzing the truthfulness of web data. It discusses analyzing structured data from two domains - stock prices and flight statuses - from multiple online sources. It was found that there is a lot of redundant data across sources, but also significant inconsistencies. Various techniques were explored to resolve inconsistencies and find true values, such as voting methods that leverage source accuracy and ignore copied data. Detecting copying between sources is important for improving truth finding but also computationally challenging. Scaling up copy detection methods can help address this challenge.
The document discusses techniques for creating small summaries of big data in order to improve computational scalability. It introduces sketch structures as a class of linear summaries that can be merged and updated efficiently. Specific sketch structures discussed include Bloom filters, Count-Min sketches, and Count sketches. It also covers counter-based summaries like the heavy hitters algorithm for finding frequent items in a data stream. The document outlines the structures, analysis, and applications of these various techniques for creating concise summaries of large datasets.
Gao cong geospatial social media data management and context-aware recommenda...jins0618
The document discusses geospatial social media data management and context-aware recommendation. It introduces technologies for geo-positioning users and content, and how user generated content from social media is increasingly associated with geo-locations. The document then outlines queries for static geo-textual data, publish/subscribe queries on geo-textual data streams, and personalized, context-aware point-of-interest recommendation based on modeling user behavior from geo-textual data.
Chengqi zhang graph processing and mining in the era of big datajins0618
The document discusses challenges and opportunities in graph processing and mining for big data, including new graph semantics, mining tasks, query processing algorithms, indexing techniques, and computing models needed to handle large graph datasets. It outlines the speaker's work on topics such as structural keyword search, graph matching, community detection, and graph classification. It also covers the speaker's work on efficient graph algorithms, parallel and distributed graph computation, and graph processing system design.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
3. 3
Outline
Motivation: Why Mining Information Networks?
Part I: Clustering, Ranking and Classification
Clustering and Ranking in Information Networks
Classification of Information Networks
Part II: Meta-Path Based Exploration of Information Networks
Similarity Search in Information Networks
Relationship Prediction in Information Networks
Recommendation with Heterogeneous Information Networks
User-Guided Meta-Path based clustering in heterogeneous networks
ClusCite: Citation recommendation in heterogeneous networks
4. 4
Ubiquitous Information Networks
Graphs and substructures: Chemical compounds, visual objects, circuits, XML
Biological networks
Bibliographic networks: DBLP, ArXiv, PubMed, …
Social networks: Facebook >100 million active users
World Wide Web (WWW): > 3 billion nodes, > 50 billion arcs
Cyber-physical networks
[Figure panels: yeast protein interaction network; World-Wide Web; co-author network; social network sites]
5. 5
What Are Heterogeneous Information Networks?
Information network: A network where each node represents an entity (e.g., actor in a
social network) and each link (e.g., tie) a relationship between entities
Each node/link may have attributes, labels, and weights
Link may carry rich semantic information
Homogeneous vs. heterogeneous networks
Homogeneous networks
Single object type and single link type
Single model social networks (e.g., friends)
WWW: A collection of linked Web pages
Heterogeneous, multi-typed networks
Multiple object and link types
Medical network: Patients, doctors, diseases, contacts, treatments
Bibliographic network: Publications, authors, venues (e.g., DBLP > 2 million papers)
6. 6
Homogeneous vs. Heterogeneous Networks
[Figure: a conference-author network (venues SIGMOD, SDM, ICDM, KDD, EDBT, VLDB, ICML, AAAI; authors Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice) shown next to a co-author network]
7. 7
Mining Heterogeneous Information Networks
Homogeneous networks can often be derived from their original
heterogeneous networks
Ex. Coauthor networks can be derived from author-paper-conference
networks by projection on authors
Paper citation networks can be derived from a complete bibliographic
network with papers and citations projected
Heterogeneous networks carry richer information than their corresponding
projected homogeneous networks
Typed heterogeneous network vs. non-typed heterogeneous network (i.e.,
not distinguishing different types of nodes)
Typed nodes and links imply more structures, leading to richer discovery
Mining semi-structured heterogeneous information networks
Clustering, ranking, classification, prediction, similarity search, etc.
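The projection described on this slide can be sketched in a few lines of Python. This is only an illustration: the paper identifiers and the author lists are hypothetical toy data, and `project_coauthors` is a name chosen here, not something from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy author-paper network: paper -> list of its authors.
paper_authors = {
    "p1": ["Tom", "Mary"],
    "p2": ["Tom", "Mary", "Alice"],
    "p3": ["Bob"],
}

def project_coauthors(paper_authors):
    """Project an author-paper network onto authors.

    Returns a dict mapping each (author, author) pair, in sorted order,
    to the number of papers the two authors share.
    """
    weights = Counter()
    for authors in paper_authors.values():
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)

coauthor = project_coauthors(paper_authors)
```

The projected network keeps only who wrote with whom and how often. Which paper or venue produced each tie is gone, which is the sense in which the heterogeneous network carries richer information.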
8. 8
Examples of Heterogeneous Information Networks
Bibliographic information networks: DBLP, ArXiv, PubMed
Entity types: paper (P), venue (V), author (A), and term (T)
Relation type: authors write papers, venues publish papers, papers contain terms
Twitter information network and other social media networks
Object types: user, tweet, hashtag, and term
Relation/link types: users follow users, users post tweets, tweets reply to tweets,
tweets use terms, tweets contain hashtags
Flickr information network
Object types: image, user, tag, group, and comment
Relation types: users upload images, images contain tags, images belong to groups,
users post comments, and comments comment on images
Healthcare information network
Object types: doctor, patient, disease, treatment, and device
Relation types: treatments used-for diseases, patients have diseases, patients visit
doctors
9. 9
Structures Facilitate Mining Heterogeneous Networks
Network construction: generates structured networks from unstructured text data
Each node: an entity; each link: a relationship between entities
Each node/link may have attributes, labels, and weights
Heterogeneous, multi-typed networks: e.g., Medical network: Patients, doctors,
diseases, contacts, treatments
[Figures: the DBLP bibliographic network (Venue, Paper, Author); the IMDB movie network (Actor, Movie, Director, Studio); the Facebook network]
10. 10
What Can be Mined from Heterogeneous Networks?
A homogeneous network can be derived from its “parent” heterogeneous network
Ex. Coauthor networks from the original author-paper-conference networks
Heterogeneous networks carry richer info. than the projected homogeneous networks
Typed nodes & links imply more structures, leading to richer discovery
Ex.: DBLP: A Computer Science bibliographic database (network)
Knowledge hidden in DBLP network | Mining function
Who are the leading researchers on Web search? | Ranking
Who are the peer researchers of Jure Leskovec? | Similarity search
Whom will Christos Faloutsos collaborate with? | Relationship prediction
Which relationships are most influential for an author to decide her topics? | Relation strength learning
How did the field of data mining emerge and evolve? | Network evolution
Which authors are rather different from their peers in IR? | Outlier/anomaly detection
11. 11
Principles of Mining Heterogeneous Information Networks
Information propagation across heterogeneous nodes & links
How to compute ranking scores, similarity scores, and clusters, and how to make
good use of class labels, across heterogeneous nodes and links
Objects in the networks are interdependent and knowledge can only be mined
using the holistic information in a network
Search and mining by exploring network meta structures
Heterogeneous information networks: semi-structured and typed
Network schema: a meta structure, guidance of search and mining
Meta-path based similarity search and mining
User-guided exploration of information networks
Automatically select the right relation (or meta-path) combinations with
appropriate weights for a particular search or mining task
User-guided or feedback-based network exploration is a strategy
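As a rough illustration of how a network schema can guide exploration, the sketch below checks a meta-path against a schema and counts the path instances that start from one object. Everything in it is assumed for illustration: the type letters follow the bibliographic example (A = author, P = paper, V = venue, T = term), while the `SCHEMA` and `EDGES` contents and both function names are hypothetical.

```python
# Hypothetical schema: the (source type, target type) links the network allows.
SCHEMA = {("A", "P"), ("P", "A"), ("P", "V"), ("V", "P"), ("P", "T"), ("T", "P")}

# Hypothetical typed edges, stored as adjacency lists per (source type, target type).
EDGES = {
    ("A", "P"): {"Tom": ["p1", "p2"], "Mary": ["p2"]},
    ("P", "V"): {"p1": ["KDD"], "p2": ["VLDB"]},
}

def is_valid_meta_path(types):
    """A meta-path is valid if every consecutive type pair is a schema link."""
    return all((s, t) in SCHEMA for s, t in zip(types, types[1:]))

def count_path_instances(types, start):
    """Count the path instances that follow `types` from `start` to each end object."""
    if not is_valid_meta_path(types):
        raise ValueError("meta-path is not allowed by the schema")
    counts = {start: 1}
    for s, t in zip(types, types[1:]):
        adjacency = EDGES.get((s, t), {})
        next_counts = {}
        for node, n_paths in counts.items():
            for neighbor in adjacency.get(node, []):
                next_counts[neighbor] = next_counts.get(neighbor, 0) + n_paths
        counts = next_counts
    return counts

venues_of_tom = count_path_instances(["A", "P", "V"], "Tom")
```

Here the meta-path A-P-V reads "author writes a paper published in a venue". A different meta-path over the same data answers a different question, which is why selecting and weighting meta-paths matters for each task.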
14. 14
Ranking-Based Clustering in Heterogeneous Networks
Clustering and ranking: Two critical functions in data mining
Clustering without ranking? Think of the dark time before Google, when there was no PageRank
Ranking will make more sense within a particular cluster
Einstein in physics vs. Turing in computer science
Why not integrate ranking with clustering & classification?
High-ranked objects should be more important in a cluster than low-ranked ones
Why give every object in a cluster the same weight?
But how do we obtain their weights?
Integrate ranking with clustering/classification in the same process
Ranking, as the feature, is conditional (i.e., relative) to a specific cluster
Ranking and clustering may mutually enhance each other
Ranking-based clustering: RankClus [EDBT’09], NetClus [KDD’09]
15. 15
A Bi-Typed Network Model and Simple Ranking
A bi-typed network model
Let X represent the venue type
Let Y represent the author type
The DBLP network can be represented as matrix W
Our task: Rank-based clustering of heterogeneous network W
Simple Ranking
Proportional to # of publications of an author and a venue
Considers only immediate neighborhood in the network
But what about an author publishing many papers only in very weak venues?
A bi-typed network:
$$W = \begin{pmatrix} W_{XX} & W_{XY} \\ W_{YX} & W_{YY} \end{pmatrix}$$
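A minimal sketch of simple ranking on the venue-author block of W, assuming `W_XY[i][j]` holds the number of papers author j published in venue i. The venue names, author names, and counts are hypothetical toy values.

```python
# Hypothetical toy data: rows are venues (type X), columns are authors (type Y).
venues = ["KDD", "VLDB"]
authors = ["Tom", "Mary", "Bob"]
W_XY = [
    [3, 1, 0],  # papers per author in KDD
    [1, 2, 5],  # papers per author in VLDB
]

def simple_ranking(W_XY):
    """Rank each venue and each author by its share of all publications."""
    total = sum(sum(row) for row in W_XY)
    r_X = [sum(row) / total for row in W_XY]        # venue scores
    r_Y = [sum(col) / total for col in zip(*W_XY)]  # author scores
    return r_X, r_Y

r_X, r_Y = simple_ranking(W_XY)
top_author = authors[r_Y.index(max(r_Y))]
```

Each score depends only on an object's own link counts, so an author with many papers in a single venue comes out on top regardless of how strong that venue is. That is the weakness the slide's closing question points at.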
16. 16
The RankClus Methodology
Ranking as the feature of the cluster
Ranking is conditional on a specific cluster
E.g., VLDB’s rank in Theory vs. its rank in the DB area
The distributions of ranking scores over objects are different in each cluster
Clustering and ranking are mutually enhanced
Better clustering: Rank distributions for clusters are more distinct from
each other
Better ranking: Better metric for objects is learned from the ranking
Not every object should be treated equally in clustering!
Y. Sun, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous
Information Network Analysis”, EDBT'09
17. 17
Simple Ranking vs. Authority Ranking
Simple Ranking
Proportional to # of publications of an author / a conference
Considers only immediate neighborhood in the network
Authority Ranking:
More sophisticated “rank rules” are needed
Propagate the ranking scores in the network over different types
What about an author publishing 100 papers in very weak conferences?
18. 18
Authority Ranking
Methodology: Propagate the ranking scores in the network over different types
Rule 1: Highly ranked authors publish many papers in highly ranked venues
Rule 2: Highly ranked venues attract many papers from many highly ranked authors
Rule 3: The rank of an author is enhanced if he or she co-authors with many highly
ranked authors
Other ranking functions are quite possible (e.g., using domain knowledge)
Ex. Journals may carry more weight than conferences in science
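The three rules can be read as an iterative score propagation. The sketch below is one plausible way to write them down, not a formula given on the slides: the mixing weight `alpha`, the uniform initialization, the fixed iteration count, and the toy matrices are all assumptions made for illustration.

```python
def normalize(scores):
    """Scale a list of non-negative scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def authority_ranking(W_XY, W_YY, alpha=0.95, iters=50):
    """Propagate rank scores between venues (X) and authors (Y).

    W_XY[i][j]: papers by author j in venue i.
    W_YY[j][k]: papers co-authored by authors j and k.
    alpha: assumed weight of venue evidence versus co-author evidence.
    """
    n_x, n_y = len(W_XY), len(W_XY[0])
    r_X = [1.0 / n_x] * n_x
    r_Y = [1.0 / n_y] * n_y
    for _ in range(iters):
        # Rule 2: a venue scores high if highly ranked authors publish many papers there.
        r_X = normalize([
            sum(W_XY[i][j] * r_Y[j] for j in range(n_y)) for i in range(n_x)
        ])
        # Rule 1 (venue term) and Rule 3 (co-author term) for authors.
        r_Y = normalize([
            alpha * sum(W_XY[i][j] * r_X[i] for i in range(n_x))
            + (1 - alpha) * sum(W_YY[j][k] * r_Y[k] for k in range(n_y))
            for j in range(n_y)
        ])
    return r_X, r_Y

# Hypothetical toy data: authors 0 and 1 share venue 0 and co-author;
# author 2 has the most papers, but all of them in venue 1, where nobody else publishes.
W_XY = [[4, 4, 0],
        [0, 0, 5]]
W_YY = [[0, 4, 0],
        [4, 0, 0],
        [0, 0, 0]]
r_X, r_Y = authority_ranking(W_XY, W_YY)
```

In this toy case author 2 has the highest publication count, yet ends up ranked below authors 0 and 1 because the only venue author 2 publishes in attracts no other highly ranked authors.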
19. 19
Alternative Ranking Functions
A ranking function is not only related to the link property, but also depends on
domain knowledge
Ex: Journals may carry more weight than conferences in science
Ranking functions can be defined on multi-typed networks
Ex: PopRank takes into account the impact both from the same type of objects
and from the other types of objects, with different impact factors for different
types
Use expert knowledge, for example,
TrustRank semi-automatically separates reputable, good objects from spam ones
Personalized PageRank uses expert ranking as query and generates rank
distributions w.r.t. such knowledge
A research problem that needs systematic study
20. 20
The EM (Expectation Maximization) Algorithm
The k-means algorithm has two steps at each iteration (in the E-M framework):
Expectation Step (E-step): Given the current cluster centers, each object is
assigned to the cluster whose center is closest to the object: An object is expected
to belong to the closest cluster
Maximization Step (M-step): Given the cluster assignment, for each cluster, the
algorithm adjusts the center so that the sum of distances between the objects assigned
to this cluster and the new center is minimized
The (EM) algorithm: A framework to approach maximum likelihood or maximum a
posteriori estimates of parameters in statistical models.
E-step assigns objects to clusters according to the current probabilistic clustering
or parameters of probabilistic clusters
M-step finds the new clustering or parameters that minimize the sum of squared
error (SSE) or maximize the expected likelihood
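The two k-means steps described above can be sketched as follows. This is a minimal illustration with squared Euclidean distance. The function name, the fixed iteration count, and the toy points are assumptions.

```python
import numpy as np

def kmeans_em(points, init_centers, n_iter=20):
    """k-means written as alternating E- and M-steps."""
    X = np.asarray(points, dtype=float)
    C = np.asarray(init_centers, dtype=float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # E-step: assign each object to the cluster with the closest center.
        sq_dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # M-step: move each center to the mean of its assigned objects,
        # which minimizes the sum of squared distances within the cluster.
        for k in range(len(C)):
            members = X[labels == k]
            if len(members) > 0:
                C[k] = members.mean(axis=0)
    return labels, C

labels, centers = kmeans_em([[0, 0], [0, 1], [10, 10], [10, 11]],
                            [[0, 0], [10, 10]])
```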
21. 21
From Conditional Rank Distribution to E-M Framework
Given a bi-typed bibliographic network, how can we use the conditional rank scores
to further improve the clustering results?
Conditional rank distribution as cluster feature
For each cluster C_k, the conditional rank scores, r_{X|C_k} and r_{Y|C_k}, can be viewed as
conditional rank distributions of X and Y, which are the features for cluster C_k
Cluster membership as object feature
From p(k|o_i) ∝ p(o_i|k)p(k), the higher an object's conditional rank in a cluster (p(o_i|k)), the
higher the probability that it belongs to that cluster (p(k|o_i))
A highly ranked attribute object has more impact on determining the cluster
membership of a target object
Parameter estimation using the Expectation-Maximization algorithm
E-step: Calculate the distribution p(z = k|yj, xi, Θ) based on the current value of Θ
M-Step: Update Θ according to the current distribution
22. 22
RankClus: Integrating Clustering with Ranking
An EM-style algorithm
Initialization
Randomly partition
Repeat
Ranking
Ranking objects in each sub-network induced from each cluster
Generating new measure space
Estimate mixture model coefficients for each target object
Adjusting cluster
Until change < threshold
[Figure: a bi-typed network of venues (SIGMOD, SDM, ICDM, KDD, EDBT, VLDB, ICML, AAAI) and authors (Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice); loop: Objects, Ranking, Sub-Network, Ranking, Clustering]
RankClus [EDBT’09]: Ranking and clustering mutually
enhancing each other in an E-M framework
An E-M framework for iterative enhancement
23. 23
Step-by-Step Running of RankClus
Initially, ranking distributions are mixed together
Two clusters of objects are mixed together, but somehow preserve similarity
Improved a little
Two clusters are almost well separated
Improved significantly
Stable
Well separated
Clustering and ranking of two fields: DB/DM vs. HW/CA (hardware/computer architecture)
Stable
24. 24
Clustering Performance (NMI)
Compare the clustering accuracy: For N objects, K clusters, and two clustering
results, let n(i, j) be the # of objects with cluster label i in the 1st clustering result (say,
generated by the new algorithm) and cluster label j in the 2nd clustering result (say,
the ground truth)
Normalized Mutual Info. (NMI):
joint distribution: p(i, j) = n(i, j)/N
row distr. p1(j) = Σ_i p(i, j), column distr. p2(i) = Σ_j p(i, j)
D1: med. separated & med. density
D2: med. separated & low density
D3: med. separated & high density
D4: highly separated & med. density
D5: poorly separated & med. density
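A minimal sketch of computing NMI from two label vectors, built on the joint distribution p(i, j) = n(i, j)/N defined above. The normalization used here, mutual information divided by the geometric mean of the two entropies, is one common convention and is an assumption, since the slide's formula did not survive extraction. The function name is hypothetical.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings."""
    _, ia = np.unique(np.asarray(labels_a), return_inverse=True)
    _, ib = np.unique(np.asarray(labels_b), return_inverse=True)
    # Contingency table n(i, j), then joint distribution p(i, j) = n(i, j) / N.
    n = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(n, (ia, ib), 1)
    p = n / n.sum()
    pa = p.sum(axis=1)  # marginal over the 1st clustering's labels
    pb = p.sum(axis=0)  # marginal over the 2nd clustering's labels
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / np.outer(pa, pb)[nz])).sum()
    ha = -(pa * np.log(pa)).sum()
    hb = -(pb * np.log(pb)).sum()
    if ha == 0 or hb == 0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if ha == hb else 0.0
    return mi / np.sqrt(ha * hb)
```

Identical partitions (up to renaming of the labels) score 1, and statistically independent partitions score 0.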
25. 25
RankClus: Clustering & Ranking CS Venues in DBLP
Top-10 conferences in 5 clusters using RankClus in DBLP (when k = 15)
26. 26
Time Complexity: Linear in # of Links
At each iteration, |E|: edges in network, m: # of target objects, K: # of clusters
Ranking for sparse network
~O(|E|)
Mixture model estimation
~O(K|E|+mK)
Cluster adjustment
~O(mK^2)
In all, linear in |E|
~O(K|E|)
Note: SimRank will be at least quadratic at each iteration since it evaluates distance
between every pair in the network
Comparison of algorithms on execution time
27. 27
NetClus: Ranking-Based Clustering with Star Network Schema
Beyond bi-typed network: Capture more semantics with multiple types
Split a network into multi-subnetworks, each a (multi-typed) net-cluster [KDD’09]
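[Figure: star network schema of a research network, with Paper (P) at the center linked to Term (T) via Contain, Author (A) via Write, and Venue (V) via Publish; NetClus splits the Computer Science network into net-clusters such as Database, Hardware, and Theory, each with the same P/T/A/V schema]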
DBLP network: Using terms, venues, and authors
to jointly distinguish a sub-field, e.g., database
28. 28
The NetClus Algorithm
Generate initial partitions for target objects and induce initial net-clusters
from the original network
Repeat // An E-M Framework
Build ranking-based probabilistic generative model for each net-cluster
Calculate the posterior probabilities for each target object
Adjust their cluster assignment according to the new measure defined by
the posterior probabilities to each cluster
Until the clusters do not change significantly
Calculate the posterior probabilities for each attribute object in each net-cluster
29. 29
NetClus on the DBLP Network
NetClus initialization: G = (V, E, W), with weight w_{x_i x_j} on the link between x_i and x_j
V = A ∪ C ∪ D ∪ T, where D (paper), A (author), C (conf.), T (term)
Simple ranking:
Authority ranking for type Y based on type X, through the center type Z:
For DBLP:
30. 30
Multi-Typed Networks Lead to Better Results
The network cluster for database area: Conferences, Authors, and Terms
NetClus leads to better clustering and ranking than RankClus
NetClus vs. RankClus: 16% higher accuracy on conference clustering
32. 32
NetClus: Experiment on DBLP: Database System Cluster
NetClus generates high quality
clusters for all of the three
participating types in the
DBLP network
Quality can be easily judged
by our commonsense
knowledge
Highly-ranked objects: Objects
centered in the cluster
database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707
Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849
Columns above, in order: Term (top 20), Author (top 20), Venue (top 5), each with its rank score
33. 33
Rank-Based Clustering: Works in Multiple Domains
RankCompete: Organize your photo album automatically!
MedRank: Rank treatments for AIDS from Medline
Use multi-typed image features to
build up heterogeneous networks
Explore multiple
types in a star
schema network
34. 34
iNextCube: Information Network-Enhanced Text Cube (VLDB’09 Demo)
Architecture of iNextCube
Dimension hierarchies generated by NetClus
Author/conference/term ranking for each research area. The research areas can be at different levels.
[Figure: net-cluster hierarchy. All; DB and IS, Theory, Architecture, …; DB, DM, IR; XML, Distributed DB, …]
Demo: iNextCube.cs.uiuc.edu
35. 35
Impact of RankClus Methodology
RankCompete [Cao et al., WWW’10]
Extend to the domain of web images
RankClus in Medical Literature [Li et al., Working paper]
Ranking treatments for diseases
RankClass [Ji et al., KDD’11]
Integrate classification with ranking
Trustworthy Analysis [Gupta et al., WWW’11] [Khac Le et al., IPSN’11]
Integrate clustering with trustworthiness score
Topic Modeling in Heterogeneous Networks [Deng et al., KDD’11]
Propagate topic information among different types of objects
…
37. 37
Outline
Motivation: Why Mining Information Networks?
Part I: Clustering, Ranking and Classification
Clustering and Ranking in Information Networks
Classification of Information Networks
Part II: Meta-Path Based Exploration of Information Networks
Similarity Search in Information Networks
Relationship Prediction in Information Networks
Recommendation with Heterogeneous Information Networks
User-Guided Meta-Path based clustering in heterogeneous networks
ClusCite: Citation recommendation in heterogeneous networks
38. 38
Classification: Knowledge Propagation Across
Heterogeneous Typed Networks
RankClass [Ji et al., KDD’11]: Ranking-based classification
Highly ranked objects will play a greater role in classification
Class membership and ranking are statistical distributions
Let ranking and classification mutually enhance each other!
Output: Classification results + ranking list of objects within each class
Classification: Labeled knowledge propagates through multi-typed objects across heterogeneous networks [KDD’11]
39. 39
GNetMine: Methodology
Classification of networked data can be essentially viewed as a process of
knowledge propagation, where information is propagated from labeled objects
to unlabeled ones through links until a stationary state is achieved
A novel graph-based regularization framework to address the classification
problem on heterogeneous information networks
Respect the link type differences by preserving consistency over each relation
graph corresponding to each type of links separately
Mathematical intuition: Consistency assumption
The confidence (f) of two objects (x_ip and x_jq) belonging to class k should be
similar if x_ip ↔ x_jq (R_ij,pq > 0)
f should be similar to the given ground truth on the labeled data
40. 40
GNetMine: Graph-Based Regularization
Minimize the objective function
$$J(f_1^{(k)}, \ldots, f_m^{(k)}) = \sum_{i,j=1}^{m} \lambda_{ij} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} R_{ij,pq} \left( \frac{f_{ip}^{(k)}}{\sqrt{D_{ij,pp}}} - \frac{f_{jq}^{(k)}}{\sqrt{D_{ji,qq}}} \right)^2 + \sum_{i=1}^{m} \alpha_i \left( f_i^{(k)} - y_i^{(k)} \right)^T \left( f_i^{(k)} - y_i^{(k)} \right)$$
First term, smoothness constraints: objects linked together should share similar estimations of confidence belonging to class k
Normalization by D_ij,pp and D_ji,qq, applied to each type of link separately: reduce the impact of popularity of objects
Second term: confidence estimation on labeled data and their pre-given labels should be similar
λ_ij and α_i, user preference: how much do you value this relationship / ground truth?
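A minimal sketch that evaluates a graph-regularization objective of this kind for one class k on toy data. It assumes a smoothness term with degree normalization per relation plus a fitting term on the labels. The dictionary layout, the names `lam` and `alpha` for the two user-preference weights, and the toy inputs are hypothetical.

```python
import numpy as np

def gnetmine_objective(R, f, y, lam, alpha):
    """Smoothness over each relation graph + fit to the given labels.

    R[(i, j)]  : relation matrix between object types i and j (n_i x n_j)
    f[i], y[i] : confidence estimates and given labels for type i
    lam[(i, j)], alpha[i] : weights on each relation / on the labels
    """
    J = 0.0
    for (i, j), R_ij in R.items():
        R_ij = np.asarray(R_ij, dtype=float)
        d_rows = R_ij.sum(axis=1)  # degree of each type-i object in R_ij
        d_cols = R_ij.sum(axis=0)  # degree of each type-j object in R_ij
        # Guard isolated objects (degree 0) against division by zero.
        f_i = np.asarray(f[i], dtype=float) / np.sqrt(np.where(d_rows > 0, d_rows, 1.0))
        f_j = np.asarray(f[j], dtype=float) / np.sqrt(np.where(d_cols > 0, d_cols, 1.0))
        diff = f_i[:, None] - f_j[None, :]
        J += lam[(i, j)] * (R_ij * diff ** 2).sum()
    for i in f:
        gap = np.asarray(f[i], dtype=float) - np.asarray(y[i], dtype=float)
        J += alpha[i] * (gap @ gap)
    return J

# Hypothetical toy input: two papers, two authors, one paper-author relation.
R_toy = {("paper", "author"): [[1, 0], [0, 1]]}
lam_toy = {("paper", "author"): 1.0}
alpha_toy = {"paper": 1.0, "author": 1.0}
```

Estimates that agree across every link and match the labels give J = 0. Disagreement across a link or with a label raises J in proportion to the corresponding weight.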
41. 41
RankClass: Ranking-Based Classification
Classification in heterogeneous networks
Knowledge propagation: Class label knowledge propagated across multi-typed
objects through a heterogeneous network
GNetMine [Ji et al., PKDD’10]: Objects are treated equally
RankClass [Ji et al., KDD’11]: Ranking-based classification
Highly ranked objects will play a greater role in classification
An object can only be ranked high in some focused classes
Class membership and ranking are stat. distributions
Let ranking and classification mutually enhance each other!
Output: Classification results + ranking list of objects within each class
42. 42
From RankClus to GNetMine & RankClass
RankClus [EDBT’09]: Clustering and ranking working together
No training, no available class labels, no expert knowledge
GNetMine [PKDD’10]: Incorp. label information in networks
Classification in heterog. networks, but objects treated equally
RankClass [KDD’11]: Integration of ranking and classification in heterogeneous
network analysis
Ranking: informative understanding & summary of each class
Class membership is critical information when ranking objects
Let ranking and classification mutually enhance each other!
Output: Classification results + ranking list of objects within each class
43. 43
Why Rank-Based Classification?
Different objects within one class have different importance/visibility!
The ranking of objects within one class serves as an informative understanding and
summary of the class
44. 44
Motivation
Why not do ranking after classification, or vice versa?
Because the two enhance each other mutually; the benefit is not unidirectional.
RankClass: iteratively let ranking and classification mutually enhance each other
Ranking Classification
46. 46
Comparing Classification Accuracy on the DBLP Data
Class: Four research areas: DB, DM, AI, IR
Four types of objects
Paper (14376), Conf. (20), Author (14475), Term (8920)
Three types of relations
Paper-conf., paper-author, paper-term
Algorithms for comparison
LLGC [Zhou et al. NIPS’03]
wvRN [Macskassy et al. JMLR’07]
nLB [Lu et al. ICML’03, Macskassy et al. JMLR’07]
47. 47
Object Ranking in Each Class: Experiment
DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
Rank objects within each class (with extremely limited label information)
Obtain high classification accuracy and excellent ranking within each class
                 Database   Data Mining      AI          IR
Top-5 ranked     VLDB       KDD              IJCAI       SIGIR
conferences      SIGMOD     SDM              AAAI        ECIR
                 ICDE       ICDM             ICML        CIKM
                 PODS       PKDD             CVPR        WWW
                 EDBT       PAKDD            ECML        WSDM
Top-5 ranked     data       mining           learning    retrieval
terms            database   data             knowledge   information
                 query      clustering       reasoning   web
                 system     classification   logic       search
                 xml        frequent         cognition   text
50. 50
Similarity Search in Heterogeneous Networks
Similarity measure/search is the basis for cluster analysis
Who are the most similar to Christos Faloutsos based on the DBLP network?
Meta-Path: Meta-level description of a path between two objects
A path on network schema
Denote an existing or concatenated relation between two object types
Different meta-paths tell different semantics
Meta-Path: Author-Paper-Author (A-P-A, co-authorship): Christos’ students or close collaborators
Meta-Path: Author-Paper-Venue-Paper-Author: Work in similar fields with similar reputation
51. 51
Existing Popular Similarity Measures for Networks
Random walk (RW): $s(x, y) = \sum_{p \in \mathcal{P}} \mathrm{prob}(p)$
The probability of a random walk starting at x and ending at y, following meta-path P
Used in Personalized PageRank (P-PageRank) (Jeh and Widom 2003)
Favors highly visible objects (i.e., objects with large degrees)
Pairwise random walk (PRW): $s(x, y) = \sum_{(p_1, p_2) \in (\mathcal{P}_1, \mathcal{P}_2)} \mathrm{prob}(p_1)\,\mathrm{prob}(p_2^{-1})$
The probability of a pairwise random walk starting at (x, y) and ending
at a common object (say z), following a meta-path (P1, P2)
Used in SimRank (Jeh and Widom 2002)
Favors pure objects (i.e., objects with highly skewed distribution in
their in-links or out-links)
[Figure: RW goes from X to Y along P; PRW goes from X along P1 and from Y along the inverse of P2, meeting at Z]
Note: P-PageRank and SimRank do not distinguish object type and relationship type
52. 52
Which Similarity Measure Is Better for Finding Peers?
Meta-path: APCPA
Comparison of Multiple Measures: A Toy Example
Author \ Conf.   SIGMOD  VLDB  ICDM  KDD
Mike             2       1     0     0
Jim              50      20    0     0
Mary             2       0     1     0
Bob              2       1     0     0
Ann              0       0     1     1
Who is more similar to Mike?
Measure \ Author   Jim     Mary    Bob     Ann
P-PageRank         0.376   0.013   0.016   0.005
SimRank            0.716   0.572   0.713   0.184
Random Walk        0.8983  0.0238  0.0390  0
Pairwise R.W.      0.5714  0.4440  0.5556  0
PathSim (APCPA)    0.083   0.8     1       0
Mike publishes a similar # of papers as Bob and Mary
Other measures find Mike is closer to Jim
PathSim: Favors peers
Peers: Objects with strong connectivity and similar visibility under a given meta-path
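A minimal sketch that reproduces the PathSim row of the toy example. It assumes the author-by-conference counts on this slide are the A-P-C path counts, so that the APCPA commuting matrix is the count matrix times its transpose, and it uses the PathSim definition s(x, y) = 2·M[x, y] / (M[x, x] + M[y, y]). The variable and function names are hypothetical.

```python
import numpy as np

authors = ["Mike", "Jim", "Mary", "Bob", "Ann"]
# Rows: authors; columns: SIGMOD, VLDB, ICDM, KDD (paper counts from the slide).
W_AC = np.array([[2, 1, 0, 0],
                 [50, 20, 0, 0],
                 [2, 0, 1, 0],
                 [2, 1, 0, 0],
                 [0, 0, 1, 1]], dtype=float)

def pathsim(W):
    """PathSim for a symmetric meta-path whose half-path count matrix is W."""
    M = W @ W.T                 # commuting matrix: path counts between authors
    self_paths = np.diag(M)     # path counts from each author back to itself
    return 2 * M / (self_paths[:, None] + self_paths[None, :])

S = pathsim(W_AC)
mike_scores = dict(zip(authors, S[0].round(3)))
```

Because the score is normalized by both authors' self-path counts, Jim's much larger publication volume lowers his similarity to Mike, while Bob, who has the same profile as Mike, scores 1.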
53. 53
Example with DBLP: Find Academic Peers by PathSim
Anhai Doan
CS, Wisconsin
Database area
PhD: 2002
Jun Yang
CS, Duke
Database area
PhD: 2001
Amol Deshpande
CS, Maryland
Database area
PhD: 2004
Jignesh Patel
CS, Wisconsin
Database area
PhD: 1998
Meta-Path: Author-Paper-Venue-Paper-Author
54. 54
Some Meta-Paths Are “Better” Than Others
Which pictures are most similar to this one?
[Figure: network schema with Image linked to Group, User, and Tag]
Meta-Path: Image-Tag-Image: Evaluate the similarity between images according to their linked tags
Meta-Path: Image-Tag-Image-Group-Image-Tag-Image: Evaluate the similarity between images according to tags and groups
55. 55
Comparing Similarity Measures in DBLP Data
Favor highly
visible objects
Are these tiny forums
most similar to SIGMOD?
Which venues are most
similar to DASFAA?
Which venues are most
similar to SIGMOD?
56. 56
Long Meta-Path May Not Carry the Right Semantics
Repeat the meta-path 2, 4, and infinite times for conference similarity query
57. 57
Co-Clustering-Based Pruning Algorithm
Meta-Path based similarity computation can be costly
The overall cost can be reduced by storing commuting matrices
for short path schemas and computing top-k queries online
Framework
Generate co-clusters for materialized commuting
matrices for feature objects and target objects
Derive upper bound for similarity between object and
target cluster and between object and object
Safely prune target clusters and objects if the upper bound
similarity is lower than current threshold
Dynamically update top-k threshold
58. 58
Meta-Path: A Key Concept for Heterogeneous Networks
Meta-path based mining
PathPredict [Sun et al., ASONAM’11]
Co-authorship prediction using meta-path based similarity
PathPredict_when [Sun et al., WSDM’12]
When a relationship will happen
Citation prediction [Yu et al., SDM’12]
Meta-path + topic
Meta-path learning
User Guided Meta-Path Selection [Sun et al., KDD’12]
Meta-path selection + clustering
61. 61
Relationship Prediction vs. Link Prediction
Link prediction in homogeneous networks [Liben-Nowell and Kleinberg, 2003,
Hasan et al., 2006]
E.g., friendship prediction
Relationship prediction in heterogeneous networks
Different types of relationships need different prediction models
Different connection paths need to be treated separately!
Meta-path-based approach to define topological features
62. 62
Why Prediction Using Heterogeneous Info Networks?
Rich semantics contained in structured, text-rich heterogeneous networks
Homogeneous networks, such as coauthor networks, miss too much critically
important information
Schema-guided relationship prediction
Semantic relationships among similar typed links share similar semantics and are
comparable and inferable
Relationships across different typed links are not directly comparable but their
collective behavior will help predict particular relationships
Meta-paths: encoding topological features
E.g., citation relations between authors: A — P → P — A
Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship
Prediction in Heterogeneous Bibliographic Networks", ASONAM'11
63. 63
PathPredict: Meta-Path Based Relationship Prediction
Who will be your new coauthors in the next 5 years?
Meta path-guided prediction of links and relationships
Philosophy: Meta path relationships among similar typed links share
similar semantics and are comparable and inferable
Co-author prediction (A—P—A) [Sun et al., ASONAM’11]
Use topological features encoded by meta paths, e.g., citation
relations between authors (A—P→P—A)
[Table: meta-paths between authors of length ≤ 4, with columns Meta-Path and Semantic Meaning]
64. 64
Logistic Regression-Based Prediction Model
Training and test pair: <xi, yi> = <history feature list, future relationship label>
Logistic Regression Model
Model the probability for each relationship as $p_i = \frac{e^{x_i \beta}}{e^{x_i \beta} + 1}$
β is the vector of coefficients, one per feature (including a constant 1)
MLE estimation
Maximize the likelihood of observing all the relationships in the training data
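A minimal sketch of the logistic model and the log-likelihood that MLE maximizes, using the two feature rows shown on this slide (<Mike, Ann> and <Mike, Jim>). The coefficient values in `beta_demo` are made up for illustration and are not fitted.

```python
import math

def predict(x, beta):
    """P(relationship = 1 | x); beta[0] is the coefficient of the constant 1."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, beta):
    """Log-likelihood of the observed labels; MLE picks beta to maximize it."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = predict(x_i, beta)
        total += math.log(p) if y_i == 1 else math.log(1.0 - p)
    return total

# Path-count features per author pair, in the slide's column order:
# A-P-A-P-A, A-P-V-P-A, A-P-T-P-A, A-P->P-A
X = [[4, 5, 100, 3],   # <Mike, Ann>, label 1
     [0, 1, 20, 2]]    # <Mike, Jim>, label 0
y = [1, 0]
beta_demo = [-3.0, 0.2, 0.2, 0.01, 0.3]  # hypothetical coefficients
p_ann, p_jim = predict(X[0], beta_demo), predict(X[1], beta_demo)
```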
Pair          A—P—A—P—A  A—P—V—P—A  A—P—T—P—A  A—P→P—A  A—P—A (label)
<Mike, Ann>   4          5          100        3        Yes = 1
<Mike, Jim>   0          1          20         2        No = 0
65. 65
Selection among Competitive Measures
We study four measures that define a relationship R encoded by a meta path
Path Count: Number of path instances between authors following R
Normalized Path Count: Normalize path count following R by the “degree”
of authors
Random Walk: Consider one way random walk following R
Symmetric Random Walk: Consider random walk in both directions
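A minimal sketch of the four measures computed from a matrix of path counts. The exact normalizations below are an assumption about how the verbal definitions above are formalized, and the function name and toy matrix are hypothetical.

```python
import numpy as np

def path_measures(PC, i, j):
    """Four topological measures between authors i and j.

    PC[a, b] = number of path instances from a to b following meta-path R.
    The reverse meta-path's counts are taken to be PC transposed.
    """
    PC = np.asarray(PC, dtype=float)
    PC_rev = PC.T
    out_i = PC[i, :].sum()   # all paths leaving i along R
    in_j = PC[:, j].sum()    # all paths arriving at j along R
    pc = PC[i, j]
    # Normalized path count: discount by the "degree" of both endpoints.
    npc = (PC[i, j] + PC_rev[j, i]) / (out_i + in_j)
    # Random walk: probability of reaching j from i along R.
    rw = PC[i, j] / out_i
    # Symmetric random walk: add the walk from j back to i along reversed R.
    srw = rw + PC_rev[j, i] / PC_rev[j, :].sum()
    return {"PC": pc, "NPC": npc, "RW": rw, "SRW": srw}

# Hypothetical co-author path counts among three authors.
PC_toy = [[0, 2, 2],
          [2, 0, 0],
          [2, 0, 0]]
m = path_measures(PC_toy, 0, 1)
```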
66. 66
Performance Comparison: Homogeneous vs.
Heterogeneous Topological Features
Homogeneous features
Only consider co-author sub-network (common neighbor; rooted PageRank)
Mix all types together (homogeneous path count)
Heterogeneous feature
Heterogeneous path count
Notation: HP2hop: highly productive source
authors with 2-hops reaching target authors
67. 67
Comparison among Different Topological Features
Hybrid heterogeneous topological feature is the best
Notations
(1) the path count (PC)
(2) the normalized path count (NPC)
(3) the random walk (RW)
(4) the symmetric random walk (SRW)
PC1: homogeneous common neighbor
PCSum: homogeneous path count
68. 68
The Power of PathPredict: Experiment on DBLP
Explain the prediction power of
each meta-path
Wald Test for logistic
regression
Higher prediction accuracy than
using projected homogeneous
network
11% higher in prediction
accuracy
Co-author prediction for Jian Pei: Only 42 among 4809
candidates are true first-time co-authors!
(Feature collected in [1996, 2002]; Test period in [2003,2009])
Evaluation of the prediction
power of different meta-paths
Prediction of new coauthors
of Jian Pei in [2003-2009]
Social relations
play more
important role?
70. 70
When Will It Happen?
From “whether” to “when”
“Whether”: Will Jim rent the movie “Avatar” in Netflix?
“When”: When will Jim rent the movie “Avatar”?
What is the probability Jim will rent “Avatar” within 2 months? 𝑃(𝑌 ≤ 2)
By when will Jim rent “Avatar” with 90% probability? 𝑡: 𝑃(𝑌 ≤ 𝑡) = 0.9
What is the expected time it will take for Jim to rent “Avatar”? 𝐸(𝑌)
May provide useful
information to supply
chain management
Output: P(X=1)=?
Output: A distribution of time!
71. 71
Generalized Linear Model
under Weibull Distribution Assumption
The Relationship Building Time Prediction Model
Solution
Directly model relationship building time: P(Y=t)
Geometric distribution, Exponential distribution, Weibull distribution
Use generalized linear model
Deal with censoring (relationship builds beyond the observed time interval)
Training Framework
T: Right censoring
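A minimal sketch of the quantities a time-to-relationship model would report once Weibull parameters are known, plus the per-pair log-likelihood with right censoring. How the shape and scale depend on the topological features (the generalized linear model part) is not shown. All function names and parameter values are hypothetical.

```python
import math

def weibull_cdf(t, shape, scale):
    """P(Y <= t): probability the relationship is built within time t."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def weibull_quantile(prob, shape, scale):
    """Time t such that P(Y <= t) = prob."""
    return scale * (-math.log(1.0 - prob)) ** (1.0 / shape)

def weibull_mean(shape, scale):
    """E(Y): expected relationship building time."""
    return scale * math.gamma(1.0 + 1.0 / shape)

def log_likelihood(t, observed, shape, scale):
    """One pair's contribution to the likelihood.

    observed=True : the relationship was built at time t (log density).
    observed=False: right-censored at t, i.e. not yet built by the end
                    of the observation interval (log survival).
    """
    z = (t / scale) ** shape
    if observed:
        return math.log(shape / scale) + (shape - 1.0) * math.log(t / scale) - z
    return -z
```

With shape 1 the Weibull reduces to the exponential distribution, so for scale 2 the expected time is 2 and the 90% time is 2·ln 10.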
72. 72
Author Citation Time Prediction in DBLP
Top-4 meta-paths for author citation time prediction
Predict when Philip S. Yu will cite a new author
Social relations are less important in author citation
prediction than in co-author prediction
Under Weibull distribution assumption
Study the same topic
Co-cited by the same paper
Follow co-authors’ citation
Follow the citations of authors
who study the same topic
75. 75
Enhancing the Power of Recommender Systems by Heterog.
Info. Network Analysis
Heterogeneous relationships complement each other
Users and items with limited feedback can be connected to the network by
different types of paths
Connect new users or items in the information network
Different users may require different models: Relationship heterogeneity makes
personalized recommendation models easier to define
[Figure: movie network linking Avatar, Titanic, Aliens, and Revolutionary Road to James Cameron, Kate Winslet, Leonardo DiCaprio, Zoe Saldana, and the genres Adventure and Romance]
Collaborative filtering methods suffer from the data sparsity issue
[Figure: # of ratings vs. # of users or items. A small set of users & items have a large number of ratings; most users & items have a small number of ratings]
Personalized recommendation with heterog. networks [WSDM’14]
76. 76
Relationship Heterogeneity Based Personalized
Recommendation Models
Different users may have different behaviors or preferences
Aliens
James Cameron fan
80s Sci-fi fan
Sigourney Weaver fan
Different users may be
interested in the same
movie for different reasons
Two levels of personalization
Data level
Most recommendation methods use one
model for all users and rely on personal
feedback to achieve personalization
Model level
With different entity relationships, we can
learn personalized models for different
users to further distinguish their differences
77. 77
Preference Propagation-Based Latent Features
[Figure: users Alice, Bob, and Charlie linked to the movies Titanic, Revolutionary Road, Skyfall, and King Kong, which connect to Kate Winslet, Naomi Watts, Ralph Fiennes, Sam Mendes, genre: drama, and tag: Oscar Nomination]
Generate L different meta-paths (path types) connecting users and items
Propagate user implicit feedback along each meta-path
Calculate latent features for users and items for each meta-path with an NMF-related method
78. 78
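Observation 1: Different meta-paths may have different importance
Global Recommendation Model
$$r(u_i, e_j) = \sum_{q=1}^{L} \theta_q \cdot \hat{U}_i^{(q)} \hat{V}_j^{(q)T} \qquad (1)$$
r(u_i, e_j): ranking score; θ_q: weight of the q-th meta-path; Û_i^{(q)}, V̂_j^{(q)}: features for user i and item j; L: number of meta-paths
Observation 2: Different users may require different models
Personalized Recommendation Model
$$r^*(u_i, e_j) = \sum_{k=1}^{c} \mathrm{sim}(C_k, u_i) \sum_{q=1}^{L} \theta_q^{\{k\}} \cdot \hat{U}_i^{(q)} \hat{V}_j^{(q)T} \qquad (2)$$
sim(C_k, u_i): user-cluster similarity; c total soft user clusters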
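A minimal sketch of the two scoring schemes on this slide: a global model that weights the per-meta-path latent-feature products, and a personalized model that mixes per-cluster weights by user-cluster similarity. The array layout, all names, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def global_score(U, V, theta, i, j):
    """Weighted sum over meta-paths of <user i features, item j features>.

    U[q], V[q]: latent user / item feature matrices for the q-th meta-path.
    theta[q]  : importance weight of the q-th meta-path.
    """
    return sum(theta[q] * float(U[q][i] @ V[q][j]) for q in range(len(theta)))

def personalized_score(U, V, theta_by_cluster, sim, i, j):
    """Mix the cluster-specific models by the user's soft cluster similarity.

    theta_by_cluster[k]: meta-path weights learned for user cluster k.
    sim[i][k]          : similarity between user i and cluster k.
    """
    return sum(sim[i][k] * global_score(U, V, theta_by_cluster[k], i, j)
               for k in range(len(theta_by_cluster)))

# Hypothetical toy data: L = 2 meta-paths, 1 user, 1 item, 2-dim features.
U_toy = [np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])]
V_toy = [np.array([[1.0, 0.0]]), np.array([[0.0, 2.0]])]
g = global_score(U_toy, V_toy, [0.5, 0.5], 0, 0)
p = personalized_score(U_toy, V_toy, [[1.0, 0.0], [0.0, 1.0]],
                       [[0.25, 0.75]], 0, 0)
```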
79. 79
Parameter Estimation
• Bayesian personalized ranking (Rendle UAI’09)
• Objective function (3): minimize over Θ a loss that applies the sigmoid function to each correctly ranked item pair, i.e., u_i gave feedback to e_a but not e_b
Learning Personalized Recommendation Model
Soft cluster users with NMF + k-means
For each user cluster, learn one model with Eq. (3)
Generate personalized model for each user on the fly with Eq. (2)
80. 80
Experiments: Heterogeneous Network Modeling
Enhances the Quality of Recommendation
Datasets
Comparison methods
Popularity: recommend the most popular items to users
Co-click: conditional probabilities between items
NMF: non-negative matrix factorization on user feedback
Hybrid-SVM: use Rank-SVM with plain features (utilize both user feedback and
information network)
HeteRec personalized recommendation (HeteRec-p) leads to the best recommendation
83. 83
Why User Guidance in Clustering?
Different users may like to get different clusters for different clustering goals
Ex. Clustering authors based on their connections in the network
{1,2,3,4}
{5,6,7,8} {1,3,5,7}
{2,4,6,8}
{1,3}
{2,4}
{5,7}
{6,8}
Which meta-
path(s) to choose?
84. 84
User Guidance Determines Clustering Results
Different user preferences (e.g., by seeding desired clusters) lead to the choice of different meta-paths
[Figure: Seeds + Meta-path(s) lead to Clustering. Seeds {1}, {5} give clusters {1,2,3,4}, {5,6,7,8}; seeds {1}, {2}, {5}, {6} give clusters {1,3}, {2,4}, {5,7}, {6,8}]
Problem: User-guided clustering
with meta-path selection
Input:
The target type for clustering T
# of clusters k
Seeds in some clusters: L1, …, Lk
Candidate meta-paths: P1, …, PM
Output:
Weight of each meta-path: w1, …, wM
Clustering results that are
consistent with the user guidance
85. 85
PathSelClus: A Probabilistic Modeling Approach
Part 1: Modeling the Relationship Generation
A good clustering result should lead to high likelihood in observing existing
relationships
Higher quality relations should count more in the total likelihood
Part 2: Modeling the Guidance from Users
The more consistent with the guidance, the higher probability of the clustering
result
Part 3: Modeling the Quality Weights for Meta-Paths
The more consistent with the clustering result, the higher quality weight
93. 93
The Citation Recommendation Problem
Given a newly written manuscript (title, abstract and/or content) and its attributes
(authors, target venues), suggest a small number of papers as high quality references.
94. 94
Why ClusCite for Citation Recommendation?
Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang,
Jiawei Han, “ClusCite: Effective Citation Recommendation by Information
Network-Based Clustering”, KDD’14
Research papers need to cite relevant and important previous work
Help readers understand its background, context and innovation
The already large, rapidly growing body of scientific literature calls for automatic recommendations of high quality citations
Traditional literature search systems: infeasible to cast users’ rich information needs into queries consisting of just a few keywords
95. 95
Consider Distinct Citation Behavioral Patterns
Basic idea: Paper’s citations can be organized into different groups, each
having its own behavioral pattern to identify references of interest
A principled way to capture paper’s citation behaviors
More accurate approach: paper-specific recommendation model
Most existing studies: assume all papers adopt same criterion and follow
same behavioral pattern in citing other papers:
Context-based [He et al., WWW’10; Huang et al., CIKM’12]
Topical similarity-based [Nallapati et al., KDD’08; Tang et al., PAKDD’09]
Structural similarity-based [Liben-Nowell et al., CIKM’03; Strohman et al.,
SIGIR’07]
Hybrid methods [Bethard et al., CIKM’10; Yu et al., SDM’12]
96. 96
Observation: Distinct Citation Behavioral Patterns
Each group follows distinct behavioral patterns and adopts different criteria in deciding relevance and authority of a candidate paper
97. 97
Heterogeneous Bibliographic Network
A unified data model for bibliographic dataset (papers and their attributes)
• Captures paper-paper relevance of different semantics
• Enables authority propagation between different types of objects
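[Figure: network schema of the heterogeneous bibliographic network, with example venues KDD and WSDM and the example term “literature search”]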
98. 98
Citation Recommendation in Heterogeneous Networks
Given a heterogeneous bibliographic network and the terms, authors and target
venues of a query manuscript q, we aim to build a recommendation model
specifically for q, and recommend a small subset of papers as high quality
references for q
Heterogeneous bibliographic network
Query (to the system)
99. 99
The ClusCite Framework
We explore the principle that: citations tend to be softly clustered into different
interest groups, based on the heterogeneous network structures
Paper-specific recommendation:
by integrating learned models of
query’s related interest groups
Phase I: Joint Learning (offline)
Phase II: Recommendation (online)
For different interest groups, learn
distinct models on finding relevant
papers and judging authority of papers
Derive group membership for
query manuscript
100. 100
ClusCite: An Overview of the Proposed Model
How likely a query manuscript q will cite a candidate paper p (suppose K interest groups):
$$s(q, p) = \sum_{k=1}^{K} \theta_q^{(k)} \Big( \sum_{l=1}^{L} w_k^{(l)} S^{(q)}_{pl} + F_{P,kp} \Big)$$
$\theta_q^{(k)}$: query’s group membership
Term in parentheses: relative citation score (how likely q will cite p) within each group
$\sum_{l} w_k^{(l)} S^{(q)}_{pl}$: paper relative relevance (query-candidate paper)
$F_{P,kp}$: paper relative authority (candidate paper)
It is desirable to suggest papers that have high relative citation scores across
multiple related interest groups of the query manuscript
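The scoring rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the group memberships, feature weights, relevance features and relative authority scores are already learned and stored as nested lists, and the function name `citation_scores` and the toy numbers are hypothetical.

```python
def citation_scores(theta_q, W, S_q, F_P):
    """Score every candidate paper p for one query manuscript q.

    theta_q: K group memberships of the query
    W:       K x L feature weights (one row per interest group)
    S_q:     n x L meta path-based relevance features (row p = candidate p)
    F_P:     K x n relative authority of each candidate in each group
    """
    scores = []
    for p in range(len(S_q)):
        total = 0.0
        for k in range(len(theta_q)):
            # relative relevance of p to q within group k
            relevance = sum(w * s for w, s in zip(W[k], S_q[p]))
            # membership-weighted relative citation score of group k
            total += theta_q[k] * (relevance + F_P[k][p])
        scores.append(total)
    return scores


# Toy example (all numbers are made up): 2 groups, 2 features, 2 candidates.
theta_q = [0.8, 0.2]
W = [[1.0, 0.0], [0.0, 1.0]]
S_q = [[0.5, 0.1], [0.1, 0.9]]
F_P = [[0.2, 0.0], [0.0, 0.3]]

scores = citation_scores(theta_q, W, S_q, F_P)
ranking = sorted(range(len(scores)), key=lambda p: -scores[p])
```

Because the query belongs mostly to the first group, the candidate that is relevant and authoritative in that group ranks first, even though the other candidate scores higher within the second group.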
101. 101
Proposed Model I: Group Membership
Learn each query’s group membership: scalability & generalizability
Leverage the group memberships of related attribute objects to approximate
query’s group membership
Different types of attribute
objects (X =
authors/venues/terms)
Query’s related (linked)
objects of type-X
Attribute object’s group
membership (to learn)
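As a rough sketch of this approximation, the query's membership in each group can be assembled from the memberships of its linked authors, venues and terms, averaging within each attribute type. The per-type averaging follows the form of Eq. (4) in the paper excerpt; the function name, the dictionary layout and the toy values are assumptions made for illustration.

```python
def query_membership(neighbors, theta, num_groups):
    """Approximate a query's group membership from its attribute objects.

    neighbors: dict mapping attribute type ("A", "V", "T") to the list of
               objects of that type linked to the query
    theta:     dict mapping each attribute object to its K group memberships
    """
    theta_q = [0.0] * num_groups
    for objects in neighbors.values():
        if not objects:
            continue  # the query has no linked object of this type
        for x in objects:
            for k in range(num_groups):
                # each type contributes the average membership of its objects
                theta_q[k] += theta[x][k] / len(objects)
    return theta_q


# Hypothetical learned memberships for two interest groups.
theta = {
    "author_1": [1.0, 0.0],
    "author_2": [0.0, 1.0],
    "KDD": [0.6, 0.4],
}
neighbors = {"A": ["author_1", "author_2"], "V": ["KDD"], "T": []}
theta_q = query_membership(neighbors, theta, num_groups=2)
```

Only attribute objects need learned memberships, so a new query manuscript can be placed into groups without retraining, which is the scalability and generalizability point made on the slide.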
102. 102
Proposed Model II: Paper Relevance
meta path-based relevance score (l-th feature)
Network schema
103. 103
Proposed Model II: Paper Relevance
Relevance features play
different roles in
different interest groups
weights on different meta path-based features
104. 104
Proposed Model III: Object Relative Authority
Paper relative authority: A paper may have quite different visibility/authority
among different groups, even if it is highly cited overall
Relative authority propagation over the network
KDDWSDM
Relative authority in Group A
Relative authority in Group B
105. 105
Proposed Model III: Object Relative Authority
Paper relative authority: A paper may have quite different visibility/authority
among different groups, even if it is highly cited overall
Instead of deriving relative authority separately, we propose to estimate
them jointly using graph regularization, which preserves consistency over each
subgraph. By doing so, paper relative authority serves as a feature for learning
interest groups, and better estimated groups can in turn help derive relative
authority more accurately (Fig. 3).
We adopt the semi-supervised learning framework [8] that leads to iteratively
updating rules as authority propagation between different types of objects.
$$F_P = G_P(F_P, F_A, F_V; \lambda_A, \lambda_V), \quad F_A = G_A(F_P) \quad \text{and} \quad F_V = G_V(F_P). \tag{3}$$
We denote relative authority score matrices for paper, author and venue objects by
$F_P \in \mathbb{R}^{K \times n}$, $F_A \in \mathbb{R}^{K \times |A|}$ and $F_V \in \mathbb{R}^{K \times |V|}$.
Generally, in an interest group, relative importance of one type of object could be a
combination of the relative importance from different types of objects [23].
106. 106
Model Learning: A Joint Optimization Problem
A joint optimization problem:
Integrating the loss in Eq. (5) with graph regularization in Eq. (6), we formulate a
joint optimization problem following the graph-regularized semi-supervised learning
framework [9]:
$$\min_{P, W, F_P, F_A, F_V} \; \frac{1}{2} L + R + \frac{c_p}{2} \|P\|_F^2 + \frac{c_w}{2} \|W\|_F^2 \quad \text{s.t.} \quad P \geq 0; \; W \geq 0. \tag{7}$$
To ensure stability of the obtained solution, Tikhonov regularizers are imposed on
variables P and W [4], and we use $c_p, c_w > 0$ to control the strength of
regularization. In addition, we impose non-negativity constraints to make sure
learned group membership indicators and feature weights can provide semantic
meaning as we want.
4.2 The ClusCite Algorithm
Directly solving Eq. (7) is not easy because the objective function is non-convex.
We develop an alternating minimization algorithm, called ClusCite, which alternately
optimizes the problem with respect to each variable.
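... performance, which is defined as follows:
$$L = \sum_{i,j=1}^{n} M_{ij} \Big( Y_{ij} - \sum_{k=1}^{K} \sum_{l=1}^{L} \theta_{p_i}^{(k)} w_k^{(l)} S^{(i)}_{jl} - \sum_{k=1}^{K} \theta_{p_i}^{(k)} F_{P,kj} \Big)^2 = \sum_{i=1}^{n} \Big\| M_i \odot \big( Y_i - R_i P (W S^{(i)T} + F_P) \big) \Big\|_2^2. \tag{5}$$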
We define the weight indicator matrix $M \in \mathbb{R}^{n \times n}$ for the citation matrix,
where $M_{ij}$ takes value 1 if the citation relationship between $p_i$ and $p_j$ is observed
and 0 in other cases. By doing so, the model can focus on positive examples and
get rid of noise in the 0 values. One can also impose a degree of confidence by setting
different weights on different observed citations, but we leave this to future study.
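The effect of the mask can be seen in a small sketch of the weighted prediction error. The helper name `masked_squared_loss` and the toy matrices are hypothetical; the predictions `Y_hat` stand in for the model output and are not produced by ClusCite here.

```python
def masked_squared_loss(M, Y, Y_hat):
    """Sum of squared errors counted only where the mask M is non-zero."""
    return sum(
        M[i][j] * (Y[i][j] - Y_hat[i][j]) ** 2
        for i in range(len(Y))
        for j in range(len(Y[i]))
    )


# Toy citation matrix: paper 0 cites paper 1 and paper 1 cites paper 0.
Y = [[0, 1], [1, 0]]
M = [[0, 1], [1, 0]]              # 1 only where a citation is observed
Y_hat = [[0.9, 0.7], [0.4, 0.2]]  # made-up model predictions

masked = masked_squared_loss(M, Y, Y_hat)
unmasked = masked_squared_loss([[1, 1], [1, 1]], Y, Y_hat)
```

With the mask, the large prediction of 0.9 on an unobserved pair is not penalized, reflecting that a missing citation is not treated as a negative example.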
... and other conceptual information. Specifically, we approximate the query’s group
membership $\theta_q^{(k)}$ by a weighted combination of the group memberships of its
attribute objects:
$$\theta_q^{(k)} = \sum_{X \in \{A, V, T\}} \sum_{x \in N_X(q)} \frac{\theta_x^{(k)}}{|N_X(q)|}. \tag{4}$$
We use $N_X(q)$ to denote type X neighbors for query q, i.e., its attribute objects. How
likely a type X object is to belong to the k-th interest group is represented by $\theta_x^{(k)}$.
Paper-specific citation recommendation can be efficiently computed for each query
manuscript q by applying Eq. (1) with definitions in Eqs. (2), (3) and (4).
MODEL LEARNING
This section introduces the learning algorithm for the proposed citation
recommendation model in Eq. (1).
There are three sets of parameters in our model: group memberships for attribute
objects; feature weights for interest groups; and object relative authority within
each interest group. A straightforward way is to first conduct hard-clustering of
attribute objects based on the network and then derive feature weights and relative
authority for each cluster. Such a solution encounters several problems: (1) an object
may have multiple citation interests, (2) mined clusters may not properly capture
distinct citation ...
For ease of optimization, the loss can be further rewritten in a matrix form, where
matrix $P \in \mathbb{R}^{(|T|+|A|+|V|) \times K}$ is the group membership indicator for all attribute
objects while $R_i \in \mathbb{R}^{n \times (|T|+|A|+|V|)}$ is the corresponding neighbor indicator
matrix such that
$$R_i P = \sum_{X \in \{A, V, T\}} \sum_{x \in N_X(p_i)} \frac{\theta_x^{(k)}}{|N_X(p_i)|}.$$
Feature weights for each interest group are represented by each row of the matrix
$W \in \mathbb{R}^{K \times L}$, i.e., $W_{kl} = w_k^{(l)}$. The Hadamard product $\odot$ is used for the
matrix element-wise product.
As discussed in Sec. 3.3, to achieve authority learning jointly, we adopt graph
regularization to preserve consistency over the paper-author and paper-venue
subgraphs, which takes the following form:
$$R = \frac{\lambda_A}{2} \sum_{i=1}^{n} \sum_{j=1}^{|A|} R^{(A)}_{ij} \left\| \frac{F_{P,i}}{\sqrt{D^{(PA)}_{ii}}} - \frac{F_{A,j}}{\sqrt{D^{(AP)}_{jj}}} \right\|_2^2 + \frac{\lambda_V}{2} \sum_{i=1}^{n} \sum_{j=1}^{|V|} R^{(V)}_{ij} \left\| \frac{F_{P,i}}{\sqrt{D^{(PV)}_{ii}}} - \frac{F_{V,j}}{\sqrt{D^{(VP)}_{jj}}} \right\|_2^2. \tag{6}$$
The intuition behind the above two terms is natural: Linked objects in the
heterogeneous network are more likely to share similar relative authority scores [14].
To reduce impact of node popularity, we apply a normalization technique on authority
vectors, which helps suppress popular objects to keep them from dominating the
authority propagation. Each element in the diagonal matrix $D^{(PA)} \in \mathbb{R}^{n \times n}$ is the
degree of paper $p_i$ in subgraph $R^{(A)}$ while each element in $D^{(AP)} \in \mathbb{R}^{|A| \times |A|}$ is
the degree of author $a_j$ in subgraph $R^{(A)}$. Similarly, we can define the two diagonal
matrices for subgraph $R^{(V)}$.
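A minimal sketch of one graph regularization term (the paper-author part) may help make the normalization concrete. It assumes the degree normalization is a square root, which is what the exponent -1/2 in the normalized adjacency matrix suggests; the function name and the toy graph are hypothetical.

```python
import math


def graph_regularizer(R, F_P, F_A, lam):
    """Paper-author smoothness term over a bipartite graph.

    R:   n x m adjacency (R[i][j] = 1 if paper i is linked to author j)
    F_P: K x n relative authority of papers
    F_A: K x m relative authority of authors
    lam: trade-off weight for this subgraph
    """
    n, m, K = len(R), len(R[0]), len(F_P)
    deg_paper = [sum(R[i]) for i in range(n)]
    deg_author = [sum(R[i][j] for i in range(n)) for j in range(m)]
    total = 0.0
    for i in range(n):
        for j in range(m):
            if R[i][j] == 0:
                continue  # only linked pairs are asked to agree
            total += R[i][j] * sum(
                (F_P[k][i] / math.sqrt(deg_paper[i])
                 - F_A[k][j] / math.sqrt(deg_author[j])) ** 2
                for k in range(K)
            )
    return 0.5 * lam * total


# Toy graph: paper 0 has author 0; paper 1 has authors 0 and 1.
R = [[1, 0], [1, 1]]

# Scores that agree after degree normalization incur no penalty.
smooth = graph_regularizer(R, [[1.0, math.sqrt(2)]], [[math.sqrt(2), 1.0]], lam=1.0)

# A paper whose score differs from its author's is penalized.
rough = graph_regularizer(R, [[1.0, 0.0]], [[0.0, 0.0]], lam=2.0)
```

Dividing by the square root of the degree means a heavily linked object must hold a proportionally larger raw score to count as "similar" to its neighbors, which is how popular objects are kept from dominating.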
weighted model prediction error
Graph regularization to encode authority propagation
Algorithm: Alternating minimization (w.r.t. each variable)
107. 107
Learning Algorithm
Updating formula for alternating minimization
First, to learn group membership for attribute objects, we take the derivative of the
objective function in Eq. (7) with respect to P while fixing other variables, and apply
the Karush-Kuhn-Tucker complementary condition to impose the non-negativity
constraint [7]. With some simple algebraic operations, a multiplicative update formula
for P can be derived as follows:
$$P_{jk} \leftarrow P_{jk} \, \frac{\Big[ \sum_{i=1}^{n} R_i^T \tilde{Y}_i S^{(i)} W^T + L_{P1}^{+} + L_{P2}^{-} \Big]_{jk}}{\Big[ L_{P0} + L_{P1}^{-} + L_{P2}^{+} + c_p P \Big]_{jk}}, \tag{8}$$
where matrices $L_{P0}$, $L_{P1}$ and $L_{P2}$ are defined as follows:
$$L_{P0} = \sum_{i=1}^{n} R_i^T R_i P W \tilde{S}^{(i)T} \tilde{S}^{(i)} W^T; \qquad L_{P1} = \sum_{i=1}^{n} R_i^T \tilde{Y}_i F_P^T;$$
$$L_{P2} = \sum_{i=1}^{n} R_i^T R_i P \tilde{F}_P^{(i)} \tilde{F}_P^{(i)T} + \sum_{i=1}^{n} R_i^T R_i P W \tilde{S}^{(i)T} F_P^T + \sum_{i=1}^{n} R_i^T R_i P F_P \tilde{S}^{(i)} W^T.$$
In order to preserve non-negativity throughout the update, $L_{P1}$ is decomposed into
$L_{P1}^{-}$ and $L_{P1}^{+}$, where $A_{ij}^{+} = (|A_{ij}| + A_{ij})/2$ and $A_{ij}^{-} = (|A_{ij}| - A_{ij})/2$.
Similarly, we decompose $L_{P2}$ into $L_{P2}^{-}$ and $L_{P2}^{+}$, but note that the decomposition is
applied to each of the three components of $L_{P2}$, respectively.
We denote the masked $Y_i$ as $\tilde{Y}_i$, which is the Hadamard product of $M_i$ and $Y_i$.
Similarly, $\tilde{S}^{(i)}$ and $\tilde{F}_P^{(i)}$ denote row-wise masked $S^{(i)}$ and $F_P$ by $M_i$.
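The positive/negative split used above is easy to check numerically. The sketch below applies the two formulas element-wise to a small made-up matrix; the function name `split_pos_neg` is hypothetical.

```python
def split_pos_neg(A):
    """Split A into non-negative parts so that A = A_pos - A_neg."""
    A_pos = [[(abs(a) + a) / 2 for a in row] for row in A]
    A_neg = [[(abs(a) - a) / 2 for a in row] for row in A]
    return A_pos, A_neg


A = [[1.5, -2.0], [0.0, 3.0]]
A_pos, A_neg = split_pos_neg(A)

# Recombining the two parts recovers the original matrix.
recombined = [
    [p - q for p, q in zip(row_pos, row_neg)]
    for row_pos, row_neg in zip(A_pos, A_neg)
]
```

Since both parts are non-negative, placing one in the numerator and the other in the denominator of a multiplicative update keeps every factor non-negative, so an entry that starts non-negative stays non-negative.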
Second, to learn feature weights for interest groups, the multiplicative update
formula for W can be derived following a similar derivation as that of P, taking the form:
$$W_{kl} \leftarrow W_{kl} \, \frac{\Big[ \sum_{i=1}^{n} P^T R_i^T \tilde{Y}_i S^{(i)} + L_W^{-} \Big]_{kl}}{\Big[ \sum_{i=1}^{n} P^T R_i^T R_i P W \tilde{S}^{(i)T} \tilde{S}^{(i)} + L_W^{+} + c_w W \Big]_{kl}}, \tag{9}$$
where we have $L_W = \sum_{i=1}^{n} P^T R_i^T R_i P F_P \tilde{S}^{(i)}$.
Similarly, to preserve non-negativity of W, $L_W$ is decomposed into $L_W^{+}$ and $L_W^{-}$,
which can be computed in the same way as before.
Finally, we derive the authority propagation functions in Eq. (3) by optimizing the
objective function in Eq. (7) with respect to the authority score matrices of papers,
authors and venues.
The learning algorithm essentially accomplishes two things simultaneously and
iteratively: Co-clustering of attribute objects and relevance features with respect to
interest groups, and authority propagation between different objects. During an
iteration, different learning components will mutually enhance each other (Fig. 3):
Feature weights and relative authority can be more accurately derived with high quality
interest groups while in turn they serve as a good feature for learning high quality
interest groups.
Figure 3: Correlation between paper relative authority and # ground truth citations,
during different iterations. (Panels: Iteration 1, Iteration 5, Iteration 10; x-axis:
paper relative authority score; y-axis: # citations from test set.)
Specifically, we take the derivative of the objective function with respect to $F_P$,
$F_A$ and $F_V$, and follow traditional semi-supervised learning frameworks [8] to derive
the update rules, which take the form:
$$F_P = G_P(F_P, F_A, F_V; \lambda_A, \lambda_V) = \frac{1}{\lambda_A + \lambda_V} \big( \lambda_A F_A S_A^T + \lambda_V F_V S_V^T + L_{F_P} \big); \tag{10}$$
$$F_A = G_A(F_P) = F_P S_A; \tag{11}$$
$$F_V = G_V(F_P) = F_P S_V. \tag{12}$$
where we have normalized adjacency matrices and the paper authority guidance terms
defined as follows:
$$S_A = (D^{(PA)})^{-1/2} R^{(A)} (D^{(AP)})^{-1/2}$$
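The author update in Eq. (11) can be illustrated with a short sketch that builds the normalized adjacency matrix and propagates paper authority to authors. The helper names and the toy graph are assumptions for illustration; the paper update in Eq. (10) is not reproduced because its guidance term is not defined in the excerpt.

```python
import math


def normalized_adjacency(R):
    """S = D_row^(-1/2) R D_col^(-1/2) for a bipartite adjacency matrix R."""
    n, m = len(R), len(R[0])
    deg_row = [sum(R[i]) for i in range(n)]
    deg_col = [sum(R[i][j] for i in range(n)) for j in range(m)]
    return [
        [R[i][j] / math.sqrt(deg_row[i] * deg_col[j]) if R[i][j] else 0.0
         for j in range(m)]
        for i in range(n)
    ]


def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy paper-author graph: paper 0 has author 0; paper 1 has authors 0 and 1.
R_A = [[1, 0], [1, 1]]
S_A = normalized_adjacency(R_A)

# One interest group (K = 1) with made-up paper authority scores.
F_P = [[0.6, 0.4]]
F_A = matmul(F_P, S_A)  # author authority, as in F_A = F_P S_A
```

Author 0 collects authority from both papers, but each contribution is scaled down by the degrees of the paper and the author involved.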
Algorithm 1 Model Learning by ClusCite
Input: adjacency matrices {Y, S_A, S_V}, neighbor indicator R, mask matrix M,
meta path-based features {S^(i)}, parameters {λ_A, λ_V, c_w, c_p}, number of interest groups K
Output: group membership P, feature weights W, relative authority {F_P, F_A, F_V}
1: Initialize P, W with positive values, and {F_P, F_A, F_V} with citation counts from training set
2: repeat
3:   Update group membership P by Eq. (8)
4:   Update feature weights W by Eq. (9)
5:   Compute paper relative authority F_P by Eq. (10)
6:   Compute author relative authority F_A by Eq. (11)
7:   Compute venue relative authority F_V by Eq. (12)
8: until objective in Eq. (7) converges
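The control flow of Algorithm 1 is a standard alternating minimization loop: update one block of variables at a time, then stop once the objective stops changing. The sketch below shows that loop on a toy two-variable quadratic, not on the ClusCite objective; the function names, the tolerance and the toy problem are all assumptions chosen so the example stays self-contained.

```python
def alternating_minimization(objective, updates, state, tol=1e-10, max_iter=100):
    """Apply each block update in turn until the objective stops changing."""
    previous = objective(state)
    for iteration in range(1, max_iter + 1):
        for update in updates:
            state = update(state)  # minimize over one block, others fixed
        current = objective(state)
        if abs(previous - current) < tol:
            return state, iteration
        previous = current
    return state, max_iter


# Toy objective: f(x, y) = (x - 1)^2 + (y - 2)^2 + (x - y)^2.
def objective(state):
    x, y = state
    return (x - 1.0) ** 2 + (y - 2.0) ** 2 + (x - y) ** 2


def update_x(state):
    _, y = state
    return ((1.0 + y) / 2.0, y)  # exact minimizer over x with y fixed


def update_y(state):
    x, _ = state
    return (x, (2.0 + x) / 2.0)  # exact minimizer over y with x fixed


solution, iterations = alternating_minimization(
    objective, [update_x, update_y], (0.0, 0.0)
)
```

On this convex toy problem the loop reaches the global minimum at (4/3, 5/3). For a non-convex objective such as Eq. (7), the same loop is only expected to reach a local minimum.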
... membership matrix P by Eq. (8) takes $O(L|E|^3/n^3 + L^2|E|^2/n^2 + Kdn)$ time.
Learning the feature weights W by Eq. (9) takes $O(L|E|^3/n^3 + L^2|E|^2/n^2 + Kdn)$ time.
Updating all three relative authority matrices takes $O(L|E|^3/n^3 + |E| + \ldots)$
Table 3: Statistics of two bibliographic networks.
Data sets              DBLP       PubMed
# papers               137,298    100,215
# authors              135,612    212,312
# venues               2,639      2,319
# terms                29,814     37,618
# relationships        ∼2.3M      ∼3.6M
Paper avg citations    5.16       17.55
Table 4: Training, validation and testing paper subsets from the DBLP and PubMed datasets.
(a) The DBLP dataset
Subsets     Train                Validation     Test
Years       T0 = [1996, 2007]    T1 = [2008]    T2 = [2009, 2011]
# papers    62.23%               12.56%         25.21%
(b) The PubMed dataset
Subsets     Train                Validation     Test
Years       T0 = [1966, 2008]    T1 = [2009]    T2 = [2010, 2013]
# papers    64.50%               7.81%          27.69%
... set (T0) as the ground truth. Such an evaluation practice is more realistic because
a citation recommendation system ...
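The time-based split in Table 4 can be sketched as follows, using the DBLP year ranges from the table. The function name and the toy paper list are hypothetical, and papers published outside all three ranges are simply left out in this sketch.

```python
def split_by_year(papers, train_years, validation_years, test_years):
    """Assign each (paper_id, year) pair to a subset by publication year."""
    ranges = {
        "train": train_years,
        "validation": validation_years,
        "test": test_years,
    }
    splits = {name: [] for name in ranges}
    for paper_id, year in papers:
        for name, (first, last) in ranges.items():
            if first <= year <= last:
                splits[name].append(paper_id)
                break  # the ranges do not overlap, so stop at the first match
    return splits


# Made-up papers; the year ranges are the DBLP ones from Table 4.
papers = [("p1", 1999), ("p2", 2008), ("p3", 2010), ("p4", 2007), ("p5", 2012)]
splits = split_by_year(papers, (1996, 2007), (2008, 2008), (2009, 2011))
```

Splitting by year rather than at random means the model is always evaluated on papers published after everything it was trained on.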
108. 108
Learning Algorithm
Convergence (to local minimum):
Guaranteed by an analysis similar to block coordinate descent
Empirically converges in 50 iterations
Mutually enhancing:
The model learning does two things simultaneously and iteratively: (1) co-
clustering of attribute objects; (2) authority propagation within groups
Authority/relevance features can be better learned with high-quality groups
derived
Good features help derive high-quality groups
109. 109
Experimental Setup
Datasets
– DBLP: 137k papers; ~2.3M relationships; Avg # citations/paper: 5.16
– PubMed: 100k papers; ~3.6M relationships; Avg # citations/paper: 17.55
Evaluations
110. 110
Experimental Setup
Feature generation: meta paths
– (P – X – P)^y, where X = {A, V, T} and y = {1, 2, 3}
E.g., (P – X – P)^2 = P – X – P – X – P
– P – X – P → P
– P – X – P ← P
Feature generation: similarity measures
– PathSim [VLDB’11]: symmetric measure
– Random walk-based measure [SDM’12]: can be asymmetric
Combining meta paths with different measures, in total we have 24 meta path-
based relevance features
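As an illustration of one such feature, PathSim for the symmetric meta path P – A – P can be computed from a paper-author adjacency matrix. PathSim divides twice the number of path instances between two objects by the sum of their self-path counts. The helper names and the toy adjacency matrix below are assumptions made for the example.

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def pathsim(M):
    """PathSim from the commuting matrix M of a symmetric meta path."""
    n = len(M)
    return [
        [2 * M[x][y] / (M[x][x] + M[y][y]) if (M[x][x] + M[y][y]) else 0.0
         for y in range(n)]
        for x in range(n)
    ]


# Toy paper-author adjacency: 3 papers (rows) and 2 authors (columns).
W_PA = [[1, 1], [1, 0], [0, 1]]

# Commuting matrix of P - A - P: number of shared authors per paper pair.
M = matmul(W_PA, transpose(W_PA))
sim = pathsim(M)
```

Papers 0 and 1 share one author, so their similarity is 2 * 1 / (2 + 1) = 2/3, while papers 1 and 2 share none and score 0. The measure is symmetric by construction, which is why an asymmetric random walk-based measure is listed alongside it.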
111. 111
Example Output: Relative Authority Ranking
research area. This demonstrates that the relative authority
propagation can generate meaningful authority scores with
respect to different interest groups.
Table 6: Top-5 authority venues and authors from two example interest groups derived by ClusCite.
Rank    Venue              Author
Group I (database and information system)
1       VLDB    0.0763     Hector Garcia-Molina    0.0202
2       SIGMOD  0.0653     Christos Faloutsos      0.0187
3       TKDE    0.0651     Elisa Bertino           0.0180
4       CIKM    0.0590     Dan Suciu               0.0179
5       SIGKDD  0.0488     H. V. Jagadish          0.0178
Group II (computer vision and multimedia)
1       TPAMI   0.0733     Richard Szeliski        0.0139
2       ACM MM  0.0533     Jitendra Malik          0.0122
3       ICCV    0.0403     Luc Van Gool            0.0121
4       CVPR    0.0401     Andrew Blake            0.0117
5       ECCV    0.0393     Alex Pentland           0.0114
112. 112
Experimental Results
Case study on citation behavioral patterns
Each paper is assigned to the group with highest group membership score
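The hard assignment used for this case study amounts to an argmax over each paper's membership scores. A minimal sketch, with hypothetical paper names and made-up scores:

```python
def assign_group(membership):
    """Index of the interest group with the highest membership score."""
    return max(range(len(membership)), key=lambda k: membership[k])


# Made-up soft memberships over three interest groups.
memberships = {
    "paper_a": [0.1, 0.7, 0.2],
    "paper_b": [0.5, 0.3, 0.2],
}
assignment = {paper: assign_group(m) for paper, m in memberships.items()}
```

This hard assignment is only for inspecting the groups; the recommendation model itself keeps the soft memberships.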
113. 113
Experimental Results
Quality change of estimated paper relative authority ranking
Panels: Iteration 1, Iteration 5, Iteration 10 (x-axis: paper relative authority score;
y-axis: # citations from test set)
Correlation between paper relative authority score and # ground truth
citations, during different iterations
114. 114
Performance Comparison: Methods Compared
Compared methods:
1. BM25: content-based
2. PopRank [WWW’05]: heterogeneous link-based authority
3. TopicSim: topic-based similarity by LDA
4. Link-PLSA-LDA [KDD’08]: topic and link relevance
5. Meta-path based relevance:
1. L2-LR [SDM’12, WSDM’12]: logistic regression with L2 regularization
2. RankSVM [KDD’02]
6. MixSim: relevance, topic distribution, PopRank scores, using RankSVM
7. ClusCite: the proposed full-fledged model
8. ClusCite-Rel: with only relevance component
115. 115
Performance Comparison on DBLP and PubMed
17.68% improvement in Recall@50;
9.57% in MRR, over the best performing
compared method, on DBLP
Recommendation performance comparisons on DBLP and PubMed datasets in terms of Precision,
Recall and MRR. We set the number of interest groups to be 200 (K = 200) for ClusCite and ClusCite-Rel.
Method            DBLP                                      PubMed
                  P@10    P@20    R@20    R@50    MRR       P@10    P@20    R@20    R@50    MRR
BM25              0.1260  0.0902  0.1431  0.2146  0.4107    0.1847  0.1349  0.1754  0.2470  0.4971
PopRank           0.0112  0.0098  0.0155  0.0308  0.0451    0.0438  0.0314  0.0402  0.0814  0.2012
TopicSim          0.0328  0.0273  0.0432  0.0825  0.1161    0.0761  0.0685  0.0855  0.1516  0.3254
Link-PLSA-LDA     0.1023  0.0893  0.1295  0.1823  0.3748    0.1439  0.1002  0.1589  0.2015  0.4079
L2-LR             0.2274  0.1677  0.2471  0.3547  0.4866    0.2527  0.1959  0.2504  0.3981  0.5308
RankSVM           0.2372  0.1799  0.2733  0.3621  0.4989    0.2534  0.1954  0.2499  0.382   0.5187
MixFea (MixSim)   0.2261  0.1689  0.2473  0.3636  0.5002    0.2699  0.2025  0.2519  0.4021  0.5041
ClusCite-Rel      0.2402  0.1872  0.2856  0.4015  0.5156    0.2786  0.2221  0.2753  0.4305  0.5524
ClusCite          0.2429  0.1958  0.2993  0.4279  0.5481    0.3019  0.2434  0.3129  0.4587  0.5787
20.19% improvement in Precision@20;
14.79% in MRR, over the best performing
compared method, on PubMed
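The DBLP percentages quoted on this slide can be reproduced from the table as relative improvements of ClusCite over the strongest compared baseline for each metric. The sketch below uses the Recall@50 and MRR values from the DBLP columns; the function name is hypothetical.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# DBLP values from the table: ClusCite vs. the best compared baseline.
dblp_recall_at_50 = relative_improvement(0.4279, 0.3636)
dblp_mrr = relative_improvement(0.5481, 0.5002)
```

Both results agree with the quoted 17.68% and 9.57% up to rounding of the four-digit table entries.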
117. 117
Performance Analysis: Number of Attributes Each
Query Manuscript Has
Partition the test set
into six subsets:
Group 1 has least Avg#
attribute objects
Group 6 has most Avg#
attribute objects
When more attribute objects are
available for the query
manuscript, our model can
make better recommendations
118. 118
Performance Analysis: Model Generalization
Divide test set by year
Test the performance
change over different
query subsets
From 2010 to 2013,
MixSim drops
16.42% vs. ClusCite
drops only 7.72%
ClusCite framework: organize paper citations into interest groups, and design
cluster-based models, yielding paper-specific recommendation
Experimental results demonstrate significant improvement over state-of-the-art
methods
119. 119 119
Conclusions
Heterogeneous information networks are ubiquitous
Most datasets can be “organized” or “transformed” into “structured” multi-typed
heterogeneous info. networks
Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
Surprisingly rich knowledge can be mined from structured heterogeneous info.
networks
Clustering, ranking, classification, path prediction, ……
Knowledge is power, but knowledge is hidden in massive, but “relatively structured”
nodes and links!
Key issue: Construction of trusted, semi-structured heterogeneous networks from
unstructured data
From data to knowledge: Much more to be explored but heterogeneous network
mining has shown high promise!
120. 120
References
M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11
X. Ren, J. Liu, X. Yu, U. Khandelwal, Q. Gu, L. Wang, J. Han, “ClusCite: Effective Citation Recommendation by
Information Network-Based Clustering”, KDD’14
Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan &
Claypool Publishers, 2012
Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network
Schema", KDD’09
Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous
Information Networks”, VLDB'11
Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous
Bibliographic Networks", ASONAM'11
Y. Sun, J. Han, C. C. Aggarwal, N. Chawla, “When Will It Happen? Relationship Prediction in Heterogeneous
Information Networks”, WSDM'12
Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, T. Wu, "RankClus: Integrating Clustering with Ranking for Heterogeneous
Information Network Analysis", EDBT’09
X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han, "Personalized Entity Recommendation: A
Heterogeneous Information Network Approach", WSDM'14
121. 121
From Data to Networks to Knowledge: An Evolutionary Path!
Han, Kamber and Pei,
Data Mining, 3rd ed. 2011
Yu, Han and Faloutsos (eds.),
Link Mining: Models, Algorithms
and Applications, 2010
Sun and Han, Mining Heterogeneous
Information Networks, 2012
Y. Sun: SIGKDD’13 Dissertation Award
Wang and Han, Mining Latent Entity
Structures, 2015
C. Wang: SIGKDD’13 Dissertation Award