The document presents efficient algorithms for association finding and frequent association pattern mining in large graph data. It describes the problems of finding all associations connecting a set of query entities within a diameter constraint and mining frequent association patterns. The basic solutions and optimizations for association finding using distance-based pruning and distance oracles are discussed. For frequent pattern mining, it addresses generating a canonical code to uniquely represent patterns and counting code occurrences to determine frequency. Experiments on real datasets demonstrate the efficiency and scalability of the approaches.
Curtain call of zooey - what i've learned in yahoo羽祈 張
This document summarizes the author's 4 years of work experience at Yahoo. It describes their roles and accomplishments in frontend development, backend development, and machine learning model development over 1.5 to 2 year periods in each role. It also discusses lessons learned around project management, communication, analysis, automation, and innovation. The author reflects on balancing work with fun activities like after-work study groups and company-wide events.
Towards Content-Based Dataset Search - Test Collections and BeyondGong Cheng
The document discusses content-based dataset search (CBDS) as an improvement over metadata-based dataset search (MBDS). It presents ACORDAR, a test collection for ad hoc CBDS using synthetic and TREC queries on RDF datasets. Evaluation results showed that both metadata and dataset content are useful, and that TREC queries are more difficult. CBDS faces challenges including scalability, tractability, and heterogeneity, but is likely to trend as it provides higher relevance and explainability than MBDS.
Generating Compact and Relaxable Answers to Keyword Queries over Knowledge Gr...Gong Cheng
This document presents an algorithm called CORE for generating compact yet relaxable answers to keyword queries over knowledge graphs. CORE aims to balance answer compactness, defined as having a bounded diameter, with answer completeness, defined as covering the most query keywords. It provides theoretical foundations for the existence of such answers and uses a best-first search approach. An evaluation shows CORE efficiently computes answers that are more complete than alternatives while remaining compact.
Curtain call of zooey - what i've learned in yahoo羽祈 張
This document summarizes the author's 4 years of work experience at Yahoo. It describes their roles and accomplishments in frontend development, backend development, and machine learning model development over 1.5 to 2 year periods in each role. It also discusses lessons learned around project management, communication, analysis, automation, and innovation. The author reflects on balancing work with fun activities like after-work study groups and company-wide events.
Towards Content-Based Dataset Search - Test Collections and BeyondGong Cheng
The document discusses content-based dataset search (CBDS) as an improvement over metadata-based dataset search (MBDS). It presents ACORDAR, a test collection for ad hoc CBDS using synthetic and TREC queries on RDF datasets. Evaluation results showed that both metadata and dataset content are useful, and that TREC queries are more difficult. CBDS faces challenges including scalability, tractability, and heterogeneity, but is likely to trend as it provides higher relevance and explainability than MBDS.
Generating Compact and Relaxable Answers to Keyword Queries over Knowledge Gr...Gong Cheng
This document presents an algorithm called CORE for generating compact yet relaxable answers to keyword queries over knowledge graphs. CORE aims to balance answer compactness, defined as having a bounded diameter, with answer completeness, defined as covering the most query keywords. It provides theoretical foundations for the existence of such answers and uses a best-first search approach. An evaluation shows CORE efficiently computes answers that are more complete than alternatives while remaining compact.
Semantic Data Retrieval: Search, Ranking, and SummarizationGong Cheng
Gong Cheng presented on semantic data retrieval, including entity retrieval and association retrieval from semantic graphs. He discussed two main challenges: efficiently searching large graphs for associations within a diameter bound, and ranking the retrieved associations. For the first challenge, he proposed algorithms using path finding, pruning, and result deduplication. For the second challenge, he conducted a user study and found that association size was the most important ranking factor. Other proposed measures like entity homogeneity and relation heterogeneity had mixed user preferences.
Semantic Web related top conference reviewGong Cheng
The document summarizes key topics in semantic web and knowledge graph research from 2014-2017, including conferences, hot research areas, applications, and papers. It discusses trends such as increasing focus on knowledge graph applications, integration, and construction using techniques like neural networks. Notable news includes Google calling for dataset metadata and Wikidata creating its 31 millionth entity. The road ahead may involve greater knowledge graph commercialization, enrichment, and making knowledge graphs more accessible on the Web.
The document proposes a new approach called relatedness-based multi-entity summarization (MES) to generate concise summaries of related entities. It formulates MES as a quadratic multidimensional knapsack problem (QMKP) to select important and diverse intra-entity features while also selecting inter-entity features that indicate relatedness. It presents an algorithm called REMES based on the grasp heuristic to solve the QMKP formulation. A user study shows REMES outperforms other entity summarization methods at multi-entity summarization tasks.
Generating Illustrative Snippets for Open Data on the WebGong Cheng
We propose generating illustrative snippets from datasets to serve with metadata on dataset search engines. Currently, only metadata is shown. Snippets would help users understand the contents faster by covering important types and entities, using familiar entities, and keeping entities related. We formulate the snippet generation as a maximum-weight-and-coverage connected graph problem to optimize for these qualities. Experimental results show our snippets outperform baselines.
This document discusses summarizing semantic data, including entity descriptions, entity associations, and semantic datasets. It describes extractive and abstractive summarization methods. For entity descriptions, intrinsic metrics like frequency, centrality, informativeness, and diversity are used to rank property-value pairs for the summary. Extrinsic metrics also utilize external knowledge and context. Similar methods are applied to summarizing entity associations by ranking paths between entities. Summarizing semantic datasets involves selecting a representative subset of the data.
HIEDS: A Generic and Efficient Approach to Hierarchical Dataset SummarizationGong Cheng
This document summarizes the HIEDS approach to hierarchical dataset summarization. HIEDS aims to provide multigranular summaries that preserve dataset structure and are comprehensible. It models summarization as a multidimensional knapsack problem to maximize subgroup cohesion and moderateness while disallowing large overlap. HIEDS uses a greedy strategy for efficient solving but requires non-trivial implementation. Experiments show HIEDS outperforms the baseline by generating hierarchical rather than flat groups with better trade-offs and less redundancy.
Taking up the Gaokao Challenge: An Information Retrieval ApproachGong Cheng
This document describes an information retrieval approach for answering questions from the Gaokao, China's national college entrance exam. It retrieves relevant concept pages, quotes, and disambiguates terms from Wikipedia. It ranks pages based on centrality within vector spaces of words, links, and categories, filtering within relevant historical categories. It assesses answer options based on the extent the question and pages can entail each option. In experiments, it correctly answered 43.09% of questions answerable from Wikipedia and 31.28% of questions outside Wikipedia's scope.
Summarizing Entity Descriptions for Effective and Efficient Human-centered En...Gong Cheng
Presented at WWW'15, Florence.
Gong Cheng, Danyun Xu, Yuzhong Qu. Summarizing Entity Descriptions for Effective and Efficient Human-centered Entity Linking. In Proceedings of the 24th International World Wide Web Conference (WWW), pages 184--194, 2015.
Explass: Exploring Associations between Entities via Top-K Ontological Patter...Gong Cheng
This document describes Explass, a system for exploring associations between entities via top-k ontological patterns and facets. It discusses challenges in exploring the over 1,000 associations within 4 hops in DBpedia and proposes two exploration methods: clustering associations into patterns and using entity/property classes as facets. The key steps involve mining significant patterns as frequent itemsets and selecting k patterns based on frequency, informativeness, and overlap. A demo of Explass on DBpedia is presented along with results of a user study comparing it to other approaches.
Facilitating Human Intervention in Coreference Resolution with Comparative En...Gong Cheng
The document presents a method for facilitating human intervention in coreference resolution by providing comparative entity summaries. It describes using properties and values of candidate coreferent entities to generate summaries that reflect their commonality and differences. The optimal summary maximizes commonality, difference, identity information and diversity, subject to a length limit. An evaluation involved human subjects verifying coreferent relationships for different summarization approaches. The comparative summary approach was found to improve verification efficiency without affecting accuracy as much as only showing common properties or entire descriptions.
Towards Exploratory Relationship Search: A Clustering-based ApproachGong Cheng
This document presents an approach for exploratory relationship search through hierarchical clustering. It aims to address the challenge of too many relationship search results by organizing them into a cluster hierarchy based on common relationship patterns. An evaluation with participants performing lookup and exploratory search tasks on DBpedia data found that the clustering approach outperformed simple listing and faceted categorization alternatives. User feedback suggested areas for improvement like more concise visualizations and cognitive support. The authors conclude it is a promising approach and future work could combine facets and clustering or explore alternatives.
The document describes the NanJing Vocabulary Repository (NJVR), a freely accessible collection of real-world vocabularies created by crawling over 4 billion RDF triples from thousands of domains. NJVR contains RDF descriptions of over 2,900 vocabularies identified from 261 domains as well as statistical data on their usage. It was constructed through crawling, vocabulary identification, and analysis of vocabulary instantiations. The goal is to provide a large test collection for research on topics like vocabulary ranking and matching.
BipRank: Ranking and Summarizing RDF Vocabulary DescriptionsGong Cheng
This document describes an approach called BipRank for ranking and summarizing RDF vocabulary descriptions. BipRank models the relationships between terms and sentences using a bipartite graph and ranks sentences based on their salience. It also considers the patterns of RDF sentences. The top-ranked sentences are then selected to generate a summary. Evaluation shows that BipRank correlates better with human judgments of sentence importance than prior work. It also generates summaries that experts rate as more relevant and cohesive.
An Empirical Study of Vocabulary Relatedness and Its Application to Recommend...Gong Cheng
This document discusses measuring vocabulary relatedness and its application to recommender systems. It presents 6 measures of vocabulary relatedness based on semantic relatedness, content similarity, expressivity closeness, and distributional relatedness. It analyzes the measures empirically using a dataset of 2,996 vocabularies and 4 billion RDF triples. The measures can be applied to post-selection vocabulary recommendation in vocabulary search tools.
RELIN: Relatedness and Informativeness-based Centrality for Entity SummarizationGong Cheng
This document presents RELIN, a model for entity summarization that ranks features based on relatedness and informativeness. RELIN extends PageRank by making the transition probabilities between features proportional to their relatedness and the amount of new information provided. It defines the relatedness between features based on the relatedness of their properties and values. Informativeness is measured by self-information which favors features that are less likely to co-occur. The model was implemented to rank features for entity summarization using these relatedness and informativeness measures.
MyView is a Linked Data browser designed for citizen users that allows for link traversal and filtering, customization of links and views, browsing history and bookmarks, online reasoning, result explanation, and coreference resolution. It was created by Gong Cheng of Nanjing University.
Signatures of wave erosion in Titan’s coastsSérgio Sacani
The shorelines of Titan’s hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it isunclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theo-retical models suggest that wind may cause waves to form on Titan’s seas, potentially driving coastal erosion,but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titanremain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively dis-cern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combinelandscape evolution models with measurements of shoreline shape on Earth to characterize how differentcoastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that theshorelines of Titan’s seas are most consistent with flooded landscapes that subsequently have been eroded bywaves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates atfetch lengths of tens of kilometers.
Semantic Data Retrieval: Search, Ranking, and SummarizationGong Cheng
Gong Cheng presented on semantic data retrieval, including entity retrieval and association retrieval from semantic graphs. He discussed two main challenges: efficiently searching large graphs for associations within a diameter bound, and ranking the retrieved associations. For the first challenge, he proposed algorithms using path finding, pruning, and result deduplication. For the second challenge, he conducted a user study and found that association size was the most important ranking factor. Other proposed measures like entity homogeneity and relation heterogeneity had mixed user preferences.
Semantic Web related top conference reviewGong Cheng
The document summarizes key topics in semantic web and knowledge graph research from 2014-2017, including conferences, hot research areas, applications, and papers. It discusses trends such as increasing focus on knowledge graph applications, integration, and construction using techniques like neural networks. Notable news includes Google calling for dataset metadata and Wikidata creating its 31 millionth entity. The road ahead may involve greater knowledge graph commercialization, enrichment, and making knowledge graphs more accessible on the Web.
The document proposes a new approach called relatedness-based multi-entity summarization (MES) to generate concise summaries of related entities. It formulates MES as a quadratic multidimensional knapsack problem (QMKP) to select important and diverse intra-entity features while also selecting inter-entity features that indicate relatedness. It presents an algorithm called REMES based on the grasp heuristic to solve the QMKP formulation. A user study shows REMES outperforms other entity summarization methods at multi-entity summarization tasks.
Generating Illustrative Snippets for Open Data on the WebGong Cheng
We propose generating illustrative snippets from datasets to serve with metadata on dataset search engines. Currently, only metadata is shown. Snippets would help users understand the contents faster by covering important types and entities, using familiar entities, and keeping entities related. We formulate the snippet generation as a maximum-weight-and-coverage connected graph problem to optimize for these qualities. Experimental results show our snippets outperform baselines.
This document discusses summarizing semantic data, including entity descriptions, entity associations, and semantic datasets. It describes extractive and abstractive summarization methods. For entity descriptions, intrinsic metrics like frequency, centrality, informativeness, and diversity are used to rank property-value pairs for the summary. Extrinsic metrics also utilize external knowledge and context. Similar methods are applied to summarizing entity associations by ranking paths between entities. Summarizing semantic datasets involves selecting a representative subset of the data.
HIEDS: A Generic and Efficient Approach to Hierarchical Dataset SummarizationGong Cheng
This document summarizes the HIEDS approach to hierarchical dataset summarization. HIEDS aims to provide multigranular summaries that preserve dataset structure and are comprehensible. It models summarization as a multidimensional knapsack problem to maximize subgroup cohesion and moderateness while disallowing large overlap. HIEDS uses a greedy strategy for efficient solving but requires non-trivial implementation. Experiments show HIEDS outperforms the baseline by generating hierarchical rather than flat groups with better trade-offs and less redundancy.
Taking up the Gaokao Challenge: An Information Retrieval ApproachGong Cheng
This document describes an information retrieval approach for answering questions from the Gaokao, China's national college entrance exam. It retrieves relevant concept pages, quotes, and disambiguates terms from Wikipedia. It ranks pages based on centrality within vector spaces of words, links, and categories, filtering within relevant historical categories. It assesses answer options based on the extent the question and pages can entail each option. In experiments, it correctly answered 43.09% of questions answerable from Wikipedia and 31.28% of questions outside Wikipedia's scope.
Summarizing Entity Descriptions for Effective and Efficient Human-centered En...Gong Cheng
Presented at WWW'15, Florence.
Gong Cheng, Danyun Xu, Yuzhong Qu. Summarizing Entity Descriptions for Effective and Efficient Human-centered Entity Linking. In Proceedings of the 24th International World Wide Web Conference (WWW), pages 184--194, 2015.
Explass: Exploring Associations between Entities via Top-K Ontological Patter...Gong Cheng
This document describes Explass, a system for exploring associations between entities via top-k ontological patterns and facets. It discusses challenges in exploring the over 1,000 associations within 4 hops in DBpedia and proposes two exploration methods: clustering associations into patterns and using entity/property classes as facets. The key steps involve mining significant patterns as frequent itemsets and selecting k patterns based on frequency, informativeness, and overlap. A demo of Explass on DBpedia is presented along with results of a user study comparing it to other approaches.
Facilitating Human Intervention in Coreference Resolution with Comparative En...Gong Cheng
The document presents a method for facilitating human intervention in coreference resolution by providing comparative entity summaries. It describes using properties and values of candidate coreferent entities to generate summaries that reflect their commonality and differences. The optimal summary maximizes commonality, difference, identity information and diversity, subject to a length limit. An evaluation involved human subjects verifying coreferent relationships for different summarization approaches. The comparative summary approach was found to improve verification efficiency without affecting accuracy as much as only showing common properties or entire descriptions.
Towards Exploratory Relationship Search: A Clustering-based ApproachGong Cheng
This document presents an approach for exploratory relationship search through hierarchical clustering. It aims to address the challenge of too many relationship search results by organizing them into a cluster hierarchy based on common relationship patterns. An evaluation with participants performing lookup and exploratory search tasks on DBpedia data found that the clustering approach outperformed simple listing and faceted categorization alternatives. User feedback suggested areas for improvement like more concise visualizations and cognitive support. The authors conclude it is a promising approach and future work could combine facets and clustering or explore alternatives.
The document describes the NanJing Vocabulary Repository (NJVR), a freely accessible collection of real-world vocabularies created by crawling over 4 billion RDF triples from thousands of domains. NJVR contains RDF descriptions of over 2,900 vocabularies identified from 261 domains as well as statistical data on their usage. It was constructed through crawling, vocabulary identification, and analysis of vocabulary instantiations. The goal is to provide a large test collection for research on topics like vocabulary ranking and matching.
BipRank: Ranking and Summarizing RDF Vocabulary DescriptionsGong Cheng
This document describes an approach called BipRank for ranking and summarizing RDF vocabulary descriptions. BipRank models the relationships between terms and sentences using a bipartite graph and ranks sentences based on their salience. It also considers the patterns of RDF sentences. The top-ranked sentences are then selected to generate a summary. Evaluation shows that BipRank correlates better with human judgments of sentence importance than prior work. It also generates summaries that experts rate as more relevant and cohesive.
An Empirical Study of Vocabulary Relatedness and Its Application to Recommend...Gong Cheng
This document discusses measuring vocabulary relatedness and its application to recommender systems. It presents 6 measures of vocabulary relatedness based on semantic relatedness, content similarity, expressivity closeness, and distributional relatedness. It analyzes the measures empirically using a dataset of 2,996 vocabularies and 4 billion RDF triples. The measures can be applied to post-selection vocabulary recommendation in vocabulary search tools.
RELIN: Relatedness and Informativeness-based Centrality for Entity SummarizationGong Cheng
This document presents RELIN, a model for entity summarization that ranks features based on relatedness and informativeness. RELIN extends PageRank by making the transition probabilities between features proportional to their relatedness and the amount of new information provided. It defines the relatedness between features based on the relatedness of their properties and values. Informativeness is measured by self-information which favors features that are less likely to co-occur. The model was implemented to rank features for entity summarization using these relatedness and informativeness measures.
MyView is a Linked Data browser designed for citizen users that allows for link traversal and filtering, customization of links and views, browsing history and bookmarks, online reasoning, result explanation, and coreference resolution. It was created by Gong Cheng of Nanjing University.
Signatures of wave erosion in Titan’s coastsSérgio Sacani
The shorelines of Titan’s hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it isunclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theo-retical models suggest that wind may cause waves to form on Titan’s seas, potentially driving coastal erosion,but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titanremain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively dis-cern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combinelandscape evolution models with measurements of shoreline shape on Earth to characterize how differentcoastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that theshorelines of Titan’s seas are most consistent with flooded landscapes that subsequently have been eroded bywaves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates atfetch lengths of tens of kilometers.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Creative-Biolabs
Neutralizing antibodies, pivotal in immune defense, specifically bind and inhibit viral pathogens, thereby playing a crucial role in protecting against and mitigating infectious diseases. In this slide, we will introduce what antibodies and neutralizing antibodies are, the production and regulation of neutralizing antibodies, their mechanisms of action, classification and applications, as well as the challenges they face.
Microbial interaction
Microorganisms interacts with each other and can be physically associated with another organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont or located within another organism as endobiont.
Microbial interaction may be positive such as mutualism, proto-cooperation, commensalism or may be negative such as parasitism, predation or competition
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as the relationship in which each organism in interaction gets benefits from association. It is an obligatory relationship in which mutualist and host are metabolically dependent on each other.
Mutualistic relationship is very specific where one member of association cannot be replaced by another species.
Mutualism require close physical contact between interacting organisms.
Relationship of mutualism allows organisms to exist in habitat that could not occupied by either species alone.
Mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are the association of specific fungi and certain genus of algae. In lichen, fungal partner is called mycobiont and algal partner is called
II. Syntrophism:
It is an association in which the growth of one organism either depends on or improved by the substrate provided by another organism.
In syntrophism both organism in association gets benefits.
Compound A
Utilized by population 1
Compound B
Utilized by population 2
Compound C
utilized by both Population 1+2
Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B but cannot metabolize beyond compound B without co-operation of population 2. Population 2is unable to utilize compound A but it can metabolize compound B forming compound C. Then both population 1 and 2 are able to carry out metabolic reaction which leads to formation of end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane produced by methanogenic bacteria depends upon interspecies hydrogen transfer by other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 utilizing carbohydrates which is then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arobinosus and Enterococcus faecalis:
In the minimal media, Lactobacillus arobinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arobinosus occurs in which E. faecalis require folic acid
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...Sérgio Sacani
We present the JWST discovery of SN 2023adsy, a transient object located in a host galaxy JADES-GS
+
53.13485
−
27.82088
with a host spectroscopic redshift of
2.903
±
0.007
. The transient was identified in deep James Webb Space Telescope (JWST)/NIRCam imaging from the JWST Advanced Deep Extragalactic Survey (JADES) program. Photometric and spectroscopic followup with NIRCam and NIRSpec, respectively, confirm the redshift and yield UV-NIR light-curve, NIR color, and spectroscopic information all consistent with a Type Ia classification. Despite its classification as a likely SN Ia, SN 2023adsy is both fairly red (
�
(
�
−
�
)
∼
0.9
) despite a host galaxy with low-extinction and has a high Ca II velocity (
19
,
000
±
2
,
000
km/s) compared to the general population of SNe Ia. While these characteristics are consistent with some Ca-rich SNe Ia, particularly SN 2016hnk, SN 2023adsy is intrinsically brighter than the low-
�
Ca-rich population. Although such an object is too red for any low-
�
cosmological sample, we apply a fiducial standardization approach to SN 2023adsy and find that the SN 2023adsy luminosity distance measurement is in excellent agreement (
≲
1
�
) with
Λ
CDM. Therefore unlike low-
�
Ca-rich SNe Ia, SN 2023adsy is standardizable and gives no indication that SN Ia standardized luminosities change significantly with redshift. A larger sample of distant SNe Ia is required to determine if SN Ia population characteristics at high-
�
truly diverge from their low-
�
counterparts, and to confirm that standardized luminosities nevertheless remain constant with redshift.
Embracing Deep Variability For Reproducibility and Replicability
Abstract: Reproducibility (aka determinism in some cases) constitutes a fundamental aspect in various fields of computer science, such as floating-point computations in numerical analysis and simulation, concurrency models in parallelism, reproducible builds for third parties integration and packaging, and containerization for execution environments. These concepts, while pervasive across diverse concerns, often exhibit intricate inter-dependencies, making it challenging to achieve a comprehensive understanding. In this short and vision paper we delve into the application of software engineering techniques, specifically variability management, to systematically identify and explicit points of variability that may give rise to reproducibility issues (eg language, libraries, compiler, virtual machine, OS, environment variables, etc). The primary objectives are: i) gaining insights into the variability layers and their possible interactions, ii) capturing and documenting configurations for the sake of reproducibility, and iii) exploring diverse configurations to replicate, and hence validate and ensure the robustness of results. By adopting these methodologies, we aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical aspects.
https://hal.science/hal-04582287
Efficient Algorithms for Association Finding and Frequent Association Pattern Mining
1. Efficient Algorithms for Association Finding and
Frequent Association Pattern Mining
Gong Cheng, Daxin Liu, Yuzhong Qu
Websoft Research Group
National Key Laboratory for Novel Software Technology
Nanjing University, China
Websoft
2. Background and motivation
• To suggest friends, recognize suspected terrorists,
answer questions … based on massive graph data
3. Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query entities, and
• is connected.
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
4. Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query entities, and
• is connected.
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
5. Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query entities, and
• is connected.
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
6. Problem statement
An association connecting a set of query entities is
a minimal subgraph that
• contains all the query entities, and
• is connected.
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
• Tree-structured
• Leaves ⊆ Query entities
7. Problem statement
1. How to efficiently find associations in a possibly
very large graph?
2. How to help users explore a possibly large set of
associations that have been found?
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
8. Problem statement
1. How to efficiently find associations in a possibly
very large graph?
2. How to help users explore a possibly large set of
associations that have been found?
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
Association finding
Frequent association
pattern mining
9. Association finding: Problem
• To find all the associations having a limited diameter
(Diameter = Greatest distance between any pair of vertices)
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
10. Association finding: Basic solution
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Chris
attended
Paper-AisAuthorOf
acceptedAt
ISWC
Alice
Bob
reviewer
ISWC
ISWC
An association A set of paths
11. Association finding: Basic solution
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Chris
attended
Paper-AisAuthorOf
acceptedAt
ISWC
Alice
Bob
reviewer
ISWC
ISWC
1. Path finding
2. Path merging
12. Association finding: Optimization
• Distance-based search space pruning
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
• DiameterConstraint ≤ 3
• Length(AliceDan) = 1
• Distance(Dan, Bob) = 4
Length+Distance > DiameterConstraint
13. Association finding: Optimization
• Distance computation
• Materializing offline computed results: O(V2) space
• Online computing: O(E) time per pair
• Using distance oracle: a space-time trade-off
Chris
Bob
Paper-A
Paper-B
Dan
isAuthorOf
knows
correspondingAuthor
acceptedAt
Ellen
knows
ISWC
isAuthorOf
COLD
acceptedAt
attended
attended
attended
reviewer
reviewer
Frank
knows
Alice
16. Association finding: Deduplication
• Canonical code
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper-A))$
= … Paper-A,acceptedAt,code(Tree(ISWC))$ …
= … ISWC,reviewer,code(Tree(Bob)),~attended,code(Tree(Chris))$ …
= … Bob$ …
(Assuming Bob precedes Chris)
17. Frequent association pattern mining
• Association pattern: A conceptual abstract that
summarizes a group of associations
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Association Association pattern
(Intermediate entity Class)
Chris
Bob
PaperisAuthorOf
acceptedAt
Conference attended
reviewer
Alice
18. Frequent association pattern mining
• Problem: To mine all the association patterns matched by
more than a threshold proportion of associations
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Association Association pattern
(Intermediate entity Class)
Chris
Bob
PaperisAuthorOf
acceptedAt
Conference attended
reviewer
Alice
19. Frequent association pattern mining
• Basic solution:
Chris
Bob
Paper-AisAuthorOf
acceptedAt
ISWC attended
reviewer
Alice
Association Association pattern
(Intermediate entity Class)
Chris
Bob
PaperisAuthorOf
acceptedAt
Conference attended
reviewer
Alice
Calculating the frequency of an association pattern
= Counting the occurrence of its canonical code
20. Frequent association pattern mining
• Canonical code
Chris
Bob
PaperisAuthorOf
acceptedAt
Conference
attended
reviewer
Alice Paper
isAuthorOf acceptedAt
Conference
code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper)),isAuthorOf,code(Tree(Paper))$
Paper equals Paper.
So which subtree should go first?
(Canonical code may not be unique!)?
?
21. Frequent association pattern mining
• Canonical code
Chris
Bob
PaperisAuthorOf
acceptedAt
Conference
attended
reviewer
Alice Paper
isAuthorOf acceptedAt
Conference
code(Tree(Alice)) = Alice,isAuthorOf,code(Tree(Paper)),isAuthorOf,code(Tree(Paper))$
Smallest leaf as its proxy to be compared
Equality would never happen.
(Canonical code is now unique!)
(Assuming Bob precedes Chris)
22. Experiments
• Datasets
• LinkedMDB: 1M vertices and 2M arcs
• DBpedia (2015-04): 4M vertices and 15M arcs
• Parameter settings
• Diameter constraint (λ): 2, 4
• Number of query entities (n): 2, 3, 4, 5
• Test queries
• 1,000 random sets of query entities under each setting of λ and n
• Hardware configuration
• 3.3GHz CPU, 24GB memory
• Data graphs: in memory
• Distance oracles: on disk
23. Experiments
• Results: Association finding
BSC: Basic solution (not pruning)
PRN: Optimized solution (distance-based pruning)
PRN-1: Optimized solution (distance-based pruning except for the last level of search)
25. Takeaway messages
• Subgraph finding and mining are faster than what we expected.
• Consider distance oracle and canonical code in your own research.
26. Takeaway messages
• Subgraph finding and mining are faster than what we expected.
• Consider distance oracle and canonical code in your own research.