The document discusses higher-order link prediction, which aims to predict the formation of new groups or "simplices" containing more than two nodes, based on structural properties in timestamped simplex data from various domains. It finds that predicting the closure of open triangles (where a pair of nodes have interacted but not with the third) performs well, and that simply averaging the edge weights in a triangle is often a good predictor. Predicting new structures in communication, collaboration and proximity networks can provide insights beyond classical link prediction.
Presentation given at the International Conference on Scalable Uncertainty Management, 2-5 Oct 2018, Milan, Italy.
Paper: https://research.utwente.nl/en/publications/rule-based-conditioning-of-probabilistic-data
Data interoperability is a major issue in data management for data science and big data analytics. Probabilistic data integration (PDI) is a specific kind of data integration where extraction and integration problems such as inconsistency and uncertainty are handled by means of a probabilistic data representation. This allows a data integration process with two phases: (1) a quick partial integration where data quality problems are represented as uncertainty in the resulting integrated data, and (2) using the uncertain data and continuously improving its quality as more evidence is gathered. The main contribution of this paper is an iterative approach for incorporating evidence of users in the probabilistically integrated data. Evidence can be specified as hard or soft rules (i.e., rules that are uncertain themselves).
We present a linear regression method for predictions on a small data set, making use of a second, possibly biased data set that may be much larger. Our method fits linear regressions to the two data sets while penalizing the difference between predictions made by those two models.
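The penalized joint fit described in this abstract can be sketched as follows; this is an illustrative reconstruction, not the paper's exact formulation: the quadratic prediction-gap penalty `lam` and the stacked least-squares solution are my assumptions.

```python
import numpy as np

def joint_fit(X1, y1, X2, y2, lam=1.0):
    """Fit two linear models, one per data set, while penalizing the
    difference of their predictions on the small data set X1 (sketch)."""
    d = X1.shape[1]
    Z1 = np.hstack([X1, np.zeros_like(X1)])   # rows fitting the small set
    Z2 = np.hstack([np.zeros_like(X2), X2])   # rows fitting the large set
    P = np.sqrt(lam) * np.hstack([X1, -X1])   # prediction-gap penalty rows
    A = np.vstack([Z1, Z2, P])
    b = np.concatenate([y1, y2, np.zeros(len(X1))])
    w = np.linalg.lstsq(A, b, rcond=None)[0]
    return w[:d], w[d:]                       # (w1, w2)

rng = np.random.default_rng(0)
true_w = np.array([1.0, 2.0, 3.0])
X2 = rng.normal(size=(200, 3)); y2 = X2 @ true_w + 0.5   # large, biased set
X1 = rng.normal(size=(10, 3));  y1 = X1 @ true_w          # small set
w1, w2 = joint_fit(X1, y1, X2, y2, lam=5.0)
```

Larger `lam` pulls the two models' predictions together, letting the small-set model borrow strength from the larger set.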
Visual integrity needs to be verified when sending a picture: various received images are no longer original, and a small change to the pixels is not detectable by eye, so integrity validation is very important. A picture captured by a camera has two dimensions, described in pixels as width and length. This study validates all of the pixel data, i.e., the color intensity, in both dimensions: if a pixel is modified, the method gives a different hash value. The validator analyzes the pixels in every layer (red, green and blue) to ensure the transmitted data is correct; even a slight change in the pixels changes the computed value. This is very useful for comparing an image before and after transmission.
In a distributed medical system, building cross-site records while maintaining appropriate patient anonymity is essential. The distributed databases contain information about the same individuals, often described using the same variables, which frequently fail to match due to accidental distortions. In such cases, record linkage methods are used to find records that correspond to the same individual in order to create a consistent database.
Our goal was to find a solution to this problem. In this paper, we propose an anonymous identifier, based on combinations of the first two letters of the surname and name, the date of birth, and gender, which allows a de-identified merged dataset to be built from the multiple databases of a distributed medical system.
TWO PARTY HIERARCHICAL CLUSTERING OVER HORIZONTALLY PARTITIONED DATA SET (IJDKP)
Data mining is a task in which data is extracted from a large database and put into an understandable form or structure so that it can be used further. In this paper we present an approach by which the concept of hierarchical clustering is applied over a horizontally partitioned data set. We also explain the required algorithms, such as hierarchical clustering and algorithms for finding the minimum closest cluster, as well as two-party computation. Since privacy of data is of utmost importance these days, we present an approach for applying privacy preservation between two parties that distribute their data horizontally, and we describe the hierarchical clustering that we apply in the present method.
International Journal of Engineering Research and Development (IJERD), IJERD Editor
An enhanced fuzzy rough set based clustering algorithm for categorical data (eSAT Journals)
Abstract: In today's world everything is done digitally, so we have lots of raw data. These data are useful for predicting future events if we use them properly. Clustering is a technique in which we put closely related data together. Furthermore, there are several types of data: sequential, interval, categorical, etc. In this paper we show the problems with clustering categorical data using rough sets, and how we can overcome them with an improvement.
A statistical data fusion technique in virtual data integration environment (IJDKP)
Data fusion in the virtual data integration environment starts after detecting and clustering duplicated records from the different integrated data sources. It refers to the process of selecting or fusing attribute values from the clustered duplicates into a single record representing the real-world object. In this paper, a statistical technique for data fusion is introduced based on probabilistic scores from both the data sources and the clustered duplicates.
A Heterogeneous Population-Based Genetic Algorithm for Data Clustering (ijeei-iaes)
As a primary data mining method for knowledge discovery, clustering is a technique of classifying a dataset into groups of similar objects. The most popular method for data clustering, K-means, suffers from the drawback of requiring the number of clusters and their initial centers, which must be provided by the user. In the literature, several methods have been proposed, in the form of k-means variants, genetic algorithms, or combinations of the two, for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers is still the main challenge in clustering. In this paper we present an approach to automatically generate these parameters to achieve optimal clusters, using a modified genetic algorithm operating on varied individual structures and using a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.
Improved Slicing Algorithm For Greater Utility In Privacy Preserving Data Pub... (Waqas Tariq)
Several algorithms and techniques have been proposed in recent years for the publication of sensitive microdata. However, there is a trade-off to be considered between the level of privacy offered and the usefulness of the published data. Recently, slicing was proposed as a novel technique for increasing the utility of an anonymized published dataset by partitioning the dataset vertically and horizontally. This work proposes a novel technique to increase the utility of a sliced dataset even further by allowing overlapped clustering while maintaining the prevention of membership disclosure. It is further shown that using an alternative algorithm to Mondrian increases the efficiency of slicing. This paper shows through workload experiments that these improvements help preserve data utility better than traditional slicing.
Presentation for Network Biology SIG 2013 by Gang Su, University of Michigan, USA. “CoolMap Cytoscape App: Flexible Multi-scale Heatmap-Driven Molecular Network Exploration”
When spatial data are distributed across multiple servers, there is an obvious difficulty with computing the likelihood function without combining all the data onto one server. Therefore, it would be of interest to compute estimates of the spatial parameters based on decompositions of the spatial field into blocks, each block corresponding to one server. Two methods suggest themselves: a "between blocks" approach, in which each block is reduced to a single observation (or a low-dimensional summary) to facilitate calculation of a likelihood across blocks, or a "within blocks" approach, in which the likelihood is calculated for each block and then combined into an overall likelihood for the full process. In fact, I argue that a hybrid approach that combines both ideas is best. Theoretical calculations are provided for the statistical efficiency of each approach. In conclusion, I will present some thoughts on optimal sampling designs with distributed data.
Research on Haberman dataset also business required document (ManjuYadav65)
The aim was to analyze the Haberman dataset and apply visualization and exploratory data analysis (EDA) in R to understand the data and the outcomes for people suffering from breast cancer: what patterns there were, how time and experience helped patient survival, and how a wrong detection could cost a patient's life.
Gaining Confidence in Signalling and Regulatory Networks (Michael Stumpf)
Mathematical models of signalling and gene regulatory systems are abstractions of much more complicated processes. Even as more and larger data sets become available, we are not able to dispense entirely with mechanistic models of real-world processes; nor should we. However, trying to develop informative and realistic models of such systems typically involves suitable statistical inference methods, domain expertise and a modicum of luck. Except for cases where physical principles provide sufficient guidance, it will generally be possible to come up with a large number of potential models that are compatible with a given biological system and any finite amount of data generated from experiments on that system.
Here I will discuss how we can systematically evaluate potentially vast sets of mechanistic candidate models in light of experimental and prior knowledge about biological systems. This enables us to evaluate quantitatively the dependence of model inferences and predictions on the assumed model structures. Failure to consider the impact of structural uncertainty introduces biases into the analysis and potentially gives rise to misleading conclusions.
Modeling of neural image compression using gradient descent technology (theijes)
Similar to Simplicial closure and higher-order link prediction --- SIAMNS18 (20)
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. The Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence, and leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools, so you can effortlessly explore, discover, and access the data you need and focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Simplicial closure and higher-order link prediction --- SIAMNS18
1. 1
Joint work with
Rediet Abebe & Jon Kleinberg (Cornell)
Michael Schaub & Ali Jadbabaie (MIT)
Simplicial closure and higher-order link prediction
Austin R. Benson · Cornell
SIAM NS · July 13, 2018
Paper. arXiv 1802.06916
Data. bit.ly/sc-holp-data
Code. bit.ly/sc-holp-code
Slides. bit.ly/arb-SIAMAN18
2. Canonical networks are everywhere, but with a hidden purpose!
2
Collaboration: nodes are people/groups; edges link entities working together.
Communications: nodes are people/accounts; edges show info. exchange.
Physical proximity: nodes are people/animals; edges link those that interact in close proximity.
Drug compounds: nodes are substances; an edge connects substances that appear in the same drug.
3. Real-world systems are composed of “higher-order” interactions that we often reduce to pairwise ones.
3
Collaboration: nodes are people/groups; teams are made up of small groups.
Communications: nodes are people/accounts; emails often have several recipients, not just one.
Physical proximity: nodes are people/animals; people often gather in small groups.
Drug compounds: nodes are substances; drugs are made up of several substances.
4. There are many ways to mathematically represent the higher-order structure present in relational data.
4
• Hypergraphs [Berge 89]
• Set systems [Frankl 95]
• Affiliation networks [Feld 81, Newman-Watts-Strogatz 02]
• Multipartite networks [Lambiotte-Ausloos 05, Lind-Herrmann 07]
• Abstract simplicial complexes [Lim 15, Osting-Palande-Wang 17]
• Multilayer networks [Kivelä+ 14, Boccaletti+ 14, many others…]
• Meta-paths [Sun-Han 12]
• Motif-based representations [Benson-Gleich-Leskovec 15, 17]
• …
⟶ All lead to their own algorithms and methods for understanding dynamics.
Data representation is not the problem.
The problem is how do we evaluate and compare models and algorithms?
5. We propose “higher-order link prediction” as a similar framework for evaluation of higher-order models.
5
t1 : {1, 2, 3, 4}
t2 : {1, 3, 5}
t3 : {1, 6}
t4 : {2, 6}
t5 : {1, 7, 8}
t6 : {3, 9}
t7 : {5, 8}
t8 : {1, 2, 6}
Data.
Observe simplices up to some time t. Using this data, we want to predict what groups of > 2 nodes will appear in a simplex in the future.
[Figure: timeline of simplices on nodes 1-9 up to time t]
We predict structure that classical link
prediction would not even consider!
Possible applications
• Novel combinations of drugs for treatments.
• Group chat recommendation in social networks.
• Team formation.
6. We collected many datasets of timestamped simplices, where each simplex is a subset of nodes.
6
1. Coauthorship in different domains.
2. Emails with multiple recipients.
3. Tags on Q&A forums.
4. Threads on Q&A forums.
5. Contact/proximity measurements.
6. Musical artist collaboration.
7. Substance makeup and classification codes applied to drugs the FDA examines.
8. U.S. Congress committee memberships and bill sponsorship.
9. Combinations of drugs seen in patients in ER visits.
https://math.stackexchange.com/q/80181
bit.ly/sc-holp-data
7. Thinking of higher-order data as a weighted projected graph with “filled-in” structures is a convenient viewpoint.
7
[Figure: the example simplices on nodes 1-9 and the corresponding projected graph with edge weights]
t1 : {1, 2, 3, 4}
t2 : {1, 3, 5}
t3 : {1, 6}
t4 : {2, 6}
t5 : {1, 7, 8}
t6 : {3, 9}
t7 : {5, 8}
t8 : {1, 2, 6}
Data. Pictures to have in mind.
Projected graph W.
Wij = # of simplices containing nodes i and j.
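The projected-graph construction above is straightforward to code. A minimal sketch on the running example (the pair-count dictionary is my representation choice, not from the slides):

```python
from itertools import combinations
from collections import Counter

# Timestamped simplices t1..t8 from the running example.
simplices = [{1, 2, 3, 4}, {1, 3, 5}, {1, 6}, {2, 6},
             {1, 7, 8}, {3, 9}, {5, 8}, {1, 2, 6}]

# W[(i, j)] = number of simplices containing both i and j.
W = Counter()
for s in simplices:
    for i, j in combinations(sorted(s), 2):
        W[(i, j)] += 1

print(W[(1, 2)])  # nodes 1 and 2 co-appear in t1 and t8 -> 2
print(W[(5, 8)])  # nodes 5 and 8 co-appear only in t7 -> 1
```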
9. 9
[Figure: an open triangle vs. a closed triangle on nodes i, j, k]
Warm-up. What’s more common in data?
“Open triangle”: each pair has been in a simplex together, but all 3 nodes have never been in the same simplex.
“Closed triangle”: there is some simplex that contains all 3 nodes.
11. We still have variety with only 3-node simplices.
11
[Scatter plot: fraction of triangles open (y) vs. edge density in the projected graph (x), one marker per dataset (coauthorship, email, tags, threads, contact, congress, NDC, DAWN, music-rap-genius), restricted to exactly 3 nodes per simplex]
12. A simple model can account for open triangle variation.
12
• n nodes; only 3-node simplices; each {i, j, k} included i.i.d. with prob. p = 1/n^b.
• ⟶ always get Θ(pn^3) = Θ(n^(3-b)) closed triangles in expectation.
Proposition (rough).
• b < 1 ⟶ Θ(n^3) open triangles in expectation for large n.
• b > 1 ⟶ Θ(n^(3(2-b))) open triangles in expectation for large n.
• The number of open triangles grows faster than the number of closed triangles for b < 3/2.
[Plot: simulated fraction of triangles open vs. edge density in the projected graph for n = 25, 50, 100, 200 and b = 0.8, 0.82, 0.84, …, 1.8; larger b is a darker marker]
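The random-simplex model above is easy to simulate. A minimal sketch (my own; small n for speed, and `simulate` is a hypothetical helper name):

```python
import random
from itertools import combinations

def simulate(n=30, b=1.0, seed=0):
    """Include each 3-node set i.i.d. with p = 1/n**b, then count
    open vs. closed triangles in the projected graph."""
    rng = random.Random(seed)
    p = 1.0 / n ** b
    closed = {frozenset(t) for t in combinations(range(n), 3)
              if rng.random() < p}
    # Projected graph: an edge for every pair inside an included triple.
    edges = {frozenset(e) for t in closed for e in combinations(t, 2)}
    # Triangles in the projected graph that were never a simplex are open.
    open_count = sum(
        1 for t in combinations(range(n), 3)
        if frozenset(t) not in closed
        and all(frozenset(e) in edges for e in combinations(t, 2)))
    return open_count, len(closed)

open_tri, closed_tri = simulate(n=30, b=1.0)
```

Sweeping b and n in this way reproduces the qualitative picture on the slide: smaller b gives denser projected graphs and relatively more open triangles.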
13. Dataset domain separation also occurs at the local level.
13
• Randomly sample 100 egonets per dataset and measure
log of average degree and fraction of open triangles.
• Logistic regression model to predict domain
(coauthorship, tags, threads, email, contact).
• 75% model accuracy vs. 21% with random guessing.
15. Groups of nodes go through trajectories until finally reaching a “simplicial closure event.”
15
[Figure: the example simplices on nodes 1-9]
t1 : {1, 2, 3, 4}
t2 : {1, 3, 5}
t3 : {1, 6}
t4 : {2, 6}
t5 : {1, 7, 8}
t6 : {3, 9}
t7 : {5, 8}
t8 : {1, 2, 6}
[Figure: trajectory of the triple {1, 2, 6}: edge {1, 2} from {1, 2, 3, 4} at t1, then edges {1, 6} at t3 and {2, 6} at t4, and closure {1, 2, 6} at t8]
For this talk, we will focus on simplicial closure on 3 nodes.
16. Groups of nodes go through trajectories until finally reaching a “simplicial closure event.”
16
Substances in marketed drugs recorded in the National Drug Code directory.
[Figure: closure trajectory of three substances (HIV protease inhibitors, UGT1A1 inhibitors, breast cancer resistance protein inhibitors) through marketed drugs: Reyataz (RedPharm, 2003), Reyataz (Squibb & Sons, 2003), Kaletra (Physicians Total Care, 2006), Promacta (GSK 25mg, 2008), Promacta (GSK 50mg, 2008), Kaletra (DOH Central Pharmacy, 2009), Evotaz (Squibb & Sons, 2015), with tie strengths labeled 1 or 2+]
We bin weighted edges into “weak” and “strong” ties in the projected graph W.
Wij = # of simplices containing nodes i and j.
• Weak ties. Wij = 1 (exactly one simplex contains i and j).
• Strong ties. Wij ≥ 2 (at least two simplices contain i and j).
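A sketch of the weak/strong binning on the running example (the set-based representation is my choice, not from the slides):

```python
from itertools import combinations
from collections import Counter

simplices = [{1, 2, 3, 4}, {1, 3, 5}, {1, 6}, {2, 6},
             {1, 7, 8}, {3, 9}, {5, 8}, {1, 2, 6}]

# W maps each node pair to the number of simplices containing it.
W = Counter(frozenset(e) for s in simplices
            for e in combinations(s, 2))

weak = {e for e, w in W.items() if w == 1}    # exactly one simplex
strong = {e for e, w in W.items() if w >= 2}  # at least two simplices

print(frozenset({1, 2}) in strong)  # True: {1,2,3,4} and {1,2,6}
```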
18. Simplicial closure depends on structure in projected graph.
18
• First 80% of the data (in time) ⟶ record configurations of triplets not in closed triangle.
• Remainder of data ⟶ find fraction that are now closed triangles.
Increased edge density
increases closure probability.
Increased tie strength
increases closure probability.
Tension between edge
density and tie strength.
Left and middle observations are consistent with theory and empirical studies of social networks.
[Granovetter 73; Leskovec+ 08; Backstrom+ 06; Kossinets-Watts 06]
20. We propose “higher-order link prediction” as a similar framework for evaluating higher-order models.
t1 : {1, 2, 3, 4}
t2 : {1, 3, 5}
t3 : {1, 6}
t4 : {2, 6}
t5 : {1, 7, 8}
t6 : {3, 9}
t7 : {5, 8}
t8 : {1, 2, 6}
Data.
Observe simplices up to some time t. Using this data, we want to predict which groups of > 2 nodes will appear in a simplex in the future.
We predict structure that classical link
prediction would not even consider!
21. Our structural analysis tells us what we should be looking at for prediction.
1. Edge density is a positive indicator.
⟶ focus our attention on predicting which open
triangles become closed triangles.
2. Tie strength is a positive indicator.
⟶ various ways of incorporating this information
[Diagram: open triangle on nodes i, j, k with edge weights Wij, Wjk, Wik.]
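A minimal sketch of enumerating open triangles (all three pairwise edges present in the projected graph, but no simplex ever contained all three nodes); the code and names are my own:

```python
from itertools import combinations

def open_triangles(simplices):
    """Triples (i, j, k) whose three edges all appear in the projected
    graph but that never appear together in one simplex."""
    edges, closed = set(), set()
    for s in simplices:
        nodes = sorted(set(s))
        edges.update(frozenset(p) for p in combinations(nodes, 2))
        closed.update(frozenset(t) for t in combinations(nodes, 3))
    nbrs = {}
    for e in edges:
        u, v = tuple(e)
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    out = set()
    for i in nbrs:
        for j, k in combinations(sorted(nbrs[i]), 2):
            # i < j < k so each triangle is reported once.
            if i < j and k in nbrs[j] and frozenset((i, j, k)) not in closed:
                out.add((i, j, k))
    return out
```

On the slide-15 example, {1, 5, 8} is the unique open triangle: edges {1,5}, {1,8}, {5,8} all exist, but no simplex contains all three nodes.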
22. For every open triangle, we assign a score function on the first 80% of data based on structural properties.
Four broad classes of score functions for an open triangle.
Score s(i, j, k)…
1. is a function of Wij, Wjk, Wik
arithmetic mean, geometric mean, etc.
2. looks at common neighbors of the three nodes.
generalized Jaccard, Adamic-Adar, etc.
3. uses “whole-network” similarity scores on projected graph
sum of PageRank or Katz scores amongst edges
4. is learned from data
train a logistic regression model with features
After computing scores, predict that open triangles with highest
scores will be closed triangles in final 20% of data.
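The rank-and-evaluate step can be sketched in a few lines (illustrative code with made-up scores and outcomes, not the paper's evaluation harness):

```python
def precision_at_k(open_tris, score, closed_later, k):
    """Rank open triangles by score; return the fraction of the top k
    that actually close in the held-out final 20% of the data."""
    top = sorted(open_tris, key=score, reverse=True)[:k]
    return sum(t in closed_later for t in top) / k

# Toy illustration: three open triangles, hypothetical scores,
# and one that closes in the held-out period.
tris = [(1, 2, 3), (1, 2, 4), (2, 3, 5)]
s = {(1, 2, 3): 3.0, (1, 2, 4): 1.0, (2, 3, 5): 2.0}
p = precision_at_k(tris, s.get, {(1, 2, 3)}, k=2)
```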
24. A few lessons learned from applying all of these ideas.
1. We can predict pretty well on all datasets using some method.
→ 4x to 107x better than random with respect to mean average precision, depending on the dataset/method.
2. Thread co-participation and co-tagging on stack exchange are
consistently easy to predict.
3. Simply averaging Wij, Wjk, and Wik consistently performs well.
25. Generalized means of edge weights are often good predictors of new 3-node simplices appearing.
[Figure: relative performance of the generalized-mean predictor as a function of p (from -4 to 4) on each dataset: coauthorship (DBLP, MAG geology, MAG history), tags and threads (stack-overflow, math-sx, ask-ubuntu), email (Enron, Eu), contact (high school, primary school), congress (bills, committees), NDC (substances, classes), DAWN, and music-rap-genius; the harmonic, geometric, and arithmetic means are marked.]
Good performance from this local information is a deviation from classical link prediction, where methods that use long paths (e.g., PageRank) perform well [Liben-Nowell & Kleinberg 07].
For structures on k nodes, the subsets of size k-1 contain rich information only when k > 2.
score_p(i, j, k) = (Wij^p + Wjk^p + Wik^p)^(1/p)
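A small sketch of the generalized-mean score (written here as a proper mean with a 1/3 factor; for any fixed p this only rescales the slide's formula by a constant, so the ranking of open triangles is unchanged):

```python
from math import prod

def score_p(wij, wjk, wik, p):
    """Generalized p-mean of the three edge weights.
    p = -1: harmonic mean, p -> 0: geometric mean, p = 1: arithmetic mean."""
    ws = (wij, wjk, wik)
    if p == 0:
        return prod(ws) ** (1 / 3)  # limit as p -> 0
    return (sum(w ** p for w in ws) / 3) ** (1 / p)
```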
26. There are lots of opportunities in studying this data.
1. Higher-order data is pervasive!
We have ways to represent the data, and higher-order link prediction is a general framework for comparing models and methods.
2. There is rich static and temporal structure in the datasets we collected.
3. Do you have a better model, algorithm, or data representation?
Great! Show us by out-performing these baselines.
Data. bit.ly/sc-holp-data
Code. bit.ly/sc-holp-code
27. Simplicial closure and higher-order link prediction.
Paper. arXiv 1802.06916
Data. bit.ly/sc-holp-data
Code. bit.ly/sc-holp-code
Slides. bit.ly/arb-SIAMAN18
Austin R. Benson
http://cs.cornell.edu/~arb
@austinbenson
arb@cs.cornell.edu
THANKS!
28. Most open triangles do not come from asynchronous
temporal behavior.
In 61.1% to 97.4% of open triangles, all three pairs of edges have an overlapping period of activity
⟶ by Helly’s theorem, there is then a single period of activity shared by all 3 edges.
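The 1-D case of Helly's theorem used here — if the three edges' activity intervals pairwise overlap, they share a common window — can be checked directly (a self-contained sketch with made-up intervals):

```python
from itertools import combinations

def pairwise_overlap(a, b):
    """Closed intervals (start, end) overlap iff neither ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]

def common_overlap(intervals):
    """A common intersection exists iff the latest start <= the earliest end."""
    return max(s for s, _ in intervals) <= min(e for _, e in intervals)

# Hypothetical activity periods of the three edges of an open triangle.
ivals = [(0, 5), (3, 9), (4, 7)]
all_pairs = all(pairwise_overlap(a, b) for a, b in combinations(ivals, 2))
```

Here all three pairs overlap, and indeed the intervals share the common window [4, 5].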
Fraction of open triangles by # of pairwise activity overlaps (columns: 0, 1, 2, 3):
Dataset   # open triangles   0   1   2   3
coauth-DBLP 1,295,214 0.012 0.143 0.123 0.722
coauth-MAG-history 96,420 0.002 0.055 0.059 0.884
coauth-MAG-geology 2,494,960 0.010 0.128 0.109 0.753
tags-stack-overflow 300,646,440 0.002 0.067 0.071 0.860
tags-math-sx 2,666,353 0.001 0.040 0.049 0.910
tags-ask-ubuntu 3,288,058 0.002 0.088 0.085 0.825
threads-stack-overflow 99,027,304 0.001 0.034 0.037 0.929
threads-math-sx 11,294,665 0.001 0.038 0.039 0.922
threads-ask-ubuntu 136,374 0.000 0.020 0.023 0.957
NDC-substances 1,136,357 0.020 0.196 0.151 0.633
NDC-classes 9,064 0.022 0.191 0.136 0.652
DAWN 5,682,552 0.027 0.216 0.155 0.602
congress-committees 190,054 0.001 0.046 0.058 0.895
congress-bills 44,857,465 0.003 0.063 0.113 0.821
email-Enron 3,317 0.008 0.130 0.151 0.711
email-Eu 234,600 0.010 0.131 0.132 0.727
contact-high-school 31,850 0.000 0.015 0.019 0.966
contact-primary-school 98,621 0.000 0.012 0.014 0.974
music-rap-genius 70,057 0.028 0.221 0.141 0.611
29. Simplicial closure probability on 4 nodes has similar behavior to that on 3 nodes, just “up one dimension.”
LA/OPT Seminar
Take the first 80% of the data (in time), record the configuration of every 4 nodes, and compute the fraction that simplicially close in the final 20% of the data.
• Increased edge density increases closure probability.
• Increased simplicial tie strength increases closure probability.
• Tension between simplicial density and simplicial tie strength.