F. Petroni, L. Del Corro and R. Gemulla:
"CORE: Context-Aware Open Relation Extraction with Factorization Machines."
In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
Abstract: "We propose CORE, a novel matrix factorization model that leverages contextual information for open relation extraction. Our model is based on factorization machines and integrates facts from various sources, such as knowledge bases or open information extractors, as well as the context in which these facts have been ob- served. We argue that integrating contextual information—such as metadata about extraction sources, lexical context, or type information—significantly improves prediction performance. Open information extractors, for example, may produce extractions that are unspecific or ambiguous when taken out of context. Our experimental study on a large real-world dataset indicates that CORE has significantly better prediction performance than state-of- the-art approaches when contextual information is available."
CORE: Context-Aware Open Relation Extraction with Factorization Machines
1. CORE: Context-Aware Open Relation Extraction with Factorization Machines
Fabio Petroni, Luciano Del Corro, Rainer Gemulla
2. Open relation extraction
- Open relation extraction is the task of extracting new facts for a potentially unbounded set of relations from various sources, such as natural language text and knowledge bases.
3. Input data: facts from natural language text
- Example sentence: "Enrico Fermi was a professor in theoretical physics at Sapienza University of Rome."
- An open information extractor extracts all facts in the text. The result is a surface fact such as "professor at"(Fermi, Sapienza): a surface relation ("professor at") applied to a tuple with a subject (Fermi) and an object (Sapienza).
4. Input data: facts from knowledge bases
- Knowledge bases contribute KB facts such as employee(Fermi, Sapienza): a KB relation (employee) applied to a tuple of entities.
- Entities in surface facts like "professor at"(Fermi, Sapienza) are connected to KB entities through entity links, obtained e.g. with a string-matching heuristic.
5. Relation extraction techniques taxonomy
[Taxonomy diagram:]
- closed relation extraction: distant supervision over a set of predefined relations (in-KB); a "black and white" approach
- open relation extraction: relation clustering, and latent factor models:
  - tensor completion: RESCAL (Nickel et al., 2011), PITF (Drumond et al., 2012); limited scalability with the number of relations, large prediction space
  - matrix completion, with a restricted prediction space: NFE (Riedel et al., 2013), in-KB; CORE (Petroni et al., 2015), out-of-KB
6. Matrix completion for open relation extraction
The input is modeled as a tuples x relations matrix; a 1 marks an observed fact ("born in", "professor at", and "mayor of" are surface relations, employee is a KB relation):

                     born in   professor at   employee   mayor of
  (Caesar,Rome)         1
  (Fermi,Rome)          1
  (Fermi,Sapienza)                   1             1
  (de Blasio,NY)                                               1
7. Matrix completion for open relation extraction
The same matrix with unobserved entries marked "?"; the task is to predict which of these missing facts hold:

                     born in   professor at   employee   mayor of
  (Caesar,Rome)         1           ?             ?          ?
  (Fermi,Rome)          1           ?             ?          ?
  (Fermi,Sapienza)      ?           1             1          ?
  (de Blasio,NY)        ?           ?             ?          1
8. Matrix factorization
- Learn latent semantic representations: a latent factor vector for each tuple and each relation; a fact is scored by the dot product of the two.
- Leverage the latent representations to predict new facts. On the slide, "professor at" has latent vector (0.8, -0.5) and (Fermi, Sapienza) has (0.9, -0.3), over dimensions loosely interpretable as "related with science" and "related with sport".
- In real applications latent factors are uninterpretable.
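In code, the slide's toy example amounts to the following sketch (numbers from the slide; plain matrix factorization, before context is added):

```python
import numpy as np

# Toy latent factor vectors from the slide; the two dimensions are
# loosely read as "related with science" and "related with sport".
f_relation = np.array([0.8, -0.5])   # "professor at"
f_tuple = np.array([0.9, -0.3])      # (Fermi, Sapienza)

# Plain matrix factorization scores a fact by a dot product.
print(f_relation @ f_tuple)          # 0.72 + 0.15 = 0.87
```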
9. Matrix factorization
CORE integrates contextual information into such models to improve prediction performance.
10. Contextual information
- Example sentence: "Tom Peloso joined Modest Mouse to record their fifth studio album."
- The extracted surface fact "join"(Peloso, Modest Mouse) has an unspecific relation when taken out of context.
- Contextual information for the fact includes: entity types (a named entity recognizer labels each entity with a coarse-grained type, here person and organization), the article topic, and words from the sentence (record, album).
11. Contextual information
The same example raises the question CORE addresses: how to incorporate contextual information within the model?
12. CORE - latent representations of variables
- CORE associates a latent factor vector f_v with each variable v ∈ V.
- In the example, the variables are the tuple (Peloso, Modest Mouse), the relation join, the entities Peloso and Modest Mouse, and the context items person, organization, record, album; each has its own latent factor vector.
13. CORE - modeling facts
Columns (grouped): relations r1="born in"(x,y), r2=employee(x,y), r3="professor at"(x,y) (surface and KB); tuples t1=(Caesar,Rome), t2=(Fermi,Rome), t3=(Fermi,Sapienza); entities Caesar, Rome, Fermi, Sapienza; context: tuple types (person,organization), (person,location) and tuple topics physics, history.

        r1  r2  r3 |  t1  t2  t3 | Caesar Rome Fermi Sapienza | (p,o) (p,l) | phys hist
  x1     1   0   0 |   1   0   0 |   0.5  0.5    0      0     |   0     1   |   0    1
  x2     0   1   0 |   0   0   1 |    0    0   0.5    0.5     |   1     0   |   1    0
  x3     1   0   0 |   0   1   0 |    0   0.5  0.5      0     |   0     1   |  0.6  0.4
  x4     0   0   1 |   0   0   1 |    0    0   0.5    0.5     |   1     0   |   1    0

- CORE models the input data in terms of a matrix in which each row corresponds to a fact x and each column to a variable v.
- Columns are grouped according to the type of the variables.
- In each row the values of each column group sum up to unity.
14. CORE - modeling context
(Same matrix as in the previous slide.)
- CORE aggregates and normalizes contextual information by tuple:
  - a fact can be observed multiple times with different context;
  - there is no context for new facts (never observed in the input).
- This approach allows CORE to provide comprehensive contextual information for both observed and unobserved facts.
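To make the row encoding concrete, here is a minimal Python sketch of how one such row could be built (function and variable names are illustrative assumptions, not from the CORE code base):

```python
from collections import Counter

def encode_fact(relation, tup, context_counts):
    """Encode one fact as a sparse row: each column group
    (relation, tuple, entities, context) sums to one."""
    row = {("rel", relation): 1.0,        # the fact's relation
           ("tuple", tup): 1.0}           # the fact's tuple
    for entity in tup:                    # subject and object entities
        row[("ent", entity)] = 1.0 / len(tup)    # e.g. 0.5 each
    total = sum(context_counts.values())
    for var, count in context_counts.items():    # context is aggregated
        row[("ctx", var)] = count / total         # and normalized by tuple
    return row

# Row x3 above: "born in"(Fermi, Rome); if the tuple was observed with
# topic counts physics:3 and history:2, the topics normalize to 0.6/0.4.
x3 = encode_fact('"born in"', ("Fermi", "Rome"),
                 Counter({"physics": 3, "history": 2}))
print(x3)
```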
15. CORE - factorization model
(Same matrix as in slide 13.)
- CORE uses factorization machines as the underlying framework.
- It associates a score s(x) with each fact x, computed from the weighted pairwise interactions of the latent factor vectors:

  s(x) = Σ_{v1 ∈ V} Σ_{v2 ∈ V \ {v1}} x_{v1} x_{v2} (f_{v1}ᵀ f_{v2})
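A minimal sketch of this scoring function, reusing the sparse-row idea from the snippet above (latent vectors are random here; in CORE they are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10  # latent dimensionality

def fm_score(row, factors):
    """s(x): weighted pairwise interactions of latent factor vectors.
    Each unordered pair is counted once; the slide's double sum over
    ordered pairs differs only by a constant factor of two."""
    items = list(row.items())
    return sum(x1 * x2 * (factors[v1] @ factors[v2])
               for i, (v1, x1) in enumerate(items)
               for v2, x2 in items[i + 1:])

row = {"rel": 1.0, "tuple": 1.0, "ent:Fermi": 0.5, "ent:Rome": 0.5,
       "topic:physics": 0.6, "topic:history": 0.4}
factors = {v: rng.normal(scale=0.1, size=k) for v in row}
print(fm_score(row, factors))
```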
16. CORE - prediction
Goal: produce a ranked list of tuples for each relation.
- The rank reflects the likelihood that the corresponding fact is true.
- To generate this ranked list (sketched in code below):
  - fix a relation r;
  - retrieve all tuples t such that the fact r(t) is not observed;
  - add tuple context;
  - rank the unobserved facts by their scores.
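A sketch of the procedure, assuming the hypothetical `encode_fact` and `fm_score` helpers from the earlier snippets:

```python
def rank_tuples(relation, all_tuples, observed, tuple_context,
                factors, top_k=100):
    """Rank unobserved facts r(t) for one relation by their FM score."""
    scored = []
    for t in all_tuples:
        if (relation, t) in observed:
            continue                       # keep only unobserved facts
        row = encode_fact(relation, t, tuple_context[t])  # add tuple context
        scored.append((fm_score(row, factors), t))
    scored.sort(reverse=True)              # highest score = most likely true
    return scored[:top_k]
```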
17. CORE - parameter estimation
- Parameters: Θ = { b_v, f_v | v ∈ V }.
- All observations are positive; there is no negative training data.
- CORE therefore uses Bayesian personalized ranking under an open-world assumption: for each observed fact x (e.g. "professor at"(Fermi,Sapienza), with tuple entities Fermi and Sapienza and tuple context person, organization, physics), negative evidence x⁻ is sampled (e.g. "professor at"(Caesar,Rome), with entities Caesar and Rome and context person, location, history).
- Pairwise approach: x is assumed more likely to be true than x⁻. Maximize

  Σ_x σ(s(x) − s(x⁻))

  (with σ the logistic function, as in Bayesian personalized ranking) by stochastic gradient ascent:

  Θ ← Θ + η ∇_Θ(·)
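One BPR-style ascent step, sketched under the same assumptions as the earlier snippets (biases b_v and regularization omitted; `fm_score` as above; the analytic FM gradient is d s(x)/d f_v = x_v · Σ_{u≠v} x_u f_u):

```python
import numpy as np

def grad_s(row, factors):
    """Gradient of the FM score w.r.t. each latent vector in the row:
    d s(x) / d f_v = x_v * sum over u != v of x_u * f_u."""
    total = sum(x * factors[v] for v, x in row.items())
    return {v: x * (total - x * factors[v]) for v, x in row.items()}

def bpr_step(factors, pos_row, neg_row, lr=0.05):
    """Stochastic gradient ascent on ln sigma(s(x) - s(x-))."""
    delta = fm_score(pos_row, factors) - fm_score(neg_row, factors)
    weight = 1.0 / (1.0 + np.exp(delta))        # = 1 - sigma(delta)
    for v, g in grad_s(pos_row, factors).items():
        factors[v] = factors[v] + lr * weight * g   # pull x up
    for v, g in grad_s(neg_row, factors).items():
        factors[v] = factors[v] - lr * weight * g   # push x- down
```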
18. Experiments - dataset
- 440k surface facts extracted from the New York Times corpus; 15k KB facts from Freebase; entity mentions linked using string matching.
- Contextual information, with a letter indicating each kind considered:
  - article metadata (m): news desk (e.g., foreign desk), descriptors (e.g., finances), online section (e.g., sports), section (e.g., a, d), publication year;
  - entity types (t): person, organization, location, miscellaneous;
  - bag-of-words (w): the sentences from which the fact has been extracted.
19. Experiments - methodology
- To keep experiments feasible, we consider a subsample of 10k tuples, 19 Freebase relations, and 10 surface relations.
- For each relation and method:
  - we rank the tuple subsample;
  - we consider the top-100 predictions and label them manually.
- Evaluation metrics: number of true facts and MAP (quality of the ranking).
- Methods:
  - PITF: tensor factorization method;
  - NFE: matrix completion method (context-agnostic);
  - CORE: uses relations, tuples, and entities as variables;
  - CORE+m, +t, +w, +mt, +mtw: CORE with the corresponding contextual information added.
22. Anecdotal results
author(x,y) — ranked list of tuples:
1. (Winston Groom, Forrest Gump)
2. (D. M. Thomas, White Hotel)
3. (Roger Rosenblatt, Life Itself)
4. (Edmund White, Skinned Alive)
5. (Peter Manso, Brando: The Biography)
Similar relations: 0.98 "reviews x by y"(x,y), 0.97 "book by"(x,y), 0.95 "author of"(x,y), 0.95 "'s novel"(x,y), 0.95 "'s book"(x,y)

"scientist at"(x,y) — ranked list of tuples:
1. (Riordan Roett, Johns Hopkins University)
2. (Dr. R. M. Roberts, University of Missouri)
3. (Linda Mayes, Yale University)
4. (Daniel T. Jones, Cardiff Business School)
5. (Russell Ross, University of Iowa)
Similar relations: 0.87 "scientist"(x,y), 0.84 "scientist with"(x,y), 0.80 "professor at"(x,y), 0.79 "scientist for"(x,y), 0.78 "neuroscientist at"(x,y)

- Semantic similarity of relations is one aspect of our model.
- Similar relations are treated differently in different contexts.
23. Conclusion
- CORE: a matrix factorization model for open relation extraction that incorporates contextual information.
- Based on factorization machines and an open-world assumption.
- Extensible model: additional contextual information can be integrated when available.
- The experimental study suggests that exploiting context can significantly improve prediction performance.
- Source code, datasets, and supporting material are available at https://github.com/fabiopetroni/CORE
24. Thank you! Questions?
Fabio Petroni, Sapienza University of Rome, Italy
Current position: PhD Student in Engineering in Computer Science
Research interests: data mining, machine learning, big data
petroni@dis.uniroma1.it