Probabilistic Relational Models (PRMs) are directed probabilistic graphical models representing a factored joint distribution over a set of random variables for relational datasets.
While regular PRMs define probabilistic dependencies between classes' descriptive attributes, an extension called PRM with Reference Uncertainty (PRM-RU) additionally manages link uncertainty between objects by adding random variables called selectors. To avoid variables with large domains, each selector is associated with a partition function mapping objects to a set of clusters, and the selector's distribution is defined over this set of clusters.
In PRM-RU, the definition of partition functions constrains them to be learned only from the attributes of the individuals involved, and to assign the same cluster to any pair of individuals with the same attribute values. This constraint rests on a strong assumption that does not generalize and can lead to relationship data being underused during learning. For these reasons, we relax this constraint in this paper and propose a different partition-function learning approach based on relationship data clustering. We empirically show that this approach provides better results than attribute-based learning when the relationship topology is independent of the attribute values of the entities involved, and that it gives close results whenever the attribute assumption holds.
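To make the proposal concrete before the slides, here is a minimal sketch of relationship-based partition learning with non-negative matrix factorization (NMF), written against scikit-learn; the binary student-by-course relation matrix and the dominant-factor cluster assignment are illustrative assumptions, not necessarily the paper's exact procedure.

```python
# Sketch: learn a partition function from relationship data via NMF.
# Illustrative assumptions: the relation is a binary student x course
# matrix, and each object is mapped to the cluster of its strongest
# latent factor.
import numpy as np
from sklearn.decomposition import NMF

# R[i, j] = 1 if student i is registered to course j.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

k = 2  # number of clusters in the selector's domain
model = NMF(n_components=k, init="nndsvda", random_state=0)
W = model.fit_transform(R)  # student factors, shape (n_students, k)
H = model.components_       # course factors, shape (k, n_courses)

# Partition functions: map each object to its dominant latent cluster.
student_cluster = W.argmax(axis=1)  # e.g. [0, 0, 1, 1] up to label permutation
course_cluster = H.argmax(axis=0)
print(student_cluster, course_cluster)
```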
Comparative analysis of algorithms_MADI (Sayed Rahman)
The document summarizes research comparing algorithms and methods for categorizing texts. It discusses modifications made to naive Bayes, SVM, and PrTFIDF algorithms to improve classification accuracy. Preliminary processing steps like morphological analysis and parsing were also explored. Experimental results on two test collections showed the modified Bayesian algorithm achieved the best accuracy at 45.46%, outperforming PrTFIDF and SVM. Further areas of potential improvement are identified.
Performance analysis of machine learning approaches in software complexity prediction (Sayed Mohsin Reza)
This video contains the presentation at TCCE 2020 by Sayed Mohsin Reza on his paper titled "Performance Analysis of Machine Learning Approaches in Software Complexity Prediction"
Keywords: Software Complexity, Software Quality, Machine Learning, Software Design, Software Reliability, etc.
Authors:
1. Sayed Mohsin Reza, Ph.D. Student, University of Texas
2. Mahfujur Rahman, Lecturer, Daffodil International University
3. Hasnat Parvez, Student, Jahangirnagar University
4. Omar Badreddin, Professor, University of Texas
5. Shamim Al Mamun, Professor, Jahangirnagar University
Abstract: Software design is one of the core concepts in software engineering, covering insights into software evolution, reliability, and maintainability. Effective software design facilitates software reliability and better quality management during development, which reduces software development cost. It is therefore desirable to detect such issues early. Class complexity is one way of assessing software quality. The objective of this paper is to predict class complexity from source code metrics using Machine Learning (ML) approaches and to compare the performance of those approaches. To do so, we collect ten popular, well-maintained open source repositories and extract 18 complexity-related source code metrics for class-level analysis. First, we apply statistical correlation to find the source code metrics that have the most impact on class complexity. Second, we apply five alternative ML techniques to build complexity predictors and compare their performance. The results show that the following source code metrics have the most impact on class complexity: Depth of Inheritance Tree (DIT), Response For Class (RFC), Weighted Method Count (WMC), Lines of Code (LOC), and Coupling Between Objects (CBO). Evaluating the techniques, the results show that Random Forest (RF) significantly improves accuracy without introducing additional false negatives or false positives, which act as false alarms in complexity prediction.
[slide] A Compare-Aggregate Model with Latent Clustering for Answer Selection (Seoul National University)
CIKM 2019
In this paper, we propose a novel method for the sentence-level answer-selection task, one of the fundamental problems in natural language processing. First, we explore the effect of additional information by adopting a pretrained language model to compute the vector representation of the input text and by applying transfer learning from a large-scale corpus. Second, we enhance the compare-aggregate model by proposing a novel latent clustering method to compute additional information within the target corpus and by changing the objective function from listwise to pointwise. To evaluate the performance of the proposed approaches, experiments are performed with the WikiQA and TRECQA datasets. The empirical results demonstrate the superiority of our proposed approach, which achieves state-of-the-art performance on both datasets.
The document discusses the SQL standard and its components. It describes how SQL is used to define schemas, manipulate data, write queries involving single or multiple tables, and perform other operations. Key topics covered include data definition language, data manipulation language, data types, integrity constraints, queries, subqueries, and set operations in SQL. Examples of SQL commands for creating tables, inserting data, and writing various types of queries are also provided.
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue (Jinho Choi)
The document presents an approach using transformers to learn hierarchical contexts in multiparty dialogue. It proposes new pre-training tasks to improve token-level and utterance-level embeddings for handling dialogue contexts. A multi-task learning approach is introduced to fine-tune the language model for a Friends question answering (FriendsQA) task using dialogue evidence, outperforming BERT and RoBERTa. However, the approach shows no improvement on other character mining tasks from Friends. Future work is needed to better represent speakers and inferences in dialogue.
Reference Scope Identification of Citances Using Convolutional Neural Network (Saurav Jha)
In the task of summarizing a scientific paper, a lot of information about a reference paper stands to be gained from the papers that cite it. Automatically generating the reference scope (the span of cited text) in a reference paper, corresponding to citances (sentences in the citing papers that cite it), has great significance in preparing a structured summary of the reference paper. We treat this task as a binary classification problem, extracting feature vectors from pairs of citances and reference sentences. These features are lexical, corpus-based, surface, and knowledge-based. We extend the feature set employed for reference-citance pair identification in the current state-of-the-art system. Using these features, we present a novel classification approach for this task that employs a deep Convolutional Neural Network along with two boosting ensemble algorithms. We outperform the existing state-of-the-art at distinguishing between cited and non-cited spans of text in the reference paper.
The document proposes a novel ranking method called Fidelity Rank (FRank) that combines the probabilistic ranking framework with the generalized additive model. It introduces a new fidelity loss function to address problems with existing loss functions like cross entropy. FRank was tested on TREC and web search datasets and significantly outperformed other learning to rank algorithms like RankBoost, RankNet and RankSVM in terms of metrics like MAP and NDCG. Future work could involve theoretical analysis of FRank's generalization bounds and combining it with other machine learning techniques.
neural based_context_representation_learning_for_dialog_act_classification (JEE HYUN PARK)
The document presents a neural network model for dialog act classification that incorporates context representations. It uses a CNN to represent each utterance, applies an internal attention mechanism, and models context with RNNs. As baselines, it uses a single-utterance CNN and concatenation of utterances. Results show RNNs better learn context representations and attention mechanisms improve performance, though the optimal attention placement depends on the dataset. The best-performing models outperform the previous state-of-the-art on benchmark datasets.
Towards advanced data retrieval from learning objects repositories (Valentina Paunovic)
This document proposes an advanced search system for learning object repositories. It uses a Steiner trees approach to retrieve groups of related learning objects that satisfy a query, even if no single object matches all query terms. It represents the repository as a sparse weighted graph based on learning object similarities. It also extends the query language with AND and OR operators and presents an algorithm for parsing complex queries. The system aims to enable effective search and reuse of learning materials.
This document summarizes an experiment on measuring relatedness between documents in comparable corpora using distributional similarity measures (DSMs). It finds that DSMs like common tokens (NCT) and Chi-square performed well in filtering out unrelated documents from specialized corpora in English and Italian, but not for the Spanish corpus, which appeared to contain fewer related documents to begin with. The study aims to help automatically describe and evaluate comparable corpora quality by ranking documents based on their relatedness.
This document discusses efficient learning of deterministic finite automata (DFA) from examples. It begins with an introduction to the challenges of learning DFA and different models that have been proposed. The main contributions are: 1) The class of "simple DFA" (those with logarithmic Kolmogorov complexity) is efficiently PAC learnable if examples are drawn from a universal distribution. 2) The entire class of DFA is efficiently PAC learnable under the PACS model where a knowledgeable teacher draws examples randomly. 3) Several other concept learning models can be extended to a probabilistic framework under the PACS model.
Using Knowledge Building Forums in EFL Classrooms - FIETxs2019 (ARGET URV)
1) The document describes a study that examined the impact of using Knowledge Building forums on the development of English language skills for Spanish students.
2) Sixty-seven Spanish students participated in the study, engaging with Knowledge Building forums and completing pre- and post-tests of their English abilities.
3) The results showed that collaborative writing in the forums significantly improved students' English writing skills and comprehension, but did not necessarily improve their vocabulary or specific grammar skills.
LP&IIS2013 PPT. Chinese Named Entity Recognition with Conditional Random Fiel... (Lifeng (Aaron) Han)
LP&IIS 2013 Presentation PPT. Authors: Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao
In Proceedings of the International Conference on Language Processing and Intelligent Information Systems. M.A. Klopotek et al. (Eds.): IIS 2013, LNCS Vol. 7912, pp. 57–68, 17–18 June 2013, Warsaw, Poland. Springer-Verlag Berlin Heidelberg 2013
CIKM14: Fixing grammatical errors by preposition ranking (eXascale Infolab)
The detection and correction of grammatical errors still represent very hard problems for modern error-correction systems. As an example, the top-performing systems at the preposition correction challenge CoNLL-2013 only achieved an F1 score of 17%.
In this paper, we propose and extensively evaluate a series of approaches for correcting prepositions, analyzing a large body of high-quality textual content to capture language usage. Leveraging n-gram statistics, association measures, and machine learning techniques, our system is able to learn which words or phrases govern the usage of a specific preposition. Our approach makes heavy use of n-gram statistics generated from very large textual corpora. In particular, one of our key features is the use of n-gram association measures (e.g., Pointwise Mutual Information) between words and prepositions to generate better aggregated preposition rankings for the individual n-grams.
We evaluate the effectiveness of our approach using cross-validation with different feature combinations and on two test collections created from a set of English language exams and StackExchange forums. We also compare against state-of-the-art supervised methods. Experimental results on the CoNLL-2013 test collection show that our approach to preposition correction achieves ~30% F1, a 13% absolute improvement over the best-performing approach at that challenge.
Parcc public blueprints narrated math 04262013 (Achieve, Inc.)
The document discusses the design of the PARCC summative assessments in mathematics. It outlines three primary task types - Type I, II, and III - that will be used to generate evidence for claims about student performance. Type I tasks assess concepts, skills and procedures and will be machine-scored. Type II tasks assess expressing mathematical reasoning and may include hand-scored responses. Type III tasks assess modeling and applications and may also include hand-scored responses. The Performance Based Assessment will include all three task types, while the End-of-Year Assessment only includes Type I tasks. Evidence statements are used to specify what each task should assess and are derived from the Common Core standards.
Communication systems-theory-for-undergraduate-students-using-matlab (SaifAbdulNabi1)
Dr. Chandana K.K. Jayasooriya received degrees from the Technical University of Berlin and Wichita State University. He is currently an Assistant Professor at the University of Pittsburgh at Johnstown teaching electrical engineering. The document discusses using MATLAB to teach communication systems theory to undergraduate students in a more intuitive way compared to traditional derivations-heavy approaches. It provides an example using amplitude modulation and shows how concepts like modulation, filtering, and demodulation can be demonstrated in MATLAB without requiring an advanced mathematical background.
This document discusses how AI can help advance various scientific fields such as mathematics, quantum chemistry, biology, and more. It provides examples of how machine learning has helped mathematicians develop new theories by analyzing patterns in examples. It also discusses how AI is helping push the limits of density functional theory in quantum chemistry and how AlphaFold uses transformers and protein multiple sequence alignments to predict structures with near experimental accuracy. The conclusion emphasizes not becoming a slave to models and maintaining inspiration.
Chinese Named Entity Recognition with Graph-based Semi-supervised Learning Model (Lifeng (Aaron) Han)
This document summarizes an experiment on using graph-based semi-supervised learning to improve a conditional random field model for Chinese named entity recognition. The experiment used unlabeled data from previous NER tasks to extend the labeled training data via label propagation. This enhanced CRF model was evaluated on a standard test corpus and showed a slight improvement over a closed CRF baseline, particularly for person and organization entities. However, the unlabeled data was not large enough to cover all entity types. Future work could explore using more unlabeled data and optimizing features for the graph construction.
The document discusses various software quality metrics that can be used to assess code, including lines of code, comments, number of methods and fields, coupling, cohesion, inheritance, and cyclomatic complexity. It provides definitions and examples of these metrics, and recommendations on when values may indicate issues, such as methods over 20 lines being difficult to understand or maintain. The metrics can help evaluate the quality, understandability, and maintainability of software.
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding D... (Jinho Choi)
Recent advances in deep learning have increased the demand for neural models in real applications. In practice, these applications often need to be deployed with limited resources while keeping high accuracy. This paper touches the core of neural models in NLP, word embeddings, and presents a new embedding distillation framework that remarkably reduces the dimension of word embeddings without compromising accuracy. A novel distillation ensemble approach is also proposed that trains a highly efficient student model using multiple teacher models. In our approach, the teacher models play a role only during training, so the student model operates on its own without support from the teacher models during decoding, which makes it eighty times faster and lighter than other typical ensemble methods. All models are evaluated on seven document classification datasets and show a significant advantage over the teacher models in most cases. Our analysis depicts insightful transformations of word embeddings under distillation and suggests a future direction for ensemble approaches using neural models.
Combining Committee-Based Semi-supervised and Active Learning and Its Applica... (Mohamed Farouk)
Semi-supervised learning reduces the cost of labeling the training data of a supervised learning algorithm by using unlabeled data together with labeled data to improve performance. Co-Training is a popular semi-supervised learning algorithm that requires multiple redundant and independent sets of features (views). In many real-world application domains, this requirement cannot be satisfied. In this paper, a single-view variant of Co-Training, CoBC (Co-Training by Committee), is proposed, which requires an ensemble of diverse classifiers instead of the redundant and independent views. We then introduce two new learning algorithms, QBC-then-CoBC and QBC-with-CoBC, which combine the merits of committee-based semi-supervised learning and committee-based active learning. An empirical study on handwritten digit recognition is conducted where the random subspace method (RSM) is used to create ensembles of diverse C4.5 decision trees. Experiments show that these two combinations outperform the other non-committee-based ones.
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application... (Lifeng (Aaron) Han)
The document presents a method for unsupervised machine translation evaluation using universal phrase tags. It designs a mapping between phrase tags from different treebanks to 9 universal tags. An unsupervised metric called HPPR is introduced to measure similarity between the universal phrase sequences of the source and translated sentences. Experiments on French-English data show HPPR achieves promising correlations with human judgments without using reference translations.
This document introduces a series of tutorials for metabolomic data analysis. It discusses important goals like hypothesis generation, data acquisition, processing, exploration, classification and prediction. It covers topics like univariate vs multivariate analysis, data quality metrics, clustering, principal component analysis, partial least squares modeling, and biological interpretation through metabolite enrichment and network mapping. The overall document provides a high-level overview of the key concepts and analytical approaches that will be covered in more detail in the tutorial series.
The document discusses several collaborative filtering techniques for making recommendations:
1) Nearest neighbor techniques like k-NN make predictions based on the ratings of similar users. They require storing all user data but can be fast with appropriate data structures.
2) Naive Bayes classifiers treat each item's ratings independently; they make strong assumptions but require less data.
3) Dimensionality reduction techniques like SVD decompose the user-item rating matrix to find latent factors. Weighted SVD handles missing data.
4) Probabilistic models like mixtures of multinomials and aspect models represent additional user metadata but have more parameters.
The document discusses several collaborative filtering techniques for making recommendations, including k-nearest neighbors (kNN), naive Bayes classification, singular value decomposition (SVD), and probabilistic models. It provides examples of how these methods work, such as using ratings from similar users to predict a user's rating for an item (kNN), and decomposing a ratings matrix to capture relationships between users and items (SVD). The techniques vary in their assumptions, complexity, and ability to incorporate additional user/item metadata. Evaluation on new data is important to ensure the methods generalize well beyond the training data.
Learning analytics and accessibility – #calrg 2015 (Martyn Cooper)
Presentation at the Open University's Computers and Learning Research Group (CALRG) Conference 2015 on Learning Analytics and Accessibility - detecting accessibility deficits with Learning Analytics approaches
The document provides an overview of various machine learning algorithms and methods. It begins with an introduction to predictive modeling and supervised vs. unsupervised learning. It then describes several supervised learning algorithms in detail including linear regression, K-nearest neighbors (KNN), decision trees, random forest, logistic regression, support vector machines (SVM), and naive Bayes. It also briefly discusses unsupervised learning techniques like clustering and dimensionality reduction methods.
High Dimensional Biological Data Analysis and Visualization (Dmitry Grapov)
This document discusses metabolomic data analysis techniques for studying diseases. It analyzes over 13,000 biological samples per year using over 160,000 data points per study. Univariate and multivariate statistical analyses are described, with multivariate being preferred. Techniques include principal component analysis, partial least squares discriminant analysis, hierarchical clustering analysis, and pathway enrichment analysis. Visualization and network mapping tools are also discussed to identify relationships between altered metabolites and treatment effects.
This document summarizes a research paper on convolutional restricted Boltzmann machines (CRBMs) for feature learning. The paper proposes using CRBMs to learn hierarchical local feature detectors in an unsupervised and generative manner. CRBMs extend regular restricted Boltzmann machines to incorporate spatial locality. The learned features are evaluated on handwritten digit and human detection tasks, achieving results comparable to state-of-the-art. The paper contributes an approach to generative feature learning using CRBMs that can capture spatial relationships in images.
Factor Analysis and Correspondence Analysis Composite and Indicator Scores of... (Matthew Powers)
Factor analysis and correspondence analysis can be used to generate composite and indicator scores from Likert scale survey data. This allows researchers to distill survey responses down into relevant information about populations and attitudes. The document demonstrates how to calculate these scores in four steps for both factor analysis and correspondence analysis. It also compares the advantages and disadvantages of each method and how the analysis can help organizations think differently about survey data.
Your Classifier is Secretly an Energy based model and you should treat it lik... (Seunghyun Hwang)
Review: Your Classifier is Secretly an Energy based model and you should treat it like one
- by Seunghyun Hwang (Yonsei University, Severance Hospital, Center for Clinical Data Science)
The culmination of my LEVEL Data Analytics program ('19) efforts. I advised Tyton Partners on what natural next steps to take for an emerging research study aligning success rates in higher-ed with virtual learning. All advanced analysis was conducted via R.
The document summarizes Yan Xu's upcoming presentation at the Houston Machine Learning Meetup on dimension reduction techniques. Yan will cover linear methods like PCA and nonlinear methods such as ISOMAP, LLE, and t-SNE. She will explain how these methods work, including preserving variance with PCA, using geodesic distances with ISOMAP, and modeling local neighborhoods with LLE and t-SNE. Yan will also demonstrate these methods on a dataset of handwritten digits. The meetup is part of a broader roadmap of machine learning topics that will be covered in future sessions.
Two strategies for large-scale multi-label classification on the YouTube-8M d... (Dalei Li)
This project participated in the Kaggle YouTube-8M video understanding competition. Four algorithms that can run on a single machine are implemented: multi-label k-nearest neighbor, multi-label radial basis function network (one-vs-rest), multi-label logistic regression, and one-vs-rest multi-layer neural network.
Robert Grossman and Collin Bennett of the Open Data Group discuss building and deploying big data analytic models. They describe the life cycle of a predictive model from exploratory data analysis to deployment and refinement. Key aspects include generating meaningful features from data, building and evaluating multiple models, and comparing models through techniques like confusion matrices and ROC curves to select the best performing model.
This document discusses various statistical methods for analyzing DNA methylation data, including both global and site-specific analyses. For global analysis, it describes clustering methods like k-means and principal component analysis to identify subgroups with similar methylation profiles. For site-specific analysis, it discusses using linear models and Limma to identify differentially methylated positions and the need to control for multiple testing using methods like Bonferroni correction and false discovery rate. It also mentions checking for influential points that could impact regression results.
This document presents an approach to automatically recover class diagrams from source code by analyzing binary class relationships. It defines consensus definitions for association, aggregation, and composition relationships based on their properties. Algorithms are proposed to recover relationships by analyzing source code statically for properties like invocation site and multiplicity, and dynamically for properties like exclusivity and lifetime. The approach was implemented and shown to accurately recover relationships in case studies.
Online Stochastic Tensor Decomposition for Background Subtraction in Multispe... (ActiveEon)
Background subtraction is an important task for visual surveillance systems. However, this task becomes more complex as data size grows, since real-world scenarios require larger data to be processed more efficiently and, in some cases, continuously. Until now, most background subtraction algorithms were designed for mono or trichromatic cameras within the visible spectrum or the near-infrared part. Recent advances in multispectral imaging technologies make it possible to record multispectral videos for video surveillance applications. Due to the specific nature of these data, many of the bands within multispectral images are often strongly correlated. In addition, processing multispectral images with hundreds of bands can be computationally burdensome. To address these major difficulties of multispectral imaging for video surveillance, this paper proposes an online stochastic framework for tensor decomposition of multispectral video sequences (OSTD). First, experimental evaluations on synthetically generated data show the robustness of OSTD against other state-of-the-art approaches; then, we apply the same idea to seven multispectral video bands to show that RGB features alone are not sufficient to tackle color saturation, illumination variations, and shadows, while the addition of six visible spectral bands together with one near-infrared band provides better background/foreground separation.
Presentation made during the Intelligent User-Adapted Interfaces: Design and Multi-Modal Evaluation Workshop (IUadaptME) workshop conducted as part of UMAP 2018
This document discusses k-nearest neighbors (KNN) classification, an instance-based machine learning algorithm. KNN works by finding the k training examples closest in distance to a new data point, and assigning the most common class among those k neighbors as the prediction for the new point. The document notes that KNN has high variance, since each data point acts as its own hypothesis. It suggests ways to reduce overfitting, such as using KNN with multiple neighbors (k>1), weighting neighbors by distance, and approximating KNN with data structures like k-d trees.
IRJET- Analysis of Chi-Square Independence Test for Naïve Bayes Feature Selec... (IRJET Journal)
This document analyzes using the Chi-Square Independence Test for feature selection in Naive Bayes classification. It uses a student performance dataset to test the Chi-Square Independence Test at different confidence intervals for feature selection. The Chi-Square Test is used to determine whether features are independent or associated with the classification attribute. Features with lower p-values have a stronger association. Naive Bayes models are then built using different feature sets selected at different confidence intervals and evaluated based on their accuracy in 2-class and 5-class classifications of student performance. The results show higher accuracy when using grade features and features selected at higher confidence intervals.
Learning Probabilistic Relational Models using Non-Negative Matrix Factorization
1. Anthony Coutant, Philippe Leray, Hoel Le Capitaine
DUKe (Data, User, Knowledge) Team, LINA
26th June, 2014
Learning Probabilistic Relational Models using Non-Negative Matrix Factorization
7ème Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes
2. Context
• Probabilistic Relational Models (PRM)
– Attribute uncertainty in relational datasets
• Relational datasets: attributes + links
• PRM with Reference Uncertainty (PRM-RU) models link uncertainty
• Partitioning individuals is necessary in PRM-RU
3. Problem & Proposal
• PRM-RU partitions individuals based on attributes only (sketched below)
• We propose to cluster the relationship information instead
• We show that:
– attribute-based partitioning does not explain all relationships
– relational partitioning can explain attribute-driven relationships
4. Flat datasets – Bayesian Networks
• Individuals assumed i.i.d.

P(G1): A = 0.25, B = 0.75
P(G2): A = 0.25, B = 0.75

Dataset:
G1  G2  R
A   B   1st
B   A   1st
B   B   2nd
B   B   2nd

P(R | G1, G2):
               A,A   A,B   B,A   B,B
1st division   0.8   0.5   0.5   0.2
2nd division   0.2   0.5   0.5   0.8

[Figure: Bayesian network with nodes Grade 1 (G1) and Grade 2 (G2), both parents of Ranking (R).]
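As a worked illustration of how such a conditional table can be estimated by counting, here is a small sketch in plain Python (the names are ours, not from the slides). Note that the slide's P(R | G1, G2) is the underlying model, so maximum-likelihood counts over only four rows give cruder, partly degenerate estimates.

    # Maximum-likelihood estimate of P(R | G1, G2) from the toy dataset above.
    from collections import Counter, defaultdict

    data = [("A", "B", "1st"), ("B", "A", "1st"),
            ("B", "B", "2nd"), ("B", "B", "2nd")]

    counts = defaultdict(Counter)
    for g1, g2, r in data:
        counts[(g1, g2)][r] += 1  # count R outcomes per parent configuration

    for parents, dist in sorted(counts.items()):
        total = sum(dist.values())
        for r, c in sorted(dist.items()):
            print(f"P(R={r} | G1,G2={parents}) = {c/total:.2f}")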
5. Relational datasets – Relational schema
[Figure: A relational schema and one of its instances.
Schema: Student (Intelligence, Ranking) --1,n-- Registration (Grade, Satisfaction) --1,n-- Course (Difficulty, Evaluation); each registration refers to exactly one student and one course.
Instance: Student "Jane Doe" (Intelligence: high, Ranking: 1st division) linked through Registration #4563 (Grade: A, Satisfaction: high) to Course "Phil101" (Difficulty: high, Evaluation: high); attribute values are shown as "???" before being observed.]
6. Probabilistic Relational Models (PRM)
P(Ranking | MEAN(Grade)):
               A     B
1st division   0.8   0.2
2nd division   0.2   0.8

[Figure: A PRM over the school schema, with probabilistic dependencies between Difficulty, Intelligence, Grade, Satisfaction, Evaluation, and Ranking, and MEAN aggregators summarizing sets of grades. Below, an instance with courses Math and Phil, registrations #6251, #5621, and #4563, and students John Smith and Jane Doe, all attribute values unobserved.]
7. Probabilistic Relational Models (PRM), continued
P(Ranking | MEAN(Grade)):
               A     B
1st division   0.8   0.2
2nd division   0.2   0.8

[Figure: The same PRM unrolled over the instance into a Ground Bayesian Network (GBN): one node per attribute of each individual (Math.Diff, Math.Eval, Phil.Diff, Phil.Eval, #4563.Grade, #5621.Grade, #6251.Grade, #4563.Satis, #5621.Satis, #6251.Satis, JD.Int, JS.Int, JD.Rank, JS.Rank), connected through MEAN aggregation nodes.]
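To make the unrolling concrete, the following sketch enumerates the GBN's random variables for this instance, one per attribute of each individual (identifiers are the figure's abbreviations; this is our illustration, not the paper's code):

    # Enumerate Ground Bayesian Network variables for the instance above
    # (one random variable per attribute of each individual).
    courses = ["Math", "Phil"]
    students = ["JD", "JS"]                # Jane Doe, John Smith
    registrations = ["#4563", "#5621", "#6251"]

    attributes = {
        "Course": ["Diff", "Eval"],
        "Student": ["Int", "Rank"],
        "Registration": ["Grade", "Satis"],
    }

    gbn_variables = (
        [f"{c}.{a}" for c in courses for a in attributes["Course"]]
        + [f"{s}.{a}" for s in students for a in attributes["Student"]]
        + [f"{r}.{a}" for r in registrations for a in attributes["Registration"]]
    )
    print(gbn_variables)  # 14 ground variables, matching the figure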
8. Uncertainty in Relational datasets
[Figure: Two views of the same instance (Student "Jane Doe", Registration #4563 with Grade A, Course "Phil101" with Evaluation high; remaining attribute values unknown).
Top, attribute uncertainty (PRM): links between individuals are known, only attribute values are uncertain.
Bottom, attribute and link uncertainty (PRM extensions): in addition, which Student and which Course the Registration connects is itself uncertain.]
9. PRM with reference uncertainty
• Reference uncertainty: P(r.Course = c_i, r.Student = s_j | r.exists = true)
• A random variable for each individual id? Not generalizable
• Solution: partitioning
[Figure: PRM-RU structure: the Registration relationship carries selector variables for its Course and Student endpoints, with dependencies involving Difficulty, Intelligence, Evaluation, and Ranking; example queries: P(Course)? and P(Student | Course.Difficulty)?]
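Once a partition is available, selecting an endpoint becomes a two-step draw: pick a cluster from the selector's distribution, then pick an individual uniformly inside it. A minimal sketch, with a hypothetical partition and distribution:

    # Two-step endpoint selection (clusters and probabilities are illustrative).
    import random

    clusters = {"c1": ["s1", "s2", "s3"], "c2": ["s4", "s5"]}  # partition of students
    p_cluster = {"c1": 0.7, "c2": 0.3}  # learned selector distribution over clusters

    def sample_student():
        # Step 1: draw a cluster according to the selector's distribution.
        c = random.choices(list(p_cluster), weights=list(p_cluster.values()))[0]
        # Step 2: draw an individual uniformly within the chosen cluster.
        return random.choice(clusters[c])

    print(sample_student())

The selector's distribution thus only needs as many entries as there are clusters, instead of one per individual id.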
16. Experiments – Protocol – Dataset generation
[Figure: Generation schema: Entity 1 (Att 1, …, Att n) --1,n-- R --1,n-- Entity 2 (Att 1, …, Att n); an instance is a bipartite graph of Entity 1 and Entity 2 individuals connected by R links.]
17. Experiments – Protocol – Dataset generation (continued)
[Figure: Same generation schema, with two generated instances: one favorable to attribute-based partitioning (group membership fully determined by attribute values) and one favorable to relationship-based partitioning (group membership independent of attribute values).]
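As a rough illustration of the two regimes (our sketch, not the paper's exact generator): in one, cluster membership is a deterministic function of the attribute values; in the other, it is drawn independently of them. Links then follow the cluster structure.

    # Illustrative generator for the two synthetic regimes.
    import numpy as np

    def generate(n, k, attribute_driven, seed=0):
        rng = np.random.default_rng(seed)
        atts1 = rng.integers(0, 2, size=(n, 5))  # binary attributes, Entity 1
        atts2 = rng.integers(0, 2, size=(n, 5))  # binary attributes, Entity 2
        if attribute_driven:
            # Group is a deterministic function of the attribute values.
            g1 = np.array([hash(tuple(a)) % k for a in atts1])
            g2 = np.array([hash(tuple(a)) % k for a in atts2])
        else:
            # Group is drawn independently of the attributes.
            g1 = rng.integers(0, k, size=n)
            g2 = rng.integers(0, k, size=n)
        # Strong preference: link individuals whose groups match.
        adjacency = (g1[:, None] == g2[None, :]).astype(float)
        return atts1, atts2, adjacency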
18. Experiments – Protocol – Learning
[Figure: Fixed PRM-RU structure used for learning: cluster variables CE1 and CE2 and selector variables E1 and E2 attached to the Relation, plus the attributes Att 1, …, Att n of Entity 1 and Entity 2.]
• Parameter learning on a fixed structure
• 2 PRMs compared:
– one with attribute-based partitioning
– one with relational partitioning
19. Experiments – Protocol – Evaluation
• For each generated dataset D:
– Split D into 10 subsets {D1, …, D10}
– Perform 10-fold CV, each fold using one Di for test and the others for training
• For the PRM with attribute-based partitioning: store the 10 log-likelihoods PattsLL[i]
• For the PRM with relationship-based partitioning: store the 10 log-likelihoods PrelLL[i]
– Evaluate the mean and standard deviation of PattsLL[i] and PrelLL[i]
– Evaluate the significance of relationship partitioning over attribute partitioning (z-test; a sketch follows below)
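A compact sketch of this protocol (the learner and scorer are hypothetical placeholders; the z-statistic follows the speaker notes' mention of a z-test):

    # 10-fold CV log-likelihood comparison with a z-test (illustrative API).
    from math import sqrt
    import numpy as np

    def cross_validate(dataset, learn, k=10):
        folds = np.array_split(dataset, k)
        lls = []
        for i in range(k):
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            model = learn(train)                        # hypothetical learner
            lls.append(model.log_likelihood(folds[i]))  # hypothetical scorer
        return np.array(lls)

    def z_stat(a, b):
        # z-statistic for the difference of the two mean log-likelihoods.
        return (a.mean() - b.mean()) / sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))

    # Usage (hypothetical learners): |z| > 1.96 means a significant
    # difference at the 5% level.
    # z = z_stat(cross_validate(D, learn_relational),
    #            cross_validate(D, learn_attributes))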
20. Experiments – Results
[Figure: Two result grids over k in {2, 4, 16} clusters and n in {25, 50, 100, 200} individuals.
Left: random clusters (independent from attributes). Right: Attributes => Cluster (fully dependent on attributes).
Green: relational partitioning significantly better than attribute partitioning.
Red: attribute partitioning significantly better than relational partitioning.
Orange: the two partitionings are not significantly comparable.]
21. Experiments – About the NMF choice for partitioning
• NMF
– Finds low-dimensional factor matrices whose product approximates the original matrix
– A relationship between two entities is an adjacency matrix
• Motivation for using NMF (see the sketch after this list)
– Captures (restrictively) latent information from both rows and columns: co-clustering
– Several extensions dedicated to more accurate co-clustering (NMTF)
– Extensions for Laplacian regularization
• Allows capturing both attribute and relationship information for clustering
– Extensions for tensor factorization
• Allows modeling n-ary relationships, n >= 2
– NMF = a good starting choice for the long-term needs?
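The speaker notes mention an NMF minimizing a KL-divergence. A minimal sketch of that idea using scikit-learn (our stand-in, not the paper's implementation): factor the relation's adjacency matrix, then assign each individual on each side to its strongest latent component.

    # KL-divergence NMF co-clustering of a bipartite adjacency matrix.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    A = (rng.random((25, 25)) < 0.2).astype(float)  # toy adjacency matrix

    k = 4  # number of clusters per entity
    model = NMF(n_components=k, beta_loss="kullback-leibler",
                solver="mu", init="nndsvda", max_iter=500)
    W = model.fit_transform(A)  # Entity 1 factors, shape (25, k)
    H = model.components_       # Entity 2 factors, shape (k, 25)

    entity1_clusters = W.argmax(axis=1)  # strongest component per row
    entity2_clusters = H.argmax(axis=0)  # strongest component per column

The argmax read-off is the simplest assignment rule; sharper co-clustering variants (e.g., NMTF) go beyond this sketch.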
22. Experiments – About the NMF choice for partitioning (continued)
• But:
– Performance troubles in the experiments
– Very sensitive to initialization: crashes whenever it reaches a singular state
– Moving toward large-scale methods: graph-based relational clustering?
23. Conclusion
• PRM-RU defines a probability structure over relational datasets
• Partitioning is needed
• PRM-RU uses attribute-oriented partitioning
• We propose to cluster the relationship information instead
• Experiments show that:
– Attribute partitioning does not explain all relationships
– Relational partitioning can explain attribute-oriented relationships
24. Perspectives
• Experiments on real life datasets
• Towards large scale partitioning methods
• PRM-RU Structure Learning using clustering algorithms
• What about other link uncertainty representations?
25. Anthony Coutant, Philippe Leray, Hoel Le Capitaine
DUKe (Data, User, Knowledge) Team, LINA
Questions?
7ème Journées Francophones sur les Réseaux Bayésiens et les Modèles Graphiques Probabilistes
(anthony.coutant | philippe.leray | hoel.lecapitaine)@univ-nantes.fr
Editor's Notes
Many relational datasets involve individuals of different types and various relationships between them.
Specific learning algorithms are needed, since they relax the i.i.d. assumption that is usually made.
PRMs are an extension of Bayesian networks to relational data.
Classical PRMs handle uncertainty over each individual's attributes, assuming the relationships are known.
Yet a relational dataset carries both a descriptive notion of the entities (attributes) and a topological dimension (the way individuals are connected).
It is important to handle topological uncertainty in addition to attribute uncertainty.
Several PRM extensions exist for this: PRM-RU, PRM-EU …
PRM-RU is interesting because it clearly exposes the need for a partitioning to abstract away from a particular dataset.
Here we focus on the choice of a partitioning and its impact on learning these PRM-RUs.
Defining and learning a PRM-RU requires partitioning the individuals involved in a relationship.
However, the only partitioning technique defined in the literature relies solely on the individuals' attributes.
Consequence: the actual topology of the relationship is not taken into account when modeling link uncertainty between individuals!
We propose to partition the individuals topologically instead of using the historical approach.
We show experimentally that: 1) attribute-oriented partitioning does not explain all relationships; 2) relational partitioning can explain both relationships independent of attributes and relationships explained by attributes.
Many machine learning algorithms assume the data are i.i.d. (independent and identically distributed).
Such datasets can be represented as individuals-by-attributes tables.
Here, for example, I consider a dataset where each row represents a student's academic results.
A Bayesian network over this dataset can express probabilistic dependencies / independencies between the attributes of a single student.
For example, I can express the fact that a student's ranking depends on their grades. However, I cannot express the dependency between a student's ranking and the rankings of their peers.
Relational datasets are more complex.
A dataset no longer just follows a tabular set of attributes but obeys the laws of a relational schema.
The relational schema describes the set of entity types existing in the data, the attributes describing those entities, and the possible relationship types between them.
If we extend the previous dataset, we can for example define a "registration" relationship between student individuals and course individuals, each entity and relationship being described by various attributes. An instance of this schema is then a dataset that respects it. An example is given at the bottom.
A PRM is a directed probabilistic graphical model defined over a relational schema.
It represents a probability distribution template defined over the set of possible instances of this schema.
Given a PRM and an instance respecting the same relational schema, it is possible to create an unrolled, flattened Bayesian network called a "Ground Bayesian Network", which allows inference over the attribute values of the various individuals in the dataset.
PRMs only handle uncertainty over the attributes of the individuals and assume the relationships between these individuals are known.
Our goal, at the top for example, is to find the probable difficulty of a course given the students registered for it and the grades they obtained in it.
But it can be interesting to handle the case where the relationships are also uncertain, to perform automatic link detection.
For example, at the bottom, we might want to find the student most probably attached to a registration link for a particular course, given the characteristics of the course, the registration, and the candidate students.
PRM extensions exist for this.
Among the PRM extensions, PRMs with reference uncertainty try to model link uncertainty with a probability distribution of the form P(endpoints | link existence).
To do so, this extension adds distributions over the set of possible individuals at the level of each relationship.
The question to answer now concerns the choice of the probabilistic structure.
Naively, one could try to learn a probability distribution over the set of keys of the individuals involved.
However, this is not generalizable.
A solution proposed in PRM-RU to overcome this problem is to partition the individuals involved in the relationship.
We therefore introduce 2 random variables in total for each endpoint of a relationship.
The idea is then to choose an individual in a 2-step process: 1) choose a group of individuals according to a learned probability distribution; 2) choose an individual from that group uniformly at random.
The complexity of the choice is then reduced, because the dimensions of the distributions to learn at the partition variables are much smaller.
All that remains is to actually partition the individuals.
Each "cluster" variable must be associated with a partition function that groups the individuals.
The "cluster" variable is then defined over the set of clusters of this function.
The problem in PRM-RU comes from the way partition functions are defined, where only the attributes of the individuals of the considered entity are used.
The assumption made is then very strong: the relationship is considered to be fully explained by the attributes of the linked entity types.
Paradoxically, this means that learning the variables used for link detection does not directly use the links of the training dataset.
This is not generalizable to all cases.
Take for example the bipartite graph below, an instance of the binary relationship defined earlier between students and courses.
Here I consider that two individuals of the same color have the same attribute configuration.
Here, if I try to learn distributions over my relationship, the attributes alone will allow good learning, because all red courses are linked to all green students and all blue courses are linked to all purple students. The probability distributions are then informative and discriminating.
If I take a second example, it is no longer so clear.
I will still learn the same probability distributions as before, saying that if a link is attached to a red course, then it will necessarily be attached to a green student.
However, this means that once the color, and thus the group, is chosen, I can pick any green individual at random.
Yet the topology of the relationship exhibits 4 different connected components, separating the red and green individuals, telling me that each individual of a colored pair must be treated differently for this relationship.
In the last case, the relationship is even more mixed and the learned distributions are no longer informative.
I have interesting connected components, but they are masked by consensus probability distributions across the colored individuals.
Our goal is to partition the individuals of a relationship so that the number of links inside each group is maximal.
From this partitioning, our goal is to learn probability distributions that reflect the topology of the relationship as faithfully as possible.
Going back to the last example of the previous slide, I could obtain better parameter learning by splitting my individuals according to the graph's components rather than the individuals' colors.
We wanted to compare the two partitioning techniques and their impact on learning a PRM-RU.
To do so, we built two types of datasets. Each dataset respects the same simple relational schema, composed of a single binary relationship between two entities. Each dataset exhibits different subgroups, with strong preferences for certain subgroups of the other entity.
The difference between the two datasets comes from the distribution of individuals among these subgroups.
The first dataset defines a total implication between an individual's attribute values and its group membership.
The second dataset defines total independence between attributes and assigned group.
The first dataset is expected to favor the attribute-based partitioning approach, and the other the relation-oriented one.
The comparison was made between two parameter learnings of a fixed PRM structure.
Two PRM versions were considered: one with attribute-based partitioning, the other with relational partitioning.
The evaluation protocol is as follows: 1) split a dataset into 10 parts; 2) perform 10-fold cross-validated learning, each time taking one part as the test set; 3) store the 10 log-likelihoods computed for the PRM with attribute partitioning, and do the same for the PRM with relational partitioning; 4) evaluate the significance of the results with a z-test.
The results obtained for each dataset are as follows.
A green square indicates that the relational approach is significantly better than the attribute-based approach. A red square expresses the opposite. An orange square marks a zone where the statistical test cannot decide.
For a dataset with groups independent of attributes, on the left here, we can see that our method is significantly better except when the ratio n / k is very small. Overfitting phenomena can explain this behavior.
On the other side, when attributes explain a relationship, we cannot discern a clear pattern. We can therefore say that our method applies as well as the historical method for these relationships.
Our approach therefore seems more generalizable.
Before concluding, I wanted to talk about the partitioning method we chose, when writing this paper, for the relational approach.
We chose an NMF implementation minimizing a KL-divergence, because this factorization technique has been shown to be equivalent to pLSA, giving us a solid probabilistic interpretation for our problem.
We originally made this choice for various reasons beyond interpretability alone.
(see the slide for the rest)