This document presents AMDC (Active Multi-relational Data Construction), a method for efficiently constructing RDF datasets through active learning and manual annotation. AMDC uses a multi-relational model to predict labels for unlabeled triples and queries annotators about the most informative ones, repeating this learn-and-query cycle so that the dataset is constructed with fewer queries. In experiments, AMDC requires 2.4-19x fewer queries than random-sampling baselines to construct datasets and achieves better predictive performance, demonstrating the benefits of active learning and of AMDC's design.
Active Learning for Multi-relational Data Construction
1. Active Learning for Multi-relational Data Construction
Hiroshi Kajino (1), Akihiro Kishimoto (2), Adi Botea (2), Elizabeth Daly (2), Spyros Kotoulas (2)
1: The University of Tokyo, Japan; 2: IBM Research - Ireland
2. ■ Research focus: Manual RDF data construction
□ Some data are difficult to extract automatically from documents
Q: How can we efficiently construct the dataset by hand?
■ Our solution: Active learning + multi-relational learning
□ Reduce the number of queries as much as possible
We develop a method to support manual RDF data annotation.
[Figure: loop between the multi-relational model and annotators: 1. query labels of informative triples; 2. annotators return labels; 3. update the dataset & retrain the model]
3. ■ Outline
□ Problem settings:
• Multi-relational (RDF) data and their applications
• Two formulations:
– Dataset construction problem
– Predictive model construction problem
□ Our solution (AMDC):
• Active learning
• Multi-relational learning
□ Experiments
4. ■ Outline
□ Problem settings:
• Multi-relational (RDF) data and their applications
• Two formulations:
– Dataset construction problem
– Predictive model construction problem
□ Our solution (AMDC):
• Active learning
• Multi-relational learning
□ Experiments
5. ■ Multi-relational dataset (RDF format)
□ Triple: t = (i, j, k)
• Entity: i, j ∈ E
• Relation: k ∈ R
□ Label:
• t is positive ⇔ entity i is in relation k with entity j
• t is negative ⇔ entity i is not in relation k with entity j
□ Multi-relational dataset: (Δp, Δn), where Δ is the set of all triples,
Δp = {t ∈ Δ | t is positive}, Δn = {t ∈ Δ | t is negative}
• Assume: |Δp| ≪ |Δ|, and some triples remain unlabeled
A multi-relational dataset consists of binary-labeled triples.
[Figure: example graph over entities Dog, Human, Animal with relations "is a part of" and "is the same as"]
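To make the notation concrete, the representation could look like the following minimal Python sketch; the entity and relation names mirror the figure, and the label assignments are purely illustrative:

```python
from itertools import product

entities = ["Dog", "Human", "Animal"]            # E
relations = ["is a part of", "is the same as"]   # R

# Δ: all candidate triples t = (i, j, k)
all_triples = set(product(entities, entities, relations))

# Δp and Δn: the labeled positive and negative triples (illustrative labels)
positives = {("Dog", "Animal", "is a part of"),
             ("Human", "Animal", "is a part of")}
negatives = {("Dog", "Human", "is the same as")}

# Everything else remains unlabeled; typically |Δp| ≪ |Δ|
unlabeled = all_triples - positives - negatives
assert len(positives) < len(all_triples)
```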
6. ■ Motivation of manual construction
□ Knowledge base: human knowledge encoded in RDF
Point: commonsense knowledge rarely appears in documents
→ Difficult to extract it automatically from documents
□ Biological dataset:
→ Some unknown triples require experiments for labeling
Some RDF datasets require hand annotation by nature.
Dataset: positive triple examples
• WordNet [Miller, 95]: (dog, canine, synset), (dog, poodle, hypernym)
• ConceptNet [Liu+, 04]: (saxophone, jazz, UsedFor), (learn, knowledge, MotivatedByGoal)
[Figure: biological example with entities Protein, DNA, Cell cycle and relations "interact", "participate"]
7. ■ Two problem formulations
□ Inputs:
• Set of entities E, relations R, and an annotator O: Δ → {+1, −1}
(the annotator makes no errors and can be accessed B times)
□ Problem 1: Dataset construction problem
• Output: positive triples Δp
• Note: positive triples are usually quite few
□ Problem 2: Predictive model construction problem
• Output: multi-relational model M: Δ → ℝ, a degree of "positiveness"
• Note: the model can predict labels of unlabeled triples
※ A more direct formulation than Problem 1 if the model is the goal
The two problem settings reflect different usages of a dataset.
8. ■ Outline
□ Problem settings:
• Multi-relational (RDF) data and their applications
• Two formulations:
– Dataset construction problem
– Predictive model construction problem
□ Our solution (AMDC):
• Active learning
• Multi-relational learning
□ Experiments
9. ■ Active Multi-relational Data Construction
□ Overview:
Our solution, AMDC, repeats learning and querying B times.
Train the model using the current training dataset (Δp, Δn).
[Figure: loop between the multi-relational model and annotators: 1. query labels of informative triples; 2. annotators return labels; 3. update the dataset & retrain the model]
10. /28
■ Active Multi-relational Data Construction
□ Overview:
Our solution, AMDC, repeats learning and querying B times
[Figure: AMDC loop (see slide 2)]
Training dataset (Δp, Δn)
AMDC can compute a predictive score st for each t ∈ Δu:
larger/smaller st ⇔ the model believes t is positive/negative
11. /28
■ Active Multi-relational Data Construction
□ Overview:
Our solution, AMDC, repeats learning and querying B times
[Figure: AMDC loop (see slide 2)]
Training dataset (Δp, Δn)
Compute a query score qt for each t ∈ Δu using st:
smaller qt ⇔ t is more informative for dataset construction
13. /28
■ Active Multi-relational Data Construction
□ Details:
• Query scores qt
• Multi-relational model, predictive score st
We explain the details of AMDC in two parts
[Figure: AMDC loop (see slide 2)]
15. /28
■ AMDC (1/2): Query scores
□ Given: predictive score st with threshold 0,
s.t. st > 0 (st < 0) ⇔ the model believes t is positive (negative)
□ Query score qt (t ∈ Δ):
Query the labels of the triples with the smallest qt
• Positiveness score (for Problem 1): qt := -st
Choose triples the model believes to be positive
• Uncertainty score (for Problem 2): qt := |st|
Choose triples the model is uncertain about
※ AMDC handles the two problems just by switching the query score
We employ two different query scores for the two problems
[Figure: score axis st with threshold 0 separating pos from neg]
17. /28
■ AMDC (2/2): Multi-relational model
□ RESCAL [Nickel+,11]:
• Model:
– ai ∈ R^D: latent vector of entity i
– Rk ∈ R^(D×D): latent matrix of relation k
• Predictive score: st = ai^T Rk aj
Large/small st ⇔ t is likely to be positive/negative
□ Additional constraints: |ai| = 1, Rk = rotation matrix
• Reduce the degrees of freedom
• Stabilize learning when few labels are available (at the beginning)
(→ experiments)
We add two constraints to RESCAL to avoid overfitting (new)
18. /28
■ AMDC (2/2): Optimization problem for learning
□ pos AUC-loss: s(pos) > s(non-pos)
• Pros: robust to the pos/neg ratio; unlabeled triples are used
• Cons: neg triples are not explicitly used; no threshold between pos/neg
□ neg AUC-loss (new): s(non-neg) > s(neg)
• Pros: neg triples are explicitly used (→ experiments)
• Cons: no threshold between pos/neg
□ Classification loss (new): s(pos) > 0, s(neg) < 0
• Pros: gives a threshold between pos/neg → able to compute the uncertainty score
• Cons: not robust to the pos/neg ratio; difficult to use unlabeled triples
The two new objective functions are added to overcome the cons:
min pos AUC-loss + neg AUC-loss + classification loss
[Figure: score axis st with positive triples above the unlabeled mass and negative triples below]
22. /28
■ AMDC (2/2): Optimization problem
□ Algorithm: stochastic gradient descent (SGD)
□ Parameters: {ai}i∈E, {Rk}k∈R
□ Hyperparameters: γ, γ', Cn, Ce, D
At each iteration, we choose the best model using a validation set
Margin-based loss functions are optimized using SGD
[Objective: margin-based losses enforcing s(pos) > s(non-pos), s(non-neg) > s(neg), and s(pos) > 0, s(neg) < 0]
23. /28
■ Outline
□ Problem settings:
• Multi-relational (RDF) data and their applications
• Two formulations:
– Dataset construction problem
– Predictive model construction problem
□ Our solution (AMDC):
• Active learning
• Multi-relational learning
□ Experiments
24. /28
■ Experiments
□ Purpose: evaluate 3 contributions of AMDC on the two problems
• Query scores (vs. AMDC + random query)
• Constraints on RESCAL (vs. AMDC - constraints)
• neg AUC-loss (vs. AMDC - neg-AUC)
□ Datasets:
• Annotators are simulated
We evaluate the 3 modifications using ablated variants of AMDC

Dataset                  #(Entity)  #(Relation)  #(Pos)   #(Neg)
Kinships [Denham, 73]    104        26           10,790   270,426
Nations [Rummel, 50-65]  125        57           2,565    8,626
UMLS [McCray, 03]        135        49           6,752    886,273
25. /28
■ Experiments (1/2): Dataset construction problem
Score: %(pos triples collected by AMDC)
□ AMDC shows 2.4-19x improvements over Random
□ Negative triples are helpful when they are abundant (K, U)
□ Effects of the constraints are incremental
AMDC has collected 2.4-19 times as many positive triples as the baselines
10 trials, (Q, q) = (10^5, 10^3) ((2×10^3, 10^2) for Nations)
[Figure: completion rate vs. #(queries) on Kinships, Nations, and UMLS; panels compare Full AMDC, Random (AMDC rand), No neg-AUC (AMDC pos only), and No constraints (AMDC no const)]
26. /28
■ Experiments (2/2): Predictive model construction problem
Score: ROC-AUC
□ AMDC often achieves better AUC than Random (K, U)
□ Negative triples are also helpful to improve ROC-AUC
□ Constraints work to prevent overfitting
AMDC has achieved the best predictive score
10 trials, (Q, q) = (10^5, 10^3) ((2×10^3, 10^2) for Nations)
[Figure: ROC-AUC vs. #(queries) on Kinships, Nations, and UMLS; panels compare Full AMDC, Random (AMDC rand), No neg-AUC (AMDC pos only), and No constraints (AMDC no const)]
27. /28
■ Conclusions
□ Manual RDF dataset construction is still needed
• Some datasets require hand annotation by their nature
• Crowdsourcing provides an easy way of recruiting annotators
It's time to consider the manual construction problem!
□ AMDC = active learning + multi-relational learning
• RESCAL-based multi-relational learning
□ 3 key contributions lead to better performance
• Active learning significantly reduces the cost
• Constraints prevent overfitting
• Negative AUC-loss works better on skewed datasets
We consider the problem of manually annotating RDF data.
Our research focus is manual RDF data construction.
The reason we focus on "manual" construction, rather than "automatic" construction, is that some data are difficult to extract automatically from documents.
So we set our research question as: "How can we support the human annotators?"
Our solution to this research question is to combine active learning and multi-relational learning techniques.
We use a model of the RDF dataset, called a "multi-relational model", to reduce the number of queries as much as possible.
We first define the multi-relational dataset in the RDF format
A multi-relational dataset consists of binary-labeled triples
A triple is made up of two “entities” i & j, and one “relation” k. In this example, a dog, a human, and an animal are entities, and “is_a_part_of” and “is_the_same_as” are relations.
We assign a positive label to triple t if “entity i is in relation k with entity j”, and a negative label otherwise.
In this example, as a dog is a part of an animal, this triple is positive, but as a dog is not the same as a human, this triple is negative.
A multi-relational dataset is defined as sets of positive and negative triples.
Here, we assume that positive triples are much fewer than all the triples, and we allow some triples to remain unlabeled.
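To make these definitions concrete, here is a minimal Python sketch of such a dataset; the entity and relation names follow the slide's example, and all variable names are ours, not the paper's:

```python
from itertools import product

# Entities E and relations R from the slide example (illustrative only).
entities = ["dog", "human", "animal"]
relations = ["is_a_part_of", "is_the_same_as"]

# The set of all triples: Δ = E x E x R.
all_triples = set((i, j, k) for i, j in product(entities, entities) for k in relations)

# Labeled subsets Δp (positive) and Δn (negative); |Δp| << |Δ|,
# and every remaining triple is unlabeled (Δu).
positive = {("dog", "animal", "is_a_part_of")}
negative = {("dog", "human", "is_the_same_as")}
unlabeled = all_triples - positive - negative
```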
Then, we give two examples to motivate our manual construction problems.
The first example is a knowledge base, which encodes human knowledge in the RDF format.
A famous dataset is the WordNet, which represents the relations between words.
Another example is the ConceptNet, which represents commonsense knowledge.
The point here is that commonsense knowledge rarely appears in documents.
So it is difficult to extract such a dataset automatically from documents.
The second example is a biological dataset.
Examples include interactions between proteins and DNA, and the participation of chemical compounds in a biological mechanism such as the cell cycle.
To label a triple, researchers have to conduct experiments; therefore, biological datasets must be constructed by hand annotation.
Finally, I’m going to state the formal problem settings of manual dataset construction.
We note that there are two problem settings, depending on how the dataset will be used.
The first problem setting is called a “dataset construction problem”.
The goal of this problem setting is to collect as many positive triples as possible.
If the goal is to obtain a dataset, it is sufficient to collect the positive triples.
The second problem setting is called a “predictive model construction problem”.
The goal of this problem setting is to learn a multi-relational model, which predicts labels of unlabeled triples.
First of all, I will show you the overview of AMDC.
Given the initial training dataset, AMDC trains the multi-relational model using the current dataset.
Then, AMDC can compute predictive scores for the unlabeled triples.
A larger score means that the model believes the triple is likely to be positive.
The model then computes query scores.
A smaller query score means the triple is more informative for the model.
Based on the query scores, the model chooses informative triples and queries their labels.
The annotators return the labels, and AMDC updates the dataset and retrains the model on the updated dataset.
AMDC repeats this procedure B times and finally outputs the model and the dataset.
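This loop can be summarized in a short schematic sketch; the `train`, `predict_score`, `query_score`, and `annotator` callables are placeholders for the components described on the following slides, not the authors' code:

```python
def amdc(positive, negative, unlabeled, annotator, train, predict_score, query_score, B, q):
    """Schematic AMDC loop: repeat learning and querying B times, q queries per round."""
    model = None
    for _ in range(B):
        model = train(positive, negative)                    # retrain on current labels
        s = {t: predict_score(model, t) for t in unlabeled}  # predictive scores s_t
        # ask the annotator about the q triples with the smallest query scores q_t
        for t in sorted(unlabeled, key=lambda t: query_score(s[t]))[:q]:
            label = annotator(t)                             # oracle returns +1 or -1
            (positive if label == +1 else negative).add(t)
            unlabeled.remove(t)
    return model, positive
```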
Then, I’m going to present the details of AMDC.
Query scores are used to choose informative triples.
We design two query scores for the two problem settings.
The query scores are computed based on predictive scores given by the multi-relational model.
We assume that the predictive score has a threshold at 0 to discriminate positive from negative triples.
The first query score, called the "positiveness score", is designed for Problem 1.
It chooses triples the model believes to be positive.
The second score, called the "uncertainty score", is designed for Problem 2; it chooses triples that the model is uncertain about.
AMDC handles the two problem settings just by switching this query score, and therefore, the other parts of AMDC are common between the two problem settings.
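In code, the two query scores from the slide are one-liners (s_t denotes the model's predictive score for triple t):

```python
def positiveness_score(s_t):
    # Problem 1: q_t = -s_t, so the smallest q_t picks the triple
    # the model believes most strongly to be positive.
    return -s_t

def uncertainty_score(s_t):
    # Problem 2: q_t = |s_t|, so the smallest q_t picks the triple
    # whose score is closest to the threshold 0 (most uncertain).
    return abs(s_t)
```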
A multi-relational model of AMDC is based on RESCAL.
RESCAL models each entity as a latent vector and each relation as a latent matrix.
The predictive score of RESCAL is written as st = ai^T Rk aj; the model is trained so that a larger score indicates the triple is more likely to be positive.
We introduce additional constraints to RESCAL in order to stabilize learning when only a few labels are available.
Specifically, the latent vectors are restricted to the unit sphere, and the latent matrices are restricted to rotation matrices.
We confirm the effect of adding these constraints in experiments.
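A small NumPy sketch of the score and the two constraints follows; enforcing the constraints by projection after each update is our assumption, not necessarily how the paper implements them:

```python
import numpy as np

def score(a_i, R_k, a_j):
    """RESCAL predictive score s_t = a_i^T R_k a_j."""
    return a_i @ R_k @ a_j

def project_entity(a):
    """Enforce |a_i| = 1 by renormalizing."""
    return a / np.linalg.norm(a)

def project_relation(R):
    """Project R_k onto the rotation matrices (nearest rotation via SVD)."""
    U, _, Vt = np.linalg.svd(R)
    if np.linalg.det(U @ Vt) < 0:   # enforce det = +1, not just orthogonality
        U[:, -1] *= -1
    return U @ Vt
```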
Our model is trained by solving this optimization problem.
The first term is a typical AUC loss, which encourages the predictive score of a positive triple to be larger than that of a non-positive triple.
As this objective is robust to the positive-negative ratio of the training dataset, it is often used to learn multi-relational models.
However, we find two issues with this objective.
The first issue is that this AUC loss does not distinguish negative triples from unlabeled triples.
In order to use the negative triples effectively, we add a negative counterpart of the AUC loss, which encourages the score of a non-negative triple to be larger than that of a negative triple.
As a result, we can effectively use both positive and negative triples. The effect of adding this term is also checked in the experiments.
The second issue is that the AUC losses cannot learn the threshold that discriminates positive from negative triples, which is necessary to compute the query scores.
So we add a classification loss to calibrate the scores to have the threshold at 0.
As a result, we obtain this optimization problem to learn the model.
We use a stochastic gradient descent algorithm to solve the problem.
We choose the best set of hyperparameters using a validation set.
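Putting the three terms together, a toy version of the objective could look like this; the hinge form and the exact placement of the margins γ, γ' and weights Cn, Ce are assumptions based on the slide, not the paper's formula (and in practice SGD would sample pairs rather than sum over all of them):

```python
def hinge(x):
    return max(0.0, x)

def amdc_objective(s, pos, neg, unlabeled, gamma, gamma_p, C_n, C_e):
    """pos AUC-loss + neg AUC-loss + classification loss over scores s[t]."""
    non_pos = list(neg) + list(unlabeled)
    non_neg = list(pos) + list(unlabeled)
    # pos AUC-loss: encourage s(pos) > s(non-pos) by margin gamma
    pos_auc = sum(hinge(gamma - s[p] + s[t]) for p in pos for t in non_pos)
    # neg AUC-loss: encourage s(non-neg) > s(neg) by margin gamma
    neg_auc = sum(hinge(gamma - s[t] + s[n]) for n in neg for t in non_neg)
    # classification loss: encourage s(pos) > 0 and s(neg) < 0 by margin gamma'
    clf = (sum(hinge(gamma_p - s[p]) for p in pos) +
           sum(hinge(gamma_p + s[n]) for n in neg))
    return pos_auc + C_n * neg_auc + C_e * clf
```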
The main purpose of our experiments is to evaluate 3 contributions of AMDC in both problem settings.
The contributions are the query scores, the constraints on RESCAL, and the negative part of the AUC loss function.
We remove each part of AMDC in turn to create three competing methods.
We use three datasets.
Annotators are simulated using the labels of these datasets.
The first experiment handles the dataset construction problem.
The score here is the percentage of positive triples collected by AMDC.
The x-axis of the charts is the number of queries, and the y-axis is the score.
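For concreteness, the completion rate on the y-axis could be computed as follows (a hypothetical helper, not from the paper):

```python
def completion_rate(collected, all_positive):
    """Fraction of all positive triples that have been collected so far."""
    return len(collected & all_positive) / len(all_positive)
```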
We find that AMDC shows its largest improvement over the random strategy on the UMLS dataset.
Since UMLS is the most skewed dataset, this confirms that AMDC is robust to the positive-negative ratio.
Also, the negative part of the AUC loss function is helpful when there are many negative triples.
However, the additional constraints are not so effective in this context.
The second experiment handles the predictive model construction problem.
The score is the Area Under the ROC curve.
We first find that the full AMDC always achieves the best ROC-AUC.
Kinships and UMLS datasets show significant improvements over the random strategy.
In this problem setting, both the negative part of the AUC loss and the constraints work to improve the performance.
Therefore, we conclude that these two modifications work positively for this problem setting.
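The ROC-AUC used here can be computed directly from the model's predictive scores, e.g. with scikit-learn's roc_auc_score (the arrays below are illustrative):

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 0, 1]                # held-out triple labels (1 = positive, 0 = negative)
y_score = [0.9, -0.1, -0.3, 0.1, 0.7]   # predictive scores s_t from the model
print(roc_auc_score(y_true, y_score))   # ~0.83 for this toy example
```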