Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Uncertainty

Numerous recent works utilize bi-Lipschitz regularization of neural network layers to preserve relative distances between data instances in the feature spaces of each layer. This distance sensitivity with respect to the data aids in tasks such as uncertainty calibration and out-of-distribution (OOD) detection. In previous works, features extracted with a distance sensitive model are used to construct feature covariance matrices which are used in deterministic uncertainty estimation or OOD detection. However, in cases where there is a distribution over tasks, these methods result in covariances which are sub-optimal, as they may not leverage all of the meta information which can be shared among tasks. With the use of an attentive set encoder, we propose to meta learn either diagonal or diagonal plus low-rank factors to efficiently construct task specific covariance matrices. Additionally, we propose an inference procedure which utilizes scaled energy to achieve a final predictive distribution which is well calibrated under a distributional dataset shift.

Jeffrey Willette¹, Hae Beom Lee¹, Juho Lee¹⁻², Sung Ju Hwang¹⁻²
KAIST¹, AITRICS²

Background - Prototypical Networks
Prototypical networks first extract features into a shared metric space, and then
minimize the distance between instances and their classwise centroids.
We utilize a prototypical network style backbone for meta learning in our work,
which can utilize the inverse covariances to compute the Mahalanobis distance.
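As a rough illustration of the idea above, the following NumPy sketch computes class centroids from a support set and classifies queries by squared Mahalanobis distance. The function names, the toy data, and the identity precision matrices are assumptions made for this example and are not taken from the slides.

```python
import numpy as np

def class_prototypes(features, labels, num_classes):
    """Mean feature vector (centroid) of each class in the support set."""
    return np.stack([features[labels == k].mean(axis=0) for k in range(num_classes)])

def mahalanobis_sq(queries, prototypes, precisions):
    """Squared Mahalanobis distance from every query to every class centroid.

    queries: (Q, D), prototypes: (K, D), precisions: (K, D, D) inverse covariances.
    Returns an array of shape (Q, K).
    """
    diff = queries[:, None, :] - prototypes[None, :, :]  # (Q, K, D)
    return np.einsum("qkd,kde,qke->qk", diff, precisions, diff)

# Toy two-class support set (hypothetical data, not from the experiments).
rng = np.random.default_rng(0)
support = np.concatenate([rng.normal(0.0, 1.0, (5, 2)), rng.normal(4.0, 1.0, (5, 2))])
support_labels = np.array([0] * 5 + [1] * 5)
protos = class_prototypes(support, support_labels, num_classes=2)

# Identity precisions reduce the distance to the squared Euclidean case.
precisions = np.stack([np.eye(2), np.eye(2)])
queries = np.array([[0.1, -0.2], [3.8, 4.1]])
dists = mahalanobis_sq(queries, protos, precisions)
pred = dists.argmin(axis=1)  # the nearest centroid gives the predicted class
```

Replacing the identity matrices with class-specific inverse covariances changes only the `precisions` argument; the rest of the classification rule stays the same.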
[1] Protonet - Snell, J., et al. (2017). Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.
Figure 1 from [1]

Problem - Meta Deterministic Uncertainty
Previous successful deterministic uncertainty methods [1, 2] generally require a
post-processing step which utilizes a large training dataset.
This can adversely affect performance in the meta learning setting where:
1. Each task dataset may be small in size.
2. There is no mechanism for learning shared knowledge between tasks.
Figure panels: Protonet [3], ProtoDDU [2], ProtoSNGP [1], Proto Mahalanobis (Ours)
[1] SNGP - Liu, J., et al. (2020). Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems, 33, 7498-7512.
[2] DDU - Mukhoti, J., et al. (2021). Deterministic Neural Networks with Inductive Biases Capture Epistemic and Aleatoric Uncertainty. arXiv preprint arXiv:2102.11582.
[3] Protonet - Snell, J., et al. (2017). Prototypical networks for few-shot learning. Advances in neural information processing systems, 30.

Method - Proto Mahalanobis
Our method utilizes an attentive set encoder [1] in conjunction with a smooth feature
extractor to predict diagonal or diagonal plus low-rank covariance factors.
Encoding support sets as sets allows the encoder to meta learn shared knowledge
over the task distribution.
[1] Set Transformer - Lee, J., et al. (2019, May). Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning (pp. 3744-3753). PMLR.
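One reason a diagonal plus low-rank structure is cheap to work with is that its inverse can be obtained through the Woodbury identity, which only requires solving a small R x R system. The sketch below assumes the covariance is parameterized as diag(d) + U Uᵀ; the slides do not state the exact parameterization, and the random `diag` and `factor` arrays merely stand in for whatever the set encoder would predict.

```python
import numpy as np

def low_rank_precision(diag, factor):
    """Inverse of (diag(d) + U U^T) via the Woodbury identity.

    diag: (D,) positive entries, factor: (D, R) with R much smaller than D.
    Only an R x R linear system is solved instead of a D x D one.
    """
    d_inv = 1.0 / diag
    scaled = d_inv[:, None] * factor                           # D^-1 U, shape (D, R)
    capacitance = np.eye(factor.shape[1]) + factor.T @ scaled  # I + U^T D^-1 U
    correction = scaled @ np.linalg.solve(capacitance, scaled.T)
    return np.diag(d_inv) - correction

rng = np.random.default_rng(0)
D, R = 6, 2
diag = rng.uniform(0.5, 2.0, size=D)  # hypothetical stand-in for predicted diagonal factors
factor = rng.normal(size=(D, R))      # hypothetical stand-in for predicted low-rank factors

covariance = np.diag(diag) + factor @ factor.T
precision = low_rank_precision(diag, factor)
```

The resulting `precision` can be plugged directly into a Mahalanobis distance; with R = 0 the expression reduces to the purely diagonal case.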

Problem - Shift Invariant Softmax
The softmax function is shift invariant: the logits z can be shifted by any
constant and the entropy of the softmax will not change.
Intuitively, if the instance falls very far from the centroid, there should be less
confidence assigned to that instance.
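The shift invariance can be checked numerically. The following minimal sketch uses arbitrary example logits (an assumption, not values from the slides) and compares the softmax and its entropy before and after subtracting a constant.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtracting the max is itself a shift, so the result is unchanged
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

z = np.array([2.0, 1.0, 0.5])  # arbitrary example logits
near = softmax(z)
far = softmax(z - 100.0)       # the same logits shifted by a constant
```

Both calls return the same distribution, so a query whose logits are uniformly much lower (for example because it lies far from every centroid) receives exactly the same confidence as a nearby one.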

Method - Logit Normal Softmax with Scaled Energy
We solve this by using a logit-normal softmax distribution. The mu value is the log
Gaussian density, and sigma is a log sum exponential of the energy.
Intuitively, the variance rises with the minimum energy magnitude. If the minimum
energy is high, then the resulting mean in softmax space will be more uniform.
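The effect of the variance on the predictive mean can be illustrated with a Monte Carlo estimate of a logit-normal softmax. This is only a sketch under stated assumptions: the logits are sampled with an isotropic variance, `mu` holds made-up per-class log densities, and `sigma` is passed in directly rather than derived from the energies, since the exact scaling is not reproduced in the slides.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_normal_predictive(mu, sigma, num_samples=20000, seed=0):
    """Monte Carlo mean of softmax(z) with z ~ Normal(mu, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.normal(size=(num_samples, mu.shape[0]))
    return softmax(z).mean(axis=0)

def entropy(p):
    return -np.sum(p * np.log(p))

mu = np.array([-1.0, -4.0, -6.0])  # hypothetical per-class log densities
confident = logit_normal_predictive(mu, sigma=0.1)   # small variance
uncertain = logit_normal_predictive(mu, sigma=10.0)  # large variance
```

With a small `sigma` the estimate stays close to the plain softmax of `mu`, while a large `sigma` pulls the averaged prediction toward the uniform distribution, which is the behavior described above for high-energy inputs.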

Experiments - Omniglot-C
We constructed a corrupted version of the Omniglot dataset based on common
corruptions [1].
As the level of corruption increases, Proto Mahalanobis models maintain better
calibration.
[1] Hendrycks, D., & Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.

Conclusion
We look forward to seeing you at our poster session. Thanks for watching!

What's hot

Learning to compare: relation network for few shot learning

Simon John

Knowledge distillation deeplab

Frozen Paradise

Digit recognition using mnist database

btandale

Brief review of our paper published in ICLR2020: "Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network." Abstract: One of the biggest issues in deep learning theory is the generalization ability of networks with huge model size. The classical learning theory suggests that overparameterized models cause overfitting. However, practically used large deep models avoid overfitting, which is not well explained by the classical approaches. To resolve this issue, several attempts have been made. Among them, the compression based bound is one of the promising approaches. However, the compression based bound can be applied only to a compressed network, and it is not applicable to the non-compressed original network. In this paper, we give a unified frame-work that can convert compression based bounds to those for non-compressed original networks. The bound gives even better rate than the one for the compressed network by improving the bias term. By establishing the unified frame-work, we can obtain a data dependent generalization error bound which gives a tighter evaluation than the data independent ones.

Iclr2020: Compression based bound for non-compressed network: unified general...

Taiji Suzuki

Unsupervised learning aims to learn meaningful representations from unlabeled data which can captures its intrinsic structure, that can be transferred to downstream tasks. Meta-learning, whose objective is to learn to generalize across tasks such that the learned model can rapidly adapt to a novel task, shares the spirit of unsupervised learning in that the both seek to learn more effective and efficient learning procedure than learning from scratch. The fundamental difference of the two is that the most meta-learning approaches are supervised, assuming full access to the labels. However, acquiring labeled dataset for meta-training not only is costly as it requires human efforts in labeling but also limits its applications to pre-defined task distributions. In this paper, we propose a principled unsupervised meta-learning model, namely Meta-GMVAE, based on Variational Autoencoder (VAE) and set-level variational inference. Moreover, we introduce a mixture of Gaussian (GMM) prior, assuming that each modality represents each class-concept in a randomly sampled episode, which we optimize with Expectation-Maximization (EM). Then, the learned model can be used for downstream few-shot classification tasks, where we obtain task-specific parameters by performing semi-supervised EM on the latent representations of the support and query set, and predict labels of the query set by computing aggregated posteriors. We validate our model on Omniglot and Mini-ImageNet datasets by evaluating its performance on downstream few-shot classification tasks. The results show that our model obtain impressive performance gains over existing unsupervised meta-learning baselines, even outperforming supervised MAML on a certain setting.

Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning

MLAI2

Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace

Yoonho Lee

Meta learning with memory augmented neural network

Katy Lee

Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Anal...

JaeJun Yoo

Optimization as a model for few shot learning

Katy Lee

A neural filtering technique is proposed in this paper for restoring the images extremely corrupted with random valued impulse noise. The proposed intelligent filter is carried out in two stages. In first stage the corrupted image is filtered by applying an asymmetric trimmed median filter. An asymmetric trimmed median filtered output image is suitably combined with a feed forward neural network in the second stage. The internal parameters of the feed forward neural network are adaptively optimized by training of three well known images. This is quite effective in eliminating random valued impulse noise. Simulation results show that the proposed filter is superior in terms of eliminating impulse noise as well as preserving edges and fine details of digital images and results are compared with other existing nonlinear filters.

Random Valued Impulse Noise Elimination using Neural Filter

Editor IJCATR

Jaejun Yoo / Naver Clova AI researcher, Ph.D. [Outline] Why bother Style Transfer? What is Style Transfer? Developments of Style Transfer Why it works? Why VGG? Limitations of Neural Algorithm Recent trends (artistic & photorealistic) (Artistic) - TextureNet - Instance Norm - Conditional Instance Norm - Adaptive Instance Norm - Whitening & Coloring Transform (Photorealistic) - Deep Photo Style Transfer - PhotoWCT - WCT^2 (https://github.com/clovaai/WCT2) / video: https://www.youtube.com/watch?v=o-AgHt1VA30

A beginner's guide to Style Transfer and recent trends

JaeJun Yoo

Handwritten Digit Recognition(Convolutional Neural Network) PPT

RishabhTyagi48

Comparison of Learning Algorithms for Handwritten Digit Recognition

Safaa Alnabulsi

Abstract Face recognition is a form of computer vision that uses faces to identify a person or verify a person’s claimed identity. In this paper, a neural based algorithm is presented, to detect frontal views of faces. The dimensionality of input face image is reduced by the Principal component analysis and the Classification is by the neural back propagation network. This method is robust for a dataset of 300 face images and has better performance in terms of 80 – 90 % recognition rate.

Face Recognition Using Neural Networks

CSCJournals

Presentation slide of our NeurIPS2020 paper "Generalization bound of globally optimal non convex neural network training: Transportation map estimation by infinite dimensional Langevin dynamics" Abstract: We introduce a new theoretical framework to analyze deep learning optimization with connection to its generalization error. Existing frameworks such as mean field theory and neural tangent kernel theory for neural network optimization analysis typically require taking limit of infinite width of the network to show its global convergence. This potentially makes it difficult to directly deal with finite width network; especially in the neural tangent kernel regime, we cannot reveal favorable properties of neural networks {\it beyond kernel methods}. To realize more natural analysis, we consider a completely different approach in which we formulate the parameter training as a transportation map estimation and show its global convergence via the theory of the {\it infinite dimensional Langevin dynamics}. This enables us to analyze narrow and wide networks in a unifying manner. Moreover, we give generalization gap and excess risk bounds for the solution obtained by the dynamics. The excess risk bound achieves the so-called fast learning rate. In particular, we show an exponential convergence for a classification problem and a minimax optimal rate for a regression problem.

[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...

Taiji Suzuki

Visualizaing and understanding convolutional networks

SungminYou

Digit recognition

btandale

Perceptron and Sigmoid Neurons

Shajun Nisha

Few shot learning/ one shot learning/ machine learning

ﺁﺻﻒ ﻋﻠﯽ ﻣﯿﺮ

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...

taeseon ryu

What's hot (20)

Learning to compare: relation network for few shot learning

Knowledge distillation deeplab

Digit recognition using mnist database

Iclr2020: Compression based bound for non-compressed network: unified general...

Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning

Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace

Meta learning with memory augmented neural network

Rethinking Data Augmentation for Image Super-resolution: A Comprehensive Anal...

Optimization as a model for few shot learning

Random Valued Impulse Noise Elimination using Neural Filter

A beginner's guide to Style Transfer and recent trends

Handwritten Digit Recognition(Convolutional Neural Network) PPT

Comparison of Learning Algorithms for Handwritten Digit Recognition

Face Recognition Using Neural Networks

[NeurIPS2020 (spotlight)] Generalization bound of globally optimal non convex...

Visualizaing and understanding convolutional networks

Digit recognition

Perceptron and Sigmoid Neurons

Few shot learning/ one shot learning/ machine learning

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Eff...

Similar to Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Uncertainty

Classification is one of the most important task in application areas of artificial neural networks (ANN).Training neural networks is a complex task in the supervised learning field of research. The main difficulty in adopting ANN is to find the most appropriate combination of learning, transfer and training function for the classification task. We compared the performances of three types of training algorithms in feed forward neural network for brain hematoma classification. In this work we have selected Gradient Descent based backpropagation, Gradient Descent with momentum, Resilence backpropogation algorithms. Under conjugate based algorithms, Scaled Conjugate back propagation, Conjugate Gradient backpropagation with Polak-Riebreupdates(CGP) and Conjugate Gradient backpropagation with Fletcher-Reeves updates (CGF).The last category is Quasi Newton based algorithm, under this BFGS, Levenberg-Marquardt algorithms are selected. Proposed work compared training algorithm on the basis of mean square error, accuracy, rate of convergence and correctness of the classification. Our conclusion about the training functions is based on the simulation results

Comparison of Neural Network Training Functions for Hematoma Classification i...

IOSR Journals

Introduction Of Artificial neural network

Nagarajan

This study proposes Artificial Neural Network ANN based field strength prediction models for the rural areas of Abuja, the federal capital territory of Nigeria. The ANN based models were created on bases of the Generalized Regression Neural network GRNN and the Multi Layer Perceptron Neural Network MLP NN . These networks were created, trained and tested for field strength prediction using received power data recorded at 900MHz from multiple Base Transceiver Stations BTSs distributed across the rural areas. Results indicate that the GRNN and MLP NN based models with Root Mean Squared Error RMSE values of 4.78dBm and 5.56dBm respectively, offer significant improvement over the empirical Hata Okumura counterpart, which overestimates the signal strength by an RMSE value of 20.17dBm. Deme C. Abraham ""Mobile Network Coverage Determination at 900MHz for Abuja Rural Areas using Artificial Neural Networks"" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4 | Issue-2 , February 2020, URL: https://www.ijtsrd.com/papers/ijtsrd30228.pdf Paper Url : https://www.ijtsrd.com/computer-science/artificial-intelligence/30228/mobile-network-coverage-determination-at-900mhz-for-abuja-rural-areas-using-artificial-neural-networks/deme-c-abraham

Mobile Network Coverage Determination at 900MHz for Abuja Rural Areas using A...

ijtsrd

nonlinear_rmt.pdf

GieTe

Survey on Artificial Neural Network Learning Technique Algorithms

IRJET Journal

Using Multi-layered Feed-forward Neural Network (MLFNN) Architecture as Bidir...

IOSR Journals

IRJET- Prediction of Autism Spectrum Disorder using Deep Learning: A Survey

IRJET Journal

D028018022

researchinventy

There is a vast amount of researched literature available on Route Finding and Link Establishment in MANET protocols based on various concepts such as “pro-active”, “reactive”, “power awareness”, “cross-layering” etc. Most of these techniques are rather restrictive, taking into account a few of the several aspects that go into effective route establishment. When we look at practical implementations of MANETs, we have to take into account various factors in totality, not in isolation. The several factors that decide and influence the routing have to be considered as a whole in the difficult task of finding the best solution in route finding and optimization. The inputs to the system are manifold and apparently unrelated. Most of the parameters are imprecise or non-crisp in nature. The uncertainty and imprecision lead to think that intelligent routing techniques are essential and important in evolving robust and dependable solutions to route finding. The obvious method by which this can be achieved is the deployment of soft computing techniques such as Neural Nets, Fuzzy Logic and Genetic algorithms. Neural Networks help us to solve the complex problem of transforming the inputs to outputs without apriori knowledge of what the relationship is between inputs and outputs. Fuzzy Logic helps us to deal with imprecise and ill-conditioned data. Genetic Algorithms help us to select the best possible solution from the solution space in an optimal sense. Our paper presented here below seeks to explore new horizons in this direction. The results of our experimentation have been very satisfactory and we have achieved the goal of optimal route finding to a large extent. There is of course considerable room for further refinements.

New Generation Routing Protocol over Mobile Ad Hoc Wireless Networks based on...

ijasuc

New Generation Routing Protocol over Mobile Ad Hoc Wireless Networks based on...

ijasuc

International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.

B42010712

IJERA Editor

This work aims to introduce a novel approach for auxiliary task guidance (ATG). In this approach, our goal is to achieve effective guidance from a suitable auxiliary task by utilizing the uncertainty in calculated gradients for a mini-batch of samples. Our method calculates a probabilistic fitness factor of the auxiliary task gradient for each of the shared weights to guide the main task at every training step of mini-batch gradient descent. We have shown that this proposed factor incorporates task specific confidence of learning to manipulate ATG in an effective manner. For studying the potency of the method, monocular visual odometry (VO) has been chosen as an application. Substantial experiments have been done on the KITTI VO dataset for solving monocular VO with a simple convolutional neural network (CNN) architecture. Corresponding results show that our ATG method significantly boosts the performance of supervised learning for VO. It also out performs state-of-the-art (SOTA) auxiliary guided methods we applied for VO. The proposed method is able to achieve decent scores (in some cases competitive)compared to existing SOTA supervised monocular VO algorithms, while keeping an exceptionally low parameter space in supervised regime.

Boosting auxiliary task guidance: a probabilistic approach

IAESIJAI

Artificial neural networks are models inspired by human nervous system that is capable of learning. One of the important applications of artificial neural network is character Recognition. Character Recognition finds its application in number of areas, such as banking, security products, hospitals, in robotics also. This paper is based on a system that recognizes a english numeral, given by the user, which is already trained on the features of the numbers to be recognized using NNT (Neural network toolbox) .The system has a neural network as its core, which is first trained on a database. The training of the neural network extracts the features of the English numbers and stores in the database. The next phase of the system is to recognize the number given by the user. The features of the number given by the user are extracted and compared with the feature database and the recognized number is displayed.

Neural network based numerical digits recognization using nnt in matlab

ijcses

Extracted pages from Neural Fuzzy Systems.docx

dannyabe

Levenberg marquardt-algorithm-for-karachi-stock-exchange-share-rates-forecast...

Cemal Ardil

Short Term Load Forecasting (STLF) can predict load from several minutes to week plays the vital role to address challenges such as optimal generation, economic scheduling, dispatching and contingency analysis. This paper uses Multi-Layer Perceptron (MLP) Artificial Neural Network (ANN) technique to perform STFL but long training time and convergence issues caused by bias, variance and less generalization ability, unable this algorithm to accurately predict future loads. This issue can be resolved by various methods of Bootstraps Aggregating (Bagging) (like disjoint partitions, small bags, replica small bags and disjoint bags) which helps in reducing variance and increasing generalization ability of ANN. Moreover, it results in reducing error in the learning process of ANN. Disjoint partition proves to be the most accurate Bagging method and combining outputs of this method by taking mean improves the overall performance. This method of combining several predictors known as Ensemble Artificial Neural Network (EANN) outperform the ANN and Bagging method by further increasing the generalization ability and STLF accuracy.

Short Term Load Forecasting Using Bootstrap Aggregating Based Ensemble Artifi...

Kashif Mehmood

International Journal of Computational Engineering Research(IJCER)

ijceronline

In this work, the TREPAN algorithm is enhanced and extended for extracting decision trees from neural networks. We empirically evaluated the performance of the algorithm on a set of databases from real world events. This benchmark enhancement was achieved by adapting Single-test TREPAN and C4.5 decision tree induction algorithms to analyze the datasets. The models are then compared with X-TREPAN for comprehensibility and classification accuracy. Furthermore, we validate the experimentations by applying statistical methods. Finally, the modified algorithm is extended to work with multi-class regression problems and the ability to comprehend generalized feed forward networks is achieved.

X trepan an extended trepan for

ijaia

C42021115

IJERA Editor

Ffnn

guestd60a613

Similar to Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic Uncertainty (20)

Comparison of Neural Network Training Functions for Hematoma Classification i...

Introduction Of Artificial neural network

Mobile Network Coverage Determination at 900MHz for Abuja Rural Areas using A...

nonlinear_rmt.pdf

Survey on Artificial Neural Network Learning Technique Algorithms

Using Multi-layered Feed-forward Neural Network (MLFNN) Architecture as Bidir...

IRJET- Prediction of Autism Spectrum Disorder using Deep Learning: A Survey

D028018022

New Generation Routing Protocol over Mobile Ad Hoc Wireless Networks based on...

B42010712

Boosting auxiliary task guidance: a probabilistic approach

Neural network based numerical digits recognization using nnt in matlab

Extracted pages from Neural Fuzzy Systems.docx

Levenberg marquardt-algorithm-for-karachi-stock-exchange-share-rates-forecast...

Short Term Load Forecasting Using Bootstrap Aggregating Based Ensemble Artifi...

International Journal of Computational Engineering Research(IJCER)

X trepan an extended trepan for

C42021115

Ffnn

Online Hyperparameter Meta-Learning with Hypergradient Distillation

MLAI2

Multilingual models jointly pretrained on multiple languages have achieved remarkable performance on various multilingual downstream tasks. Moreover, models finetuned on a single monolingual downstream task have shown to generalize to unseen languages. In this paper, we first show that it is crucial for those tasks to align gradients between them in order to maximize knowledge transfer while minimizing negative transfer. Despite its importance, the existing methods for gradient alignment either have a completely different purpose, ignore inter-task alignment, or aim to solve continual learning problems in rather inefficient ways. As a result of the misaligned gradients between tasks, the model suffers from severe negative transfer in the form of catastrophic forgetting of the knowledge acquired from the pretraining. To overcome the limitations, we propose a simple yet effective method that can efficiently align gradients between tasks. Specifically, we perform each inner-optimization by sequentially sampling batches from all the tasks, followed by a Reptile outer update. Thanks to the gradients aligned between tasks by our method, the model becomes less vulnerable to negative transfer and catastrophic forgetting. We extensively validate our method on various multi-task learning and zero-shot cross-lingual transfer tasks, where our method largely outperforms all the relevant baselines we consider.

Sequential Reptile_Inter-Task Gradient Alignment for Multilingual Learning

MLAI2

While deep reinforcement learning methods have shown impressive results in robot learning, their sample inefficiency makes the learning of complex, long-horizon behaviors with real robot systems infeasible. To mitigate this issue, meta-reinforcement learning methods aim to enable fast learning on novel tasks by learning how to learn. Yet, the application has been limited to short-horizon tasks with dense rewards. To enable learning long-horizon behaviors, recent works have explored leveraging prior experience in the form of offline datasets without reward or task annotations. While these approaches yield improved sample efficiency, millions of interactions with environments are still required to solve complex tasks. In this work, we devise a method that enables meta-learning on long-horizon, sparse-reward tasks, allowing us to solve unseen target tasks with orders of magnitude fewer environment interactions. Our core idea is to leverage prior experience extracted from offline datasets during meta-learning. Specifically, we propose to (1) extract reusable skills and a skill prior from offline datasets, (2) meta-train a high-level policy that learns to efficiently compose learned skills into long-horizon behaviors, and (3) rapidly adapt the meta-trained policy to solve an unseen target task. Experimental results on continuous control tasks in navigation and manipulation demonstrate that the proposed method can efficiently solve long-horizon novel target tasks by combining the strengths of meta-learning and the usage of offline datasets, while prior approaches in RL, meta-RL, and multi-task RL require substantially more environment interactions to solve the tasks.

Skill-Based Meta-Reinforcement Learning

MLAI2

Graph neural networks have recently achieved remarkable success in representing graph-structured data, with rapid progress in both the node embedding and graph pooling methods. Yet, they mostly focus on capturing information from the nodes considering their connectivity, and not much work has been done in representing the edges, which are essential components of a graph. However, for tasks such as graph reconstruction and generation, as well as graph classification tasks for which the edges are important for discrimination, accurately representing edges of a given graph is crucial to the success of the graph representation learning. To this end, we propose a novel edge representation learning framework based on Dual Hypergraph Transformation (DHT), which transforms the edges of a graph into the nodes of a hypergraph. This dual hypergraph construction allows us to apply message-passing techniques for node representations to edges. After obtaining edge representations from the hypergraphs, we then cluster or drop edges to obtain holistic graph-level edge representations. We validate our edge representation learning method with hypergraphs on diverse graph datasets for graph representation and generation performance, on which our method largely outperforms existing graph representation learning methods. Moreover, our edge representation learning and pooling method also largely outperforms state-of-theart graph pooling methods on graph classification, not only because of its accurate edge representation learning, but also due to its lossless compression of the nodes and removal of irrelevant edges for effective message-passing. Code is available at https://github.com/harryjo97/EHGNN.

Edge Representation Learning with Hypergraphs

MLAI2

Recently, utilizing reinforcement learning (RL) to generate molecules with desired properties has been highlighted as a promising strategy for drug design. Molecular docking program -- a physical simulation that estimates protein-small molecule binding affinity -- can be an ideal reward scoring function for RL, as it is a straightforward proxy of the therapeutic potential. Still, two imminent challenges exist for this task. First, the models often fail to generate chemically realistic and pharmacochemically acceptable molecules. Second, the docking score optimization is a difficult exploration problem that involves many local optima and less smooth surface with respect to molecular structure. To tackle these challenges, we propose a novel RL framework that generates pharmacochemically acceptable molecules with large docking scores. Our method -- Fragment-based generative RL with Explorative Experience replay for Drug design (FREED) -- constrains the generated molecules to a realistic and qualified chemical space and effectively explores the space to find drugs by coupling our fragment-based generation method and a novel error-prioritized experience replay (PER). We also show that our model performs well on both de novo and scaffold-based schemes. Our model produces molecules of higher quality compared to existing methods while achieving state-of-the-art performance on two of three targets in terms of the docking scores of the generated molecules. We further show with ablation studies that our method, predictive error-PER (FREED(PE)), significantly improves the model performance.

Hit and Lead Discovery with Explorative RL and Fragment-based Molecule Genera...

MLAI2

Most existing set encoding algorithms operate under the implicit assumption that all the set elements are accessible, and that there are ample computational and memory resources to load the set into memory during training and inference. However, both assumptions fail when the set is excessively large such that it is impossible to load all set elements into memory, or when data arrives in a stream. To tackle such practical challenges in large-scale set encoding, the general set-function constraints of permutation invariance and equivariance are not sufficient. We introduce a new property termed Mini-Batch Consistency (MBC) that is required for large scale mini-batch set encoding. Additionally, we present a scalable and efficient attention-based set encoding mechanism that is amenable to mini-batch processing of sets, and capable of updating set representations as data arrives. The proposed method adheres to the required symmetries of invariance and equivariance as well as maintaining MBC for any partition of the input set. We perform extensive experiments and show that our method is computationally efficient and results in rich set encoding representations for set-structured data.

Mini-Batch Consistent Slot Set Encoder For Scalable Set Encoding

MLAI2

Most conventional Neural Architecture Search (NAS) approaches are limited in that they only generate architectures without searching for the optimal parameters. While some NAS methods handle this issue by utilizing a supernet trained on a large-scale dataset such as ImageNet, they may be suboptimal if the target tasks are highly dissimilar from the dataset the supernet is trained on. To address such limitations, we introduce a novel problem of Neural Network Search (NNS), whose goal is to search for the optimal pretrained network for a novel dataset and constraints (e.g. number of parameters), from a model zoo. Then, we propose a novel framework to tackle the problem, namely Task-Adaptive Neural Network Search (TANS). Given a model zoo that consists of networks pretrained on diverse datasets, we use a novel amortized meta-learning framework to learn a cross-modal latent space with contrastive loss, to maximize the similarity between a dataset and a high-performing network on it, and minimize the similarity between irrelevant dataset-network pairs. We validate the effectiveness and efficiency of our method on ten real-world datasets, against existing NAS/AutoML baselines. The results show that our method instantly retrieves networks that outperform models obtained with the baselines with significantly fewer training steps to reach the target performance, thus minimizing the total cost of obtaining a task-optimal network. Our code and the model zoo are available at https://anonymous.4open.science/r/TANS-33D6

Task Adaptive Neural Network Search with Meta-Contrastive Learning

MLAI2

While existing federated learning approaches mostly require that clients have fully-labeled data to train on, in realistic settings, data obtained at the client-side often comes without any accompanying labels. Such deficiency of labels may result from either high labeling cost, or difficulty of annotation due to the requirement of expert knowledge. Thus the private data at each client may be either partly labeled, or completely unlabeled with labeled data being available only at the server, which leads us to a new practical federated learning problem, namely Federated Semi-Supervised Learning (FSSL). In this work, we study two essential scenarios of FSSL based on the location of the labeled data. The first scenario considers a conventional case where clients have both labeled and unlabeled data (labels-at-client), and the second scenario considers a more challenging case, where the labeled data is only available at the server (labels-at-server). We then propose a novel method to tackle the problems, which we refer to as Federated Matching (FedMatch). FedMatch improves upon naive combinations of federated learning and semi-supervised learning approaches with a new inter-client consistency loss and decomposition of the parameters for disjoint learning on labeled and unlabeled data. Through extensive experimental validation of our method in the two different scenarios, we show that our method outperforms both local semi-supervised learning and baselines which naively combine federated learning with semi-supervised learning.

Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint L...

MLAI2

Graph neural networks have been widely used for modeling graph data, achieving impressive results on node classification and link prediction tasks. Yet, obtaining an accurate representation for a graph further requires a pooling function that maps a set of node representations into a compact form. A simple sum or average over all node representations considers all node features equally, without consideration of their task relevance or any structural dependencies among them. Recently proposed hierarchical graph pooling methods, on the other hand, may yield the same representation for two different graphs that are distinguished by the Weisfeiler-Lehman test, as they suboptimally preserve information from the node features. To tackle these limitations of existing graph pooling methods, we first formulate the graph pooling problem as a multiset encoding problem with auxiliary information about the graph structure, and propose a Graph Multiset Transformer (GMT), which is a multi-head attention based global pooling layer that captures the interaction between nodes according to their structural dependencies. We show that GMT satisfies both injectiveness and permutation invariance, such that it is at most as powerful as the Weisfeiler-Lehman graph isomorphism test. Moreover, our methods can be easily extended to the previous node clustering approaches for hierarchical graph pooling. Our experimental results show that GMT significantly outperforms state-of-the-art graph pooling methods on graph classification benchmarks with high memory and time efficiency, and obtains even larger performance gains on graph reconstruction and generation tasks.

Accurate Learning of Graph Representations with Graph Multiset Pooling

MLAI2

Recently, sequence-to-sequence (seq2seq) models with the Transformer architecture have achieved remarkable performance on various conditional text generation tasks, such as machine translation. However, most of them are trained with teacher forcing with the ground truth label given at each time step, without being exposed to incorrectly generated tokens during training, which hurts their generalization to unseen inputs; this is known as the “exposure bias” problem. In this work, we propose to mitigate the conditional text generation problem by contrasting positive pairs with negative pairs, such that the model is exposed to various valid or incorrect perturbations of the inputs, for improved generalization. However, training the model with a naïve contrastive learning framework using random non-target sequences as negative examples is suboptimal, since they are easily distinguishable from the correct output, especially so with models pretrained with large text corpora. Also, generating positive examples requires domain-specific augmentation heuristics which may not generalize over diverse domains. To tackle this problem, we propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models. Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood, and positive examples by adding large perturbations while enforcing it to have a high conditional likelihood. Such “hard” positive and negative pairs generated using our method guide the model to better distinguish correct outputs from incorrect ones. We empirically show that our proposed method significantly improves the generalization of seq2seq models on three text generation tasks: machine translation, text summarization, and question generation.

Contrastive Learning with Adversarial Perturbations for Conditional Text Gene...

MLAI2

Although recent multi-task learning methods have been shown to be effective in improving the generalization of deep neural networks, they should be used with caution for safety-critical applications, such as clinical risk prediction. This is because even if they achieve improved task-average performance, they may still yield degraded performance on individual tasks, which may be critical (e.g., prediction of mortality risk). Existing asymmetric multi-task learning methods tackle this negative transfer problem by performing knowledge transfer from tasks with low loss to tasks with high loss. However, using loss as a measure of reliability is risky since it could be a result of overfitting. In the case of time-series prediction tasks, knowledge learned for one task (e.g., predicting the sepsis onset) at a specific timestep may be useful for learning another task (e.g., prediction of mortality) at a later timestep, but lack of loss at each timestep makes it difficult to measure the reliability at each timestep. To capture such dynamically changing asymmetric relationships between tasks in time-series data, we propose a novel temporal asymmetric multi-task learning model that performs knowledge transfer from certain tasks/timesteps to relevant uncertain tasks, based on feature-level uncertainty. We validate our model on multiple clinical risk prediction tasks against various deep learning models for time-series prediction, which our model significantly outperforms, without any sign of negative transfer. Further qualitative analysis of learned knowledge graphs by clinicians shows that they are helpful in analyzing the predictions of the model. Our final code is available at https://github.com/anhtuan5696/TPAMTL.

Clinical Risk Prediction with Temporal Probabilistic Asymmetric Multi-Task Le...

MLAI2

Regularization and transfer learning are two popular techniques to enhance generalization on unseen data, which is a fundamental problem of machine learning. Regularization techniques are versatile, as they are task- and architecture-agnostic, but they do not exploit a large amount of data available. Transfer learning methods learn to transfer knowledge from one domain to another, but may not generalize across tasks and architectures, and may introduce new training cost for adapting to the target task. To bridge the gap between the two, we propose a transferable perturbation, MetaPerturb, which is meta-learned to improve generalization performance on unseen data. MetaPerturb is implemented as a set-based lightweight network that is agnostic to the size and the order of the input, which is shared across the layers. Then, we propose a meta-learning framework, to jointly train the perturbation function over heterogeneous tasks in parallel. As MetaPerturb is a set-function trained over diverse distributions across layers and tasks, it can generalize to heterogeneous tasks and architectures. We validate the efficacy and generality of MetaPerturb trained on a specific source domain and architecture, by applying it to the training of diverse neural architectures on heterogeneous target datasets against various regularizers and fine-tuning. The results show that the networks trained with MetaPerturb significantly outperform the baselines on most of the tasks and architectures, with a negligible increase in the parameter size and no hyperparameters to tune.

MetaPerturb: Transferable Regularizer for Heterogeneous Tasks and Architectures

MLAI2

Existing adversarial learning approaches mostly use class labels to generate adversarial samples that lead to incorrect predictions, which are then used to augment the training of the model for improved robustness. While some recent works propose semi-supervised adversarial learning methods that utilize unlabeled data, they still require class labels. However, do we really need class labels at all, for adversarially robust training of deep neural networks? In this paper, we propose a novel adversarial attack for unlabeled data, which makes the model confuse the instance-level identities of the perturbed data samples. Further, we present a self-supervised contrastive learning framework to adversarially train a robust neural network without labeled data, which aims to maximize the similarity between a random augmentation of a data sample and its instance-wise adversarial perturbation. We validate our method, Robust Contrastive Learning (RoCL), on multiple benchmark datasets, on which it obtains robust accuracy comparable to state-of-the-art supervised adversarial learning methods, and significantly improved robustness against black-box and unseen types of attacks. Moreover, with further joint fine-tuning with supervised adversarial loss, RoCL obtains even higher robust accuracy than using self-supervised learning alone. Notably, RoCL also demonstrates impressive results in robust transfer learning.

Adversarial Self-Supervised Contrastive Learning

MLAI2

Many practical graph problems, such as knowledge graph construction and drug-drug interaction prediction, require handling multi-relational graphs. However, handling real-world multi-relational graphs with Graph Neural Networks (GNNs) is often challenging due to their evolving nature, as new entities (nodes) can emerge over time. Moreover, newly emerged entities often have few links, which makes the learning even more difficult. Motivated by this challenge, we introduce a realistic problem of few-shot out-of-graph link prediction, where we not only predict the links between the seen and unseen nodes as in a conventional out-of-knowledge link prediction task but also between the unseen nodes, with only a few edges per node. We tackle this problem with a novel transductive meta-learning framework which we refer to as Graph Extrapolation Networks (GEN). GEN meta-learns both the node embedding network for inductive inference (seen-to-unseen) and the link prediction network for transductive inference (unseen-to-unseen). For transductive link prediction, we further propose a stochastic embedding layer to model uncertainty in the link prediction between unseen entities. We validate our model on multiple benchmark datasets for knowledge graph completion and drug-drug interaction prediction. The results show that our model significantly outperforms relevant baselines for out-of-graph link prediction tasks.

Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Pr...

MLAI2

We propose a method to automatically generate domain- and task-adaptive maskings of the given text for self-supervised pre-training, such that we can effectively adapt the language model to a particular target task (e.g. question answering). Specifically, we present a novel reinforcement learning-based framework which learns the masking policy, such that using the generated masks for further pre-training of the target language model helps improve task performance on unseen texts. We use off-policy actor-critic with entropy regularization and experience replay for reinforcement learning, and propose a Transformer-based policy network that can consider the relative importance of words in a given text. We validate our Neural Mask Generator (NMG) on several question answering and text classification datasets using BERT and DistilBERT as the language models, on which it outperforms rule-based masking strategies, by automatically learning optimal adaptive maskings.

Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Langu...

MLAI2

We propose a novel interactive learning framework which we refer to as Interactive Attention Learning (IAL), in which the human supervisors interactively manipulate the allocated attentions, to correct the model's behavior by updating the attention-generating network. However, such a model is prone to overfitting due to scarcity of human annotations, and requires costly retraining. Moreover, it is almost infeasible for the human annotators to examine attentions on tons of instances and features. We tackle these challenges by proposing a sample-efficient attention mechanism and a cost-effective reranking algorithm for instances and features. First, we propose Neural Attention Process (NAP), which is an attention generator that can update its behavior by incorporating new attention-level supervisions without any retraining. Secondly, we propose an algorithm which prioritizes the instances and the features by their negative impacts, such that the model can yield large improvements with minimal human feedback. We validate IAL on various time-series datasets from multiple domains (healthcare, real-estate, and computer vision) on which it significantly outperforms baselines with conventional attention mechanisms, or without cost-effective reranking, with substantially less retraining and human-model interaction cost.

Cost-effective Interactive Attention Learning with Neural Attention Process

MLAI2

Despite the remarkable performance of deep neural networks on various computer vision tasks, they are known to be susceptible to adversarial perturbations, which makes it challenging to deploy them in real-world safety-critical applications. In this paper, we conjecture that the leading cause of adversarial vulnerability is the distortion in the latent feature space, and provide methods to suppress it effectively. Explicitly, we define vulnerability for each latent feature and then propose a new loss for adversarial learning, Vulnerability Suppression (VS) loss, that aims to minimize the feature-level vulnerability during training. We further propose a Bayesian framework to prune features with high vulnerability to reduce both vulnerability and loss on adversarial samples. We validate our Adversarial Neural Pruning with Vulnerability Suppression (ANP-VS) method on multiple benchmark datasets, on which it not only obtains state-of-the-art adversarial robustness but also improves the performance on clean examples, using only a fraction of the parameters used by the full network. Further qualitative analysis suggests that the improvements come from the suppression of feature-level vulnerability.

Adversarial Neural Pruning with Latent Vulnerability Suppression

MLAI2

One of the most crucial challenges in question answering (QA) is the scarcity of labeled data, since it is costly to obtain question-answer pairs for a target text domain with human annotation. An alternative approach to tackle the problem is to use automatically generated QA pairs from either the problem context or from a large amount of unstructured text (e.g. Wikipedia). In this work, we propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts, while maximizing the mutual information between generated QA pairs to ensure their consistency. We validate our Information Maximizing Hierarchical Conditional Variational AutoEncoder (Info-HCVAE) on several benchmark datasets by evaluating the performance of the QA model (BERT-base) using only the generated QA pairs (QA-based evaluation) or by using both the generated and human-labeled pairs (semi-supervised learning) for training, against state-of-the-art baseline models. The results show that our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.

Generating Diverse and Consistent QA pairs from Contexts with Information-Max...

MLAI2

While tasks could come with varying numbers of instances and classes in realistic settings, the existing meta-learning approaches for few-shot classification assume that the number of instances per task and class is fixed. Due to this restriction, they learn to equally utilize the meta-knowledge across all the tasks, even when the number of instances per task and class largely varies. Moreover, they do not consider distributional difference in unseen tasks, on which the meta-knowledge may be less useful depending on the task relatedness. To overcome these limitations, we propose a novel meta-learning model that adaptively balances the effect of the meta-learning and task-specific learning within each task. Through the learning of the balancing variables, we can decide whether to obtain a solution by relying on the meta-knowledge or task-specific learning. We formulate this objective into a Bayesian inference framework and tackle it using variational inference. We validate our Bayesian Task-Adaptive Meta-Learning (Bayesian TAML) on two realistic task- and class-imbalanced datasets, on which it significantly outperforms existing meta-learning approaches. Further ablation study confirms the effectiveness of each balancing component and the Bayesian learning framework.

Learning to Balance: Bayesian Meta-Learning for Imbalanced and Out-of-distrib...

MLAI2

A machine learning model that generalizes well should obtain low errors on unseen test examples. Thus, if we know how to optimally perturb training examples to account for test examples, we may achieve better generalization performance. However, obtaining such a perturbation is not possible in standard machine learning frameworks as the distribution of the test data is unknown. To tackle this challenge, we propose a novel regularization method, meta-dropout, which learns to perturb the latent features of training examples for generalization in a meta-learning framework. Specifically, we meta-learn a noise generator which outputs a multiplicative noise distribution for latent features, to obtain low errors on the test instances in an input-dependent manner. Then, the learned noise generator can perturb the training examples of unseen tasks at the meta-test time for improved generalization. We validate our method on few-shot classification datasets, where the results show that it significantly improves the generalization performance of the base model, and largely outperforms existing regularization methods such as information bottleneck, manifold mixup, and information dropout.

Meta Dropout: Learning to Perturb Latent Features for Generalization

MLAI2

More from MLAI2 (20)