Real-life data seldom comes in the ideal form for statistical learning. This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used in recommender systems across discrete entities such as users and products, and to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data: databases with many products, or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing, and show how this encoding interpolates between one-hot encoding and techniques used in character-level natural language processing [2].
[1] A. Mensch, J. Mairal, B. Thirion, G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing 66(1):113-128.
[2] P. Cerda, G. Varoquaux, B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning (2018): 1-18.
Accelerating Pseudo-Marginal MCMC using Gaussian Processes - Matt Moores
The grouped independence Metropolis-Hastings (GIMH) and Markov chain within Metropolis (MCWM) algorithms are pseudo-marginal methods used to perform Bayesian inference in latent variable models. These methods replace intractable likelihood calculations with unbiased estimates within Markov chain Monte Carlo algorithms. The GIMH method has the posterior of interest as its limiting distribution, but suffers from poor mixing if it is too computationally intensive to obtain high-precision likelihood estimates. The MCWM algorithm has better mixing properties, but less theoretical support. In this paper we accelerate the GIMH method by using a Gaussian process (GP) approximation to the log-likelihood and train this GP using a short pilot run of the MCWM algorithm. Our new method, GP-GIMH, is illustrated on simulated data from a stochastic volatility and a gene network model. Our approach produces reasonable estimates of the univariate and bivariate posterior distributions, and the posterior correlation matrix in these examples with at least an order of magnitude improvement in computing time.
Faster Practical Block Compression for Rank/Select Dictionaries - Rakuten Group, Inc.
We present faster practical encoding and decoding procedures for block compression. Such encoding and decoding procedures are important to efficiently support rank/select queries on compressed bit vectors. This paper was presented at the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017) in Palermo, Italy.
Deep Convolutional GANs - meaning of latent space - Hansol Kang
DCGAN does not merely apply conv nets to GANs; it also finds meaning in the latent space.
Review of the DCGAN paper and a PyTorch-based implementation.
Review of issues raised in the VAE seminar.
my github: https://github.com/messy-snail/GAN_PyTorch
[References]
https://github.com/znxlwm/pytorch-MNIST-CelebA-GAN-DCGAN
https://github.com/taeoh-kim/Pytorch_DCGAN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
OPTEX MATHEMATICAL MODELING AND MANAGEMENT SYSTEM - Jesus Velasquez
OPTEX MATHEMATICAL MODELING AND MANAGEMENT SYSTEM is a META-FRAMEWORK for Mathematical Programming.
It is oriented towards the design, implementation, and setup of decision support systems based on mathematical programming, with special emphasis on the development of end-user apps:
- The algebraic formulation is independent of any programming language
- The models can be connected to any data server
- Apps can thereby be generated with multiple commercial or non-commercial technologies, according to clients' needs
GAN Explained Simply (What is this? Gum? It's GAN.) - Hansol Kang
Review of the original GAN paper and a PyTorch-based implementation.
Comparison of deep-learning development environments and languages.
[References]
Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
Wang, Su. "Generative Adversarial Networks (GAN): A Gentle Introduction."
Understanding Generative Adversarial Networks from a novice graduate student's perspective (https://jaejunyoo.blogspot.com/)
Mastering GANs (Generative Adversarial Networks) in one hour (https://www.slideshare.net/NaverEngineering/1-gangenerative-adversarial-network)
Framework comparison (https://deeplearning4j.org/kr/compare-dl4j-torch7-pylearn)
The 5 most suitable programming languages for AI development (http://www.itworld.co.kr/news/109189#csidxf9226c7578dd101b41d03bfedfec05e)
What is Git? And what is GitHub? (https://www.slideshare.net/ianychoi/git-github-46020592)
A git concept guide for svn power users (https://www.slideshare.net/einsub/svn-git-17386752)
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME - HONGJOO LEE
A 45-minute talk about collecting home-network performance measures, analyzing and forecasting time-series data, and building an anomaly detection system.
In this talk, we go through the whole process of data mining and knowledge discovery. First we write a script to run a speed test periodically and log the metric. Then we parse the log data, convert it into a time series, and visualize the data over a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep-learning techniques are used for the analysis: ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
LSGAN - SIMPle (Simple Idea Meaningful Performance Level up) - Hansol Kang
LSGAN uses an MSE loss instead of the original GAN loss to generate more realistic data.
Review of the LSGAN paper and a PyTorch-based implementation.
[References]
Mao, Xudong, et al. "Least squares generative adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
Multinomial Logistic Regression with Apache Spark - DB Tsai
Logistic regression can be used not only for modeling binary outcomes but also, with some extension, multinomial outcomes. In this talk, DB will walk through the basic idea of binary logistic regression step by step, and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications, from document classification to computational linguistics, are of this type. He will talk about how to address this problem with the L-BFGS optimizer instead of the Newton optimizer.
Bio:
DB Tsai is a machine learning engineer working at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
Speaker: Hwalsuk Lee (Naver Clova)
Date: November 2017
(Current) NAVER Clova Vision
(Current) TFKR organizer
Overview:
Recently, the center of gravity of deep-learning research has been shifting rapidly from supervised to unsupervised learning.
In computer vision in particular, the research trend is moving from recognition techniques, which are supervised and find the information present in an image, to generation techniques, which are unsupervised and generate images carrying specific information.
This seminar briefly reviews the working principles of VAEs (variational autoencoders) and GANs (generative adversarial networks), the two pillars of generation techniques, and shares results from the main related papers.
The lecture is organized so that, even without prior knowledge of deep learning, one can understand the concepts of VAE and GAN, the two methodologies for training generative models, and grasp the current level of the technology.
Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step-size variant of the Davis-Yin three-operator splitting, a method that can solve optimization problems composed of a sum of a smooth term, for which we have access to its gradient, and an arbitrary number of potentially non-smooth terms, for which we have access to their proximal operators. The proposed method leverages local information on the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step-size hyperparameter besides an initial estimate. We provide a convergence-rate analysis of this method, showing a sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non-adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step-size strategy.
A walk through the intersection between machine learning and mechanistic mode... - JuanPabloCarbajal3
Talk at EURECOM, France.
It overviews regression in several of its forms: regularized, constrained, and mixed. It builds the bridge between machine learning and dynamical models.
Distributed Coordinate Descent for Logistic Regression with Regularization - Ilya Trofimov
Logistic regression with L1 and L2 regularization is a widely used technique for solving classification and class-probability estimation problems. With the numbers of both features and examples growing rapidly in fields like text mining and clickstream data analysis, parallelization and the use of cluster architectures become important. We present a novel algorithm for fitting regularized logistic regression in a distributed environment. The algorithm splits data between nodes by features, uses coordinate descent on each node, and uses line search to merge results globally. A convergence proof is provided. A modification of the algorithm addresses the slow-node problem. We empirically compare our program with several state-of-the-art approaches that rely on different algorithmic and data-splitting methods. Experiments demonstrate that our approach is scalable and superior when training on large and sparse datasets.
----------------------------------------------------------
Machine Learning: Prospects and Applications
5-8 October 2015, Berlin, Germany
Typically quantifying uncertainty requires many evaluations of a computational model or simulator. If a simulator is computationally expensive and/or high-dimensional, working directly with a simulator often proves intractable. Surrogates of expensive simulators are popular and powerful tools for overcoming these challenges. I will give an overview of surrogate approaches from an applied math perspective and from a statistics perspective with the goal of setting the stage for the "other" community.
Basic concepts of deep learning, explaining its structure and the backpropagation method, and understanding autograd in PyTorch. (+ Data parallelism in PyTorch)
Dual-time Modeling and Forecasting in Consumer Banking (2016) - Aijun Zhang
Longitudinal and survival data are naturally observed with multiple origination dates. They form a dual-time data structure with horizontal axis representing the calendar time and the vertical axis representing the lifetime. In this talk we discuss how to model dual-time data based on a decomposition strategy and how to forecast over the time horizon. Various statistical techniques are used for treating fixed and random effects.
Among other fields, we share the potential applications in quantitative risk management, and demonstrate a large-scale credit risk analysis powered by big data computing.
Distributed solution of stochastic optimal control problem on GPUs - Pantelis Sopasakis
Stochastic optimal control problems arise in many applications and are, in principle, large-scale, involving up to millions of decision variables. Their applicability in control applications is often limited by the availability of algorithms that can solve them efficiently and within the sampling time of the controlled system. In this paper we propose a dual accelerated proximal gradient algorithm which is amenable to parallelization, and demonstrate that its GPU implementation affords high speed-up values (with respect to a CPU implementation) and greatly outperforms well-established commercial optimizers such as Gurobi.
We approach the screening problem - i.e. detecting which inputs of a computer model significantly impact the output - from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affects the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems, normally seen as unrelated, have challenging connections, since the priors proposed in the literature are specifically designed to have posterior modes on the boundary of the parameter space, hence precluding the application of approximate integration techniques based on e.g. Laplace approximations. We explore several ways of circumventing this difficulty, comparing different methodologies on synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
Simulators play a major role in analyzing multi-modal transportation networks. As their complexity increases, optimization becomes an increasingly challenging task. Current calibration procedures often rely on heuristics, rules of thumb, and sometimes brute-force search. Alternatively, we provide a statistical method which combines a distributed Gaussian-process Bayesian optimization method with dimensionality-reduction techniques and structural improvement. Our framework is sample-efficient and supported by theoretical analysis and an empirical study. We demonstrate it on the problem of calibrating a multi-modal transportation network of the city of Bloomington, Illinois. Finally, we discuss directions for further research.
Evaluating machine learning models and their diagnostic value - Gael Varoquaux
Model evaluation is, in my opinion, the most overlooked step of the machine-learning pipeline. Reliably estimating a model's performance for a given purpose is crucial and difficult. In this talk, I first discuss choosing a metric informative for the application, stressing the importance of class prevalence in classification settings. I then discuss procedures to estimate generalization performance, drawing a distinction between evaluating a learning procedure and evaluating a prediction rule, and discussing how to give confidence intervals on the performance estimates.
Measuring mental health with machine learning and brain imaging - Gael Varoquaux
The study of mental health relies vastly on behavior testing and questionnaires. I discuss how machine learning on large brain-imaging cohorts can open new avenues for markers of mental health. My claims are that the challenge lies in the amount of diagnosed conditions rather than in the heterogeneity of the conditions, and that we should turn to proxy labels. I discuss another fundamental challenge to this agenda: the external and construct validity of brain-imaging-based markers.
A tutorial on machine learning to build prediction models with missing values.
The slides cover both theoretical results (statistical learning) and practical advice, with a focus on implementation in Python with scikit-learn.
Dirty data science: machine learning on non-curated data - Gael Varoquaux
These slides are a one-hour course on machine learning with non-curated data.
According to industry surveys, the number-one hassle of data scientists is cleaning the data to analyze it. Here, I survey what kinds of "dirtiness" force time-consuming cleaning. We then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem leads us to revisit classic statistical results in the setting of supervised learning.
Representation learning in limited-data settings - Gael Varoquaux
A 4-hour didactic course on simple notions of representations and how to use them in limited-data settings:
- A supervised-learning point of view, giving intuitions and math on what representations are and why they matter
- Building simple unsupervised learning models to extract representations: from matrix decompositions for signals to embeddings of entities
- Evaluating models in limited-data settings, often a bottleneck
This slide deck was given as a course at the 2021 DeepLearn summer school.
Better neuroimaging data processing: driven by evidence, open communities, an... - Gael Varoquaux
My current thoughts about methods validity and design in brain imaging.
Data processing is a significant part of a neuroimaging study, and the choice of corresponding methods and tools is crucial. I will give an opinionated view on a path to building better data processing for neuroimaging, taking examples from endeavors that I contributed to: defining standards for functional-connectivity analysis, the nilearn neuroimaging tool, and the scikit-learn machine-learning toolbox, an industry standard with a million regular users. I will cover not only the technical process (statistics, signal processing, software engineering) but also the epistemology of methods development. Methods govern our results; they are more than a technical detail.
Functional-connectome biomarkers to meet clinical needs? - Gael Varoquaux
Extracting functional-connectome biomarkers with machine learning: a talk in the symposium on how current predictive connectivity models meet clinicians' needs.
This talk is a bit provocative: it first sets out a vision, before bringing a few technical suggestions.
Atlases of cognition with large-scale human brain mapping - Gael Varoquaux
Cognitive neuroscience uses neuroimaging to identify brain systems engaged in specific cognitive tasks. However, unequivocally linking brain systems with cognitive functions is difficult: each task probes only a small number of facets of cognition, while brain systems are often engaged in many tasks. We develop a new approach to generate a functional atlas of cognition, demonstrating brain systems selectively associated with specific cognitive functions. This approach relies upon an ontology that defines specific cognitive functions and the relations between them, along with an analysis scheme tailored to this ontology. Using a database of thirty neuroimaging studies, we show that this approach provides a highly specific atlas of mental functions, and that it can decode the mental processes engaged in new tasks.
Similarity encoding for learning on dirty categorical variables - Gael Varoquaux
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
Machine learning for functional connectomes - Gael Varoquaux
A tutorial on using machine-learning for functional-connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
Towards psychoinformatics with machine learning and brain imaging - Gael Varoquaux
Informatics in the psychological sciences brings fascinating challenges, as mental processes and pathologies have fuzzy definitions and are hard to quantify. Brain imaging brings rich data on the neural substrate of these concepts, yet the link is non-trivial.
The goal of this presentation is to put forward basic ideas of "psychoinformatics": using advanced processing of brain images to better quantify the elements of psychology.
It discusses how machine learning can bridge brain images to behavior: to describe better mental processes involved in brain activity, or to extract biomarkers of pathologies, individual traits, or cognition.
A tutorial on Machine Learning, with illustrations for MR imaging - Gael Varoquaux
Machine learning builds predictive models from data. It is massively used on medical images these days, for a variety of applications ranging from segmentation to diagnosis.
This is an introductory tutorial to machine learning, giving intuitions from the statistical point of view. It introduces the methodology, the concepts behind the central models, the validation framework, and some caveats to look out for.
It also discusses some applications to drawing conclusions from brain imaging, and uses these applications to highlight various technical aspects of running machine-learning models on high-dimensional data such as medical imaging.
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging - Gael Varoquaux
This talk describes our efforts to bring easily usable machine learning to brain mapping. It covers both the questions that machine learning can answer and two software packages developed to facilitate machine learning and its application to neuroimaging.
Computational practices for reproducible science - Gael Varoquaux
Reconciling bleeding-edge scientific results and reproducible research may seem a conundrum in our fast-paced high-pressure academic world. I discuss the practices that I found useful in computational work. At a high level, it is important to navigate the space between rapid experimentation and industrial-grade software development. I advocate adopting more and more software-engineering best practices as a project matures. I will also discuss how to turn the computational work into libraries, and to ensure the quality of the resulting libraries. And I conclude on how those libraries need to fit in the larger picture of the exercise of research to give better science.
Slides for my keynote at Scipy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving forward a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as doing Python libraries for science. What are the ingredients that the scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
Estimating Functional Connectomes: Sparsity's Strength and Limitations - Gael Varoquaux
Talk given at the OHBM 2017 education course.
I present the challenges and techniques of estimating meaningful brain functional connectomes from fMRI: why sparsity in the inverse covariance leads to models that can be interpreted as interactions between regions.
Then I discuss the limitations of sparse estimators and introduce shrinkage as an alternative. Finally, I discuss how to compare multiple functional connectomes.
Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turn around and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, that enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
Scientist meets web dev: how Python became the language of data - Gael Varoquaux
Python started as a scripting language, but now it is the new trend everywhere and in particular for data science, the latest rage of computing. It didn’t get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist.
In this talk I give a personal perspective on the progress of the scientific Python ecosystem, from numerical physics to data mining. What made Python suitable for science; Why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine; And where this richness might lead us.
The talk will discuss low-level and high-level technical aspects, such as how the Python world makes it easy to move large chunks of numbers across code. It will touch upon current technical details that make scikit-learn and joblib stand out.
Machine learning and cognitive neuroimaging: new tools can answer new questions - Gael Varoquaux
Machine learning is geared towards prediction. However, aside from diagnosis or prognosis in the clinic, cognitive neuroimaging strives to uncover insights from the data rather than to minimize prediction error. I review various inferences on brain function that have been drawn using pattern-recognition techniques, focusing on decoding. In particular, I discuss using generalization as a test for information, multivariate analysis to interpret overlapping activation patterns, and decoding for principled reverse inference. For each, I give a statistical view and a cognitive-imaging view.
Talk given at PRNI 2016 for the paper https://arxiv.org/pdf/1606.06439v1.pdf
Abstract: Spatially sparse predictors are good models for brain decoding: they give accurate predictions and their weight maps are interpretable, as they focus on a small number of regions. However, the state of the art, based on total variation or graph-net, is computationally costly. Here we introduce sparsity in the local neighborhood of each voxel with social sparsity, a structured shrinkage operator. We find that, on brain-imaging classification problems, social sparsity performs almost as well as total-variation models and better than graph-net, for a fraction of the computational cost. It also very clearly outlines predictive regions. We give details of the model and the algorithm.
2. Settings: Very high dimensionality
- signals (images, spectra)
- many entities (customers, products)
- non-standardized categories (typos, variants)
Exploit links & redundancy across features
4. 1 Factorizing huge matrices
with A. Mensch, J. Mairal, B. Thirion [Mensch... 2016, 2017]
[Figure: factorization of a samples × features matrix, Y = E · S + N]
Challenge: scalability
1 Intuitions  2 Experiments  3 Algorithms  4 Proof
5. 1 Real-world data: recommender systems
[Figure: sparse users × products matrix of product ratings, Y = E · S + N]
Product ratings: millions of entries; hundreds of thousands of products and users; a large sparse matrix
6. 1 Real-world data: brain imaging
Brain activity at rest: 1000 subjects with ∼ 100-10 000 samples each; images of dimensionality > 100 000
A dense matrix, large both ways
[Figure: time × voxels matrix, Y = E · S + N]
7. 1 Scalable solvers for matrix factorizations
Large matrices = terabytes of data
$\operatorname*{argmin}_{E,S}\;\|Y - E\,S^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(S)$
8. 1 Scalable solvers for matrix factorizations
Alternating minimization: data access, dictionary update, code computation
[Figure: data matrix, with samples seen at t, seen at t+1, and unseen at t]
Large matrices = terabytes of data
9. 1 Scalable solvers for matrix factorizations
Rewrite as an expectation [Mairal... 2010]:
$\operatorname*{argmin}_{E}\;\sum_i \min_s\;\|Y_i - E\,s^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(s) \;=\; \operatorname*{argmin}_{E}\;\mathbb{E}\,f(E)$
⇒ Optimize on approximations (sub-samples)
10. 1 Scalable solvers for matrix factorizations
Online matrix factorization: stream columns for the data access; then dictionary update and code computation
[Figure: data matrix, with samples seen at t, seen at t+1, and unseen at t]
11. 1 Scalable solvers for matrix factorizations
Online matrix factorization [Mairal... 2010]: 159 h run time on 2 terabytes of data; 12 h run time on 100 gigabytes of data
12. 1 Scalable solvers for matrix factorizations - SOMF
New subsampling algorithm: stream columns for the data access, subsample rows for the code computation
[Figure: data matrix, with samples seen at t, seen at t+1, and unseen at t]
Subsampled Online Matrix Factorization = SOMF
13. 1 Scalable solvers for matrix factorizations - SOMF
Online matrix factorization [Mairal... 2010]: 159 h run time on 2 terabytes of data; 12 h run time on 100 gigabytes of data
SOMF [Mensch... 2017]: 13 h run time on 1 terabyte of data: a ×10 speed-up
Subsampled Online Matrix Factorization = SOMF
14. 1 Experimental results: resting-state fMRI
[Figure: test objective value (×10⁵) vs. time (100 s to 24 h) on HCP (3.5 TB), comparing SGD (best step-size), online matrix factorization, and the proposed SOMF (r = 12)]
SOMF = Subsampled Online Matrix Factorization
15. 1 Experimental results: large images
[Figure: test objective value vs. time on four problems (ADHD, sparse dictionary, 2 GB; Aviris, NMF, 103 GB; Aviris, dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB), comparing OMF, SOMF with subsampling ratios r = 4, 6, 8, 12, 24, and best step-size SGD]
SOMF = Subsampled Online Matrix Factorization
16. 1 Experimental results: recommender system
SOMF = Subsampled Online Matrix Factorization
17. 1 Algorithm: online matrix factorization prior art
Stream samples $x_t$ [Mairal... 2010]:
1. Compute code:
$\alpha_t = \operatorname*{argmin}_{\alpha\in\mathbb{R}^k}\;\|x_t - D_{t-1}\,\alpha\|_2^2 + \lambda\,\Omega(\alpha)$
2. Update the surrogate function:
$g_t(D) = \frac{1}{t}\sum_{i=1}^{t}\|x_i - D\,\alpha_i\|_2^2 = \operatorname{trace}\big(\tfrac{1}{2}\,D^\top D\,A_t - D^\top B_t\big)$
$A_t = (1-\tfrac{1}{t})\,A_{t-1} + \tfrac{1}{t}\,\alpha_t\alpha_t^\top \qquad B_t = (1-\tfrac{1}{t})\,B_{t-1} + \tfrac{1}{t}\,x_t\alpha_t^\top$
3. Minimize surrogate:
$D_t = \operatorname*{argmin}_{D\in\mathcal{C}}\;g_t(D), \qquad \nabla g_t = D\,A_t - B_t$
18. 1 Algorithm: online matrix factorization prior art
(same steps as above)
$g_t(D)$ is a surrogate majorizing $\sum_x l(x, D)$: the stored $\alpha_i$ is used, and not $\alpha^\star$
⇒ Stochastic Majorization-Minimization
No nasty hyper-parameters
19. 1 Algorithm: online matrix factorization prior art
(same steps as above, with complexities)
1. Compute code: complexity depends on p
2. Update the surrogate function: O(p)
3. Minimize surrogate: O(p)
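To make steps 1-3 concrete, here is a minimal NumPy sketch of this online loop. It is my own illustration of the idea, not the authors' code: the ridge closed form stands in for a sparse-coding solver, and lam, the number of atoms k, and the unit-norm constraint set are assumptions.

import numpy as np

def online_mf(stream, p, k, lam=0.1, n_steps=1000, seed=0):
    # Online matrix factorization in the spirit of [Mairal... 2010]:
    # stream yields samples x_t in R^p; k is the number of dictionary atoms.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)       # project atoms onto the constraint set C
    A = np.zeros((k, k))                 # surrogate statistic A_t
    B = np.zeros((p, k))                 # surrogate statistic B_t
    for t, x in enumerate(stream, start=1):
        # 1. Compute code (ridge closed form instead of a sparse solver)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. Update the surrogate statistics as running averages
        w = 1.0 / t
        A = (1 - w) * A + w * np.outer(alpha, alpha)
        B = (1 - w) * B + w * np.outer(x, alpha)
        # 3. Minimize the surrogate by block coordinate descent over atoms,
        #    using the gradient grad g_t(D) = D A_t - B_t
        for j in range(k):
            if A[j, j] > 1e-12:
                D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))
        if t >= n_steps:
            break
    return D

For a data matrix Y of shape (n, p), D = online_mf(iter(Y), p=Y.shape[1], k=50) streams over the rows.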
20. 1 Sub-sample features
Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$; dimension: p → s
Use only $M_t x_t$ in the computations → complexity in O(s)
[Figure: p × n data matrix; columns are streamed, masked rows are ignored]
Modify all steps to work on s features: code computation, surrogate update, surrogate minimization
22. 1 Sub-sample features - variance reduction
Original online MF:
1. Code computation:
$\alpha_t = \operatorname*{argmin}_{\alpha\in\mathbb{R}^k}\;\|x_t - D_{t-1}\,\alpha\|_2^2 + \lambda\,\Omega(\alpha)$
2. Surrogate aggregation:
$A_t = \frac{1}{t}\sum_{i=1}^{t}\alpha_i\alpha_i^\top \qquad B_t = B_{t-1} + \frac{1}{t}\,(x_t\alpha_t^\top - B_{t-1})$
3. Surrogate minimization:
$D^j \leftarrow p^{\perp}_{\mathcal{C}_r^j}\!\big(D^j - \tfrac{1}{(A_t)_{j,j}}\,(D\,A_t^j - B_t^j)\big)$
Our algorithm:
1. Approximate code computation, masked:
$\beta_t^{(i)} \leftarrow (1-\gamma)\,\beta_{t-1}^{(i)} + \gamma\,D_{t-1}^\top M_t x^{(i)} \qquad G_t^{(i)} \leftarrow (1-\gamma)\,G_{t-1}^{(i)} + \gamma\,D_{t-1}^\top M_t D_{t-1}$
$\alpha_t \leftarrow \operatorname*{argmin}_{\alpha\in\mathbb{R}^k}\;\tfrac{1}{2}\,\alpha^\top G_t\,\alpha - \alpha^\top\beta_t + \lambda\,\Omega(\alpha)$
2. Surrogate aggregation, averaging:
$A_t = \tfrac{1}{w_t}\,\alpha_t\alpha_t^\top + (1-\tfrac{1}{w_t})\,A_{t-1}$
$P_t\bar B_t \leftarrow (1-w_t)\,P_t\bar B_{t-1} + w_t\,P_t x_t\alpha_t^\top \qquad P_t^{\perp}\bar B_t \leftarrow (1-w_t)\,P_t^{\perp}\bar B_{t-1} + w_t\,P_t^{\perp} x_t\alpha_t^\top$
3. Surrogate minimization:
$P_t D_t \leftarrow \operatorname*{argmin}_{D^r\in\mathcal{C}^r}\;\tfrac{1}{2}\operatorname{tr}(D^{r\top} D^r \bar A_t) - \operatorname{tr}(D^{r\top} P_t \bar B_t)$
[Figure: test objective function vs. time, comparing no subsampling, masked loss (a), and averaged estimators (c), at subsampling ratios r = 12 and r = 24]
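For intuition, here is a minimal sketch of the masked code-computation step with the averaged estimators above. It is my own illustration: the subsampling ratio, the fixed weight gamma, the unbiased rescaling, and the ridge penalty are simplifying assumptions, not the paper's exact estimators.

import numpy as np

def masked_code_step(x, D, G, beta, rng, s, gamma=0.9, lam=0.1):
    # One subsampled code computation: only s of the p features of x are
    # touched, so the cost is O(s) instead of O(p). G and beta are running
    # estimates of D^T D and D^T x, kept across iterations.
    p, k = D.shape
    mask = rng.choice(p, size=s, replace=False)   # rows kept by the mask M_t
    scale = p / s                                 # rescale the masked products
    D_s, x_s = D[mask], x[mask]
    G = (1 - gamma) * G + gamma * scale * (D_s.T @ D_s)
    beta = (1 - gamma) * beta + gamma * scale * (D_s.T @ x_s)
    # alpha = argmin_a 1/2 a^T G a - a^T beta + lam * Omega(a)  (ridge form)
    alpha = np.linalg.solve(G + lam * np.eye(k), beta)
    return alpha, G, beta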
23. 1 Why does it work?
Objective: $D^\star = \operatorname*{argmin}_{D\in\mathcal{C}}\;\sum_x l(x, D)$ where $l(x, D) = \min_\alpha f(x, D, \alpha)$
Algorithm (online matrix factorization): $g_t(D)$ majorizes $\sum_x l(x, D)$; the stored $\alpha_i$ is used, and not $\alpha^\star$
⇒ Stochastic Majorization-Minimization [Mairal 2013]
24. 1 Why does it work?
(objective and algorithm as above)
[Diagram: SMM = surrogate computation followed by full minimization]
25. 1 Stochastic Approximate Majorization-Minimization
(objective and algorithm as above)
[Diagram: SMM = surrogate computation + full minimization; SAMM = surrogate approximation + partial minimization]
26. [Figure: samples × features matrix, Y = E · S + N]
Massive matrix factorization via subsampling:
- Subsampling features ⇒ doubly stochastic
- 10× speed-ups on a fast algorithm
- Analysis via stochastic approximate majorization-minimization
- Conclusive on various high-dimensional problems
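As a practical note, the non-subsampled online algorithm of [Mairal... 2010] ships in scikit-learn, while the subsampled variant is available in the modl package cited at the end of the deck. A minimal usage sketch on toy data (the parameters are arbitrary):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

Y = np.random.default_rng(0).standard_normal((500, 64))   # toy data matrix
mbdl = MiniBatchDictionaryLearning(n_components=10, alpha=0.1,
                                   batch_size=32, random_state=0)
E = mbdl.fit_transform(Y)     # loadings, shape (500, 10)
S = mbdl.components_          # dictionary, shape (10, 64)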
27. [Figure: samples × features matrix, Y = E · S + N]
2 Encoding with similarities
with P. Cerda and B. Kégl [Cerda... 2018]
When categories create a huge dimensionality
28. 2 Encoding with similarities
Machine learning: let $X \in \mathbb{R}^{n\times p}$
The real world:
Gender | Date Hired | Employee Position Title
M | 09/12/1988 | Master Police Officer
F | 11/19/1989 | Social Worker IV
M | 07/16/2007 | Police Officer III
F | 02/05/2007 | Police Aide
M | 01/13/2014 | Electrician I
F | 06/26/2006 | Social Worker III
F | 01/26/2000 | Library Assistant I
M | 11/22/2010 | Library Assistant I
29. 2 Encoding with similarities
(same table as above)
A data cleaning problem? A feature engineering problem?
A problem of representations in high dimension
30. 2 The problem of “dirty categories”
Non-curated categorical entries (Employee Position Title): Master Police Officer; Social Worker IV; Police Officer III; Police Aide; Electrician I; Bus Operator; Bus Operator; Social Worker III; Library Assistant I; Library Assistant I
Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”...
High cardinality: 400 unique entries in 10 000 rows
Rare categories: only 1 “Architect III”
New categories in the test set
31. 2 Dirty categories in the wild
Employee Salaries: salary information for employees of Montgomery County, Maryland.
Employee Position Title: Master Police Officer, Social Worker IV, ...
32. 2 Dirty categories in the wild
Open Payments: payments by health care companies to medical doctors or hospitals.
Company name | Frequency
Pfizer Inc. | 79,073
Pfizer Pharmaceuticals LLC | 486
Pfizer International LLC | 425
Pfizer Limited | 13
Pfizer Corporation Hong Kong Limited | 4
Pfizer Pharmaceuticals Korea Limited | 3
...
33. 2 Dirty categories in the wild
Medical charges: patient discharges (utilization, payment, and hospital-specific charges) across 3 000 US hospitals.
...
Nothing on the UCI machine-learning data repository
34. 2 Dirty categories in the wild
Cardinality slowly increases with the number of rows
[Figure: number of categories (100 to 10 000) vs. number of rows (100 to 1M) for beer reviews, road safety, traffic violations, midwest survey, open payments, employee salaries, and medical charges; reference curves 100, √n, and 5 log₂(n)]
This creates a high-dimensional learning problem
35. 2 Dirty categories in the wild
Our goal: a statistical view of supervised learning on dirty categories
The statistical question should inform curation: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
36. 2 Related work: database cleaning
Recognizing / merging entities:
- Record linkage: matching across different (clean) tables
- Deduplication / fuzzy matching: matching within one dirty table
Techniques [Fellegi and Sunter 1969]: supervised learning (known matches); clustering; Expectation Maximization to learn a metric
Outputs a “clean” database
37. 2 Related work: natural language processing
Stemming / normalization: a set of (handcrafted) rules; needs to be adapted to new languages / new domains
38. 2 Related work: natural language processing
Stemming / normalization (as above)
Semantics: relating different discrete objects
- Formal semantics (entity resolution in knowledge bases)
- Distributional semantics: “a word is characterized by the company it keeps”
39. 2 Related work: natural language processing
Stemming / normalization, semantics (as above)
Character-level NLP:
- for entity resolution [Klein... 2003]
- for semantics [Bojanowski... 2017]
- “London” & “Londres” may carry different information
40. 2 Similarity encoding: a simple solution
Adding similarities to one-hot encoding:
1. One-hot encoding maps categories to vector spaces
2. String similarities capture information
41. 2 Similarity encoding: a simple solution
One-hot encoding:
        | London | Londres | Paris
Londres |   0    |    1    |   0
London  |   1    |    0    |   0
Paris   |   0    |    0    |   1
$X \in \mathbb{R}^{n\times p}$: p grows fast; new categories? link categories?
42. 2 Similarity encoding: a simple solution
One-hot encoding (as above): p grows fast; new categories? link categories?
Similarity encoding:
        | London | Londres | Paris
Londres |  0.3   |   1.0   |  0.0
London  |  1.0   |   0.3   |  0.0
Paris   |  0.0   |   0.0   |  1.0
with entries given by a string similarity, e.g. sim(Londres, London) = 0.3
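A minimal sketch of similarity encoding (my own illustration, not the paper's reference implementation): each string is represented by its vector of similarities to the categories seen at train time, here with a 3-gram similarity. With the (assumed) whitespace-padding convention below, it reproduces the 0.3 of the table above.

import numpy as np

def ngrams(s, n=3):
    s = ' ' + s.lower() + ' '               # pad to capture word boundaries
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B)          # common n-grams / total n-grams

def similarity_encode(values, vocabulary):
    # Each string becomes its vector of similarities to the train categories.
    return np.array([[ngram_similarity(v, ref) for ref in vocabulary]
                     for v in values])

vocab = ['London', 'Londres', 'Paris']
print(similarity_encode(['Londres', 'London', 'Paris'], vocab).round(1))
# [[0.3 1.  0. ]
#  [1.  0.3 0. ]
#  [0.  0.  1. ]]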
43. 2 Some string similarities
Levenshtein: the number of edit operations on one string to match the other
Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m-t}{3m}$
where m is the number of matching characters and t the number of character transpositions
n-gram similarity: an n-gram is a group of n consecutive characters;
$\mathrm{similarity} = \frac{\#\,\text{n-grams in common}}{\#\,\text{n-grams in total}}$
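For completeness, a standard dynamic-programming sketch of the Levenshtein distance, together with a ratio in [0, 1]. The ratio definition here is an assumption; libraries normalize differently.

def levenshtein(a, b):
    # Number of single-character edits (insertions, deletions, substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(a, b):
    # A similarity in [0, 1] derived from the edit distance
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

print(levenshtein('Londres', 'London'))        # 3
print(levenshtein_ratio('Londres', 'London'))  # 0.571...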
44. 2 Empirical study
Datasets with dirty categories:
Dataset            | # of rows | # of categories | Count of least frequent category | Prediction type
medical charges    | 160k | 100  | 613 | regression
employee salaries  | 9.2k | 385  | 1   | regression
open payments      | 100k | 973  | 1   | binary clf
midwest survey     | 2.8k | 1009 | 1   | multiclass clf
traffic violations | 100k | 3043 | 1   | multiclass clf
road safety        | 10k  | 4617 | 1   | binary clf
beer reviews       | 10k  | 4634 | 1   | multiclass clf
7 datasets! All open
Experimental paradigm: cross-validation & measure prediction. Stupid simple.
45.–49. 2 Experiments: gradient boosted trees
[Figure: prediction scores per dataset (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, beer reviews) for similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler), target encoding, one-hot encoding, and hash encoding]
Average ranking across datasets, in legend order: 1.6, 2.4, 2.9, 3.7, 4.6, 5.9
50. 2 Experiments: ridge
[Figure: the same comparison with a ridge model]
Average ranking across datasets, in legend order: 1.0, 2.9, 3.1, 4.4, 3.6, 6.0
Similarity encoding, with 3-gram similarity, ranks first
52. 2 This is just a string similarity?
What similarity is defined by our encoding? (a kernel)
$\langle s_i, s_j\rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}(s_i, s^{(l)})\;\mathrm{sim}(s_j, s^{(l)})$
(a sum over the reference categories $s^{(l)}$)
The categories in the train set shape the similarity
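The induced similarity is simply an inner product between similarity-encoded vectors. A small self-contained check of the kernel formula above, reusing the 3-gram similarity from the earlier sketch:

import numpy as np

def ngrams(s, n=3):
    s = ' ' + s.lower() + ' '
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def sim(a, b):
    A, B = ngrams(a), ngrams(b)
    return len(A & B) / len(A | B)

def encoded_inner_product(si, sj, vocabulary):
    # <s_i, s_j>_sim = sum_l sim(s_i, s^(l)) * sim(s_j, s^(l))
    vi = np.array([sim(si, ref) for ref in vocabulary])
    vj = np.array([sim(sj, ref) for ref in vocabulary])
    return float(vi @ vj)

print(encoded_inner_product('London', 'Londres', ['London', 'Londres', 'Paris']))  # 0.6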
53. 2 This is just a string similarity?
(kernel as above)
[Figure: prediction scores per dataset, adding bag of 3-grams and MDV to the previous comparison; average rankings across datasets: 1.1, 3.1, 3.4, 4.1, 5.3, 6.4, 4.7, 7.3]
Similarity encoding >>> a feature map capturing string similarities
54. 2 Reducing the dimensionality
$X \in \mathbb{R}^{n\times p}$, but p is large:
statistical problems, computational problems
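One of the reduction strategies compared below, random projections, in a minimal scikit-learn sketch; the matrix and the target dimension d = 300 are arbitrary stand-ins:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X_sim = np.random.default_rng(0).random((10000, 4000))  # stand-in for an encoded matrix
proj = GaussianRandomProjection(n_components=300, random_state=0)
X_red = proj.fit_transform(X_sim)    # shape (10000, 300) instead of (10000, 4000)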
55.–59. 2 Reducing the dimensionality
[Figure: prediction scores per dataset (employee salaries, k = 355; open payments, k = 910; midwest survey, k = 644; traffic violations, k = 2588; road safety, k = 3988; beer reviews, k = 4015, where k is the cardinality of the categorical variable), comparing one-hot encoding and 3-gram similarity encoding, each either full or reduced to d = 30, 100, 300 via random projections, most frequent categories, k-means, or deduplication with k-means; average rankings across datasets shown in the figure]
Also compared: factorizing the one-hot encoding with Multiple Correspondence Analysis; hashing n-grams (for speed and collisions)
61. @GaelVaroquaux
Representations in high dimension: factorizations and similarities, for signals, entities, categories
Factorizations: costly in large-p, large-n; sub-sampling p gives huge speed-ups; analysis via stochastic approximate majorization-minimization
https://github.com/arthurmensch/modl
62. @GaelVaroquaux
Representations in high dimension: factorizations and similarities, for signals, entities, categories
Factorizations: https://github.com/arthurmensch/modl
Similarity encoding for categories: no separate deduplication / cleaning step; creates a category-aware metric space
https://dirty-cat.github.io
DirtyData project (hiring)
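A usage sketch of similarity encoding through the dirty-cat package linked above, assuming the SimilarityEncoder transformer as documented there at the time of the talk (scikit-learn transformer API, n-gram similarity by default):

import numpy as np
from dirty_cat import SimilarityEncoder   # see https://dirty-cat.github.io

titles = np.array([['Master Police Officer'],
                   ['Police Officer III'],
                   ['Social Worker IV']])
enc = SimilarityEncoder()        # assumed default: 3-gram string similarity
X = enc.fit_transform(titles)    # one column per category seen during fit
print(X.round(2))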
63. References I
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146, 2017.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183, 1969.
D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 180–183. Association for Computational Linguistics, 2003.
J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013.
64. References II
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19, 2010.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In ICML, 2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128, 2017.