A 4-hour long course given at the Deep learning 2019 summer school.
An updated version of this slide deck can be found here:
https://www.slideshare.net/GaelVaroquaux/representation-learning-in-limiteddata-settings-250095542
The topic is how to learn representations for machine learning when data is limited, for instance when the number of samples is not large compared to the dimensionality of the problem, or when a lot of noise makes learning difficult. This course bridges deep learning and more classic "shallow" learning techniques that work well in limited-data settings, with some theory and some practical recommendations.
1. Representations for machine learning: some learning theory results, some reflections on representations, and some simple models that extract representations.
2. Matrix factorizations: covering the wide spectrum from PCA to word2vec via dictionary learning and metric learning
3. Fisher kernels: building representations from likelihood models (slightly more academic)
Evaluating machine learning models and their diagnostic value (Gael Varoquaux)
Model evaluation is, in my opinion, the most overlooked step of the machine-learning pipeline. Reliably estimating a model's performance for a given purpose is crucial and difficult. In this talk, I first discuss choosing metrics that are informative for the application, stressing the importance of class prevalence in classification settings. I then discuss procedures to estimate generalization performance, drawing a distinction between evaluating a learning procedure and evaluating a prediction rule, and discussing how to give confidence intervals for the performance estimates.
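To make the point about class prevalence concrete, here is a minimal sketch (not taken from the slides) that applies Bayes' rule to a hypothetical classifier with 90% sensitivity and 90% specificity. The function name and the numbers are illustrative assumptions.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive prediction is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: same sensitivity and specificity in every setting,
# only the prevalence of the positive class changes.
for prevalence in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

With the classifier held fixed, the share of positive predictions that are correct drops sharply as the positive class becomes rarer, which is why a metric should be chosen with the target population in mind.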
Measuring mental health with machine learning and brain imaging (Gael Varoquaux)
The study of mental health relies heavily on behavioral testing and questionnaires. I discuss how machine learning on large brain-imaging cohorts can open new avenues for markers of mental health. I claim that the main challenge is the limited amount of diagnosed conditions rather than the heterogeneity of the conditions, and that we should turn to proxy labels. I discuss another fundamental challenge to this agenda: the external and construct validity of brain-imaging-based markers.
A tutorial on machine learning to build prediction models with missing values.
The slides cover both theoretical results (statistical learning) and practical advice, with a focus on implementation in Python with scikit-learn
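As an illustration of the kind of practical recipe such a tutorial covers, the sketch below imputes each column with its mean and appends a missingness indicator, in plain Python so that it has no dependency. It is an assumption-laden toy, not the slides' code: the helper name `impute_with_indicator` is hypothetical, and each column is assumed to contain at least one observed value.

```python
import math


def impute_with_indicator(rows):
    """Mean-impute each column and append one 0/1 missingness flag per column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        # Assumes at least one observed (non-NaN) value per column.
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed))
    completed = []
    for r in rows:
        values = [means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        flags = [1.0 if math.isnan(v) else 0.0 for v in r]
        completed.append(values + flags)
    return completed


nan = float("nan")
X = [[1.0, 10.0], [nan, 20.0], [3.0, nan]]
X_completed = impute_with_indicator(X)
for row in X_completed:
    print(row)
```

In a real pipeline the column means would be computed on the training set only and reused at prediction time. scikit-learn's `SimpleImputer` with `add_indicator=True` offers comparable behavior.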
Dirty data science: machine learning on non-curated data (Gael Varoquaux)
These slides are a one-hour course on machine learning with non-curated data.
According to industry surveys, the number one hassle of data scientists is cleaning the data to analyze it. Here, I survey which kinds of "dirtiness" force time-consuming cleaning. We will then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem will lead us to revisit classic statistical results in the setting of supervised learning.
Representation learning in limited-data settings (Gael Varoquaux)
A 4-hour didactic course on simple notions of representations and how to use them in limited-data settings:
- A supervised learning point of view, giving intuitions and math on what representations are and why they matter
- Building simple unsupervised learning models to extract representations: from matrix decomposition for signals to embeddings of entities
- Evaluating models in limited-data settings, often a bottleneck
This slide-deck was given as a course at the 2021 DeepLearn summer school.
Better neuroimaging data processing: driven by evidence, open communities, an... (Gael Varoquaux)
My current thoughts about methods validity and design in brain imaging.
Data processing is a significant part of a neuroimaging study. The choice of corresponding methods and tools is crucial. I will give an opinionated view on a path to building better data processing for neuroimaging. I will take examples from endeavors that I contributed to: defining standards for functional-connectivity analysis, the nilearn neuroimaging tool, and the scikit-learn machine-learning toolbox (an industry standard with a million regular users). I will cover not only the technical process (statistics, signal processing, software engineering) but also the epistemology of methods development. Methods govern our results; they are more than a technical detail.
Functional-connectome biomarkers to meet clinical needs? (Gael Varoquaux)
Extracting Functional-Connectome Biomarkers with Machine Learning: a talk in the symposium on how current predictive connectivity models meet clinicians' needs.
This talk is a bit provocative: it first sets out a vision, then offers a few technical suggestions.
Atlases of cognition with large-scale human brain mapping (Gael Varoquaux)
Cognitive neuroscience uses neuroimaging to identify brain systems engaged in specific cognitive tasks. However, unequivocally linking brain systems with cognitive functions is difficult: each task probes only a small number of facets of cognition, while brain systems are often engaged in many tasks. We develop a new approach to generate a functional atlas of cognition, demonstrating brain systems selectively associated with specific cognitive functions. This approach relies upon an ontology that defines specific cognitive functions and the relations between them, along with an analysis scheme tailored to this ontology. Using a database of thirty neuroimaging studies, we show that this approach provides a highly specific atlas of mental functions, and that it can decode the mental processes engaged in new tasks.
Similarity encoding for learning on dirty categorical variables (Gael Varoquaux)
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately into feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
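A minimal sketch of the idea, under stated assumptions: strings are compared through the Jaccard overlap of their character 3-gram sets, and each entry is encoded by its similarity to a short list of prototype categories. The function names, the space padding, and the example job titles are illustrative choices, not the paper's implementation.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string, padded with one space on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def similarity_encode(value, prototypes, n=3):
    """Encode one string as its similarities to a list of prototype categories."""
    return [ngram_similarity(value, p, n) for p in prototypes]


# Hypothetical prototype categories and a misspelled entry.
prototypes = ["police officer", "fire fighter", "teacher"]
exact = similarity_encode("police officer", prototypes)
typo = similarity_encode("police oficer", prototypes)
print(exact)
print(typo)
```

One-hot encoding would map the misspelled entry to an unrelated new category. Here it stays close to its intended prototype while remaining far from the others.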
Machine learning for functional connectomes (Gael Varoquaux)
A tutorial on using machine learning for functional connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
Towards psychoinformatics with machine learning and brain imaging (Gael Varoquaux)
Informatics in the psychological sciences brings fascinating challenges, as mental processes or pathologies have fuzzy definitions and are hard to quantify. Brain imaging brings rich data on the neural substrate of these concepts, yet the link is non-trivial.
The goal of this presentation is to put forward basic ideas of "psychoinformatics", using advanced processing on brain images to better quantify the elements of psychology.
It discusses how machine learning can bridge brain images to behavior: to better describe mental processes involved in brain activity, or to extract biomarkers of pathologies, individual traits, or cognition.
Simple representations for learning: factorizations and similarities (Gael Varoquaux)
Real-life data seldom comes in the ideal form for statistical learning. This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data, such as databases with many products or high-resolution images. I will present an algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing [2].
[1] Stochastic subsampling for factorizing huge matrices. A Mensch, J Mairal, B Thirion, G Varoquaux. IEEE Transactions on Signal Processing 66 (1), 113-128
[2] Similarity encoding for learning with dirty categorical variables. P Cerda, G Varoquaux, B Kégl. Machine Learning (2018): 1-18
A tutorial on Machine Learning, with illustrations for MR imaging (Gael Varoquaux)
Machine learning builds predictive models from the data. It is massively used on medical images these days, for a variety of applications ranging from segmentation to diagnosis.
This is an introductory tutorial on machine learning, giving intuitions from the statistical point of view. It introduces the methodology, the concepts behind the central models, the validation framework, and some caveats to look for.
It also discusses some applications to drawing conclusions from brain imaging, and uses these applications to highlight various technical aspects of running machine learning models on high-dimensional data such as medical imaging.
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging (Gael Varoquaux)
This talk describes our efforts to bring easily usable machine learning to brain mapping. It covers both the questions that machine learning can answer and two software packages developed to facilitate machine learning and its application to neuroimaging.
Computational practices for reproducible science (Gael Varoquaux)
Reconciling bleeding-edge scientific results and reproducible research may seem a conundrum in our fast-paced, high-pressure academic world. I discuss the practices that I found useful in computational work. At a high level, it is important to navigate the space between rapid experimentation and industrial-grade software development. I advocate adopting more and more software-engineering best practices as a project matures. I will also discuss how to turn the computational work into libraries, and how to ensure the quality of the resulting libraries. I conclude with how those libraries need to fit in the larger picture of the exercise of research to give better science.
Slides for my keynote at SciPy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving forward a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as building Python libraries for science. What are the ingredients that scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
Estimating Functional Connectomes: Sparsity’s Strength and Limitations (Gael Varoquaux)
Talk given at the OHBM 2017 education course.
I present the challenges and techniques for estimating meaningful brain functional connectomes from fMRI: why sparsity in the inverse covariance leads to models that can be interpreted as interactions between regions.
Then I discuss the limitations of sparse estimators and introduce shrinkage as an alternative. Finally, I discuss how to compare multiple functional connectomes.
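As a rough illustration of shrinkage, the sketch below blends an empirical covariance matrix with a scaled identity. This specific form (shrinking toward the identity scaled by the mean variance) is an assumption chosen for simplicity. It matches the formula behind scikit-learn's `ShrunkCovariance`, and the talk may use a different shrinkage target. The matrix values are made up.

```python
def shrink_covariance(cov, alpha):
    """Convex combination of a covariance matrix and a scaled identity.

    shrunk = (1 - alpha) * cov + alpha * mu * I, with mu = trace(cov) / p.
    """
    p = len(cov)
    mu = sum(cov[i][i] for i in range(p)) / p
    return [
        [(1.0 - alpha) * cov[i][j] + (alpha * mu if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]


# Hypothetical empirical covariance of two strongly correlated regions.
empirical = [[2.0, 1.8], [1.8, 2.0]]
shrunk = shrink_covariance(empirical, alpha=0.5)
for row in shrunk:
    print(row)
```

Shrinkage leaves the total variance unchanged but damps the off-diagonal terms, which makes the matrix better conditioned before it is inverted to read off interactions between regions.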
Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turnaround and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, which enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
Scientist meets web dev: how Python became the language of data (Gael Varoquaux)
Python started as a scripting language, but now it is the new trend everywhere and in particular for data science, the latest rage of computing. It didn’t get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist.
In this talk I give a personal perspective on the progress of the scientific Python ecosystem, from numerical physics to data mining: what made Python suitable for science, why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine, and where this richness might lead us.
The talk will discuss low-level and high-level technical aspects, such as how the Python world makes it easy to move large chunks of numbers across code. It will touch upon current technical details that make scikit-learn and joblib stand out.
Machine learning and cognitive neuroimaging: new tools can answer new questions (Gael Varoquaux)
Machine learning is geared towards prediction. However, aside from diagnosis or prognosis in the clinic, cognitive neuroimaging strives to uncover insights from the data, rather than to minimize prediction error. I review various inferences on brain function that have been drawn using pattern recognition techniques, focusing on decoding. In particular, I discuss using generalization as a test for information, multivariate analysis to interpret overlapping activation patterns, and decoding for principled reverse inference. Each time, I give a statistical view and a cognitive imaging view.
Talk given at PRNI 2016 for the paper https://arxiv.org/pdf/1606.06439v1.pdf
Abstract: Spatially-sparse predictors are good models for brain decoding: they give accurate predictions and their weight maps are interpretable as they focus on a small number of regions. However, the state of the art, based on total variation or graph-net, is computationally costly. Here we introduce sparsity in the local neighborhood of each voxel with social-sparsity, a structured shrinkage operator. We find that, on brain imaging classification problems, social-sparsity performs almost as well as total-variation models and better than graph-net, for a fraction of the computational cost. It also very clearly outlines predictive regions. We give details of the model and the algorithm.
Personal point of view on scikit-learn: past, present, and future.
This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.
Inter-site autism biomarkers from resting state fMRI (Gael Varoquaux)
We present an automated pipeline to learn predictive biomarkers from resting-state fMRI. We apply it to classifying autism on unseen sites, demonstrating the feasibility of biomarkers on weakly standardized functional imaging data.
We study which steps of the pipeline matter for prediction and show that 1) the choice of atlas is the most important one; ideally, the atlas should be made of functional regions learned from the data; 2) the "tangent space" parametrization of the connectivity is the best performer.
We conclude with general recommendations for predictive biomarkers from resting-state fMRI.
Brain maps from machine learning? Spatial regularizations (Gael Varoquaux)
Pattern Recognition for NeuroImaging (PR4NI)
We will show empirically how commonly used pattern recognition techniques, such as SVMs, provide low-quality brain maps, even though they give very good prediction accuracy. We will give an overview of recently developed techniques to impose priors on patterns that are particularly well suited to neuroimaging: selecting a small number of spatially-structured predictive brain regions. These tools reconcile machine learning with brain mapping by giving maps that are more useful for drawing neuroscientific conclusions. In addition, they are more robust to cross-individual spatial variability and thus generalize well across subjects.
Scikit-learn for easy machine learning: the vision, the tool, and the project (Gael Varoquaux)
Scikit-learn is a popular machine learning tool. What can it do for you? Why would you want to use it? What can you do with it? Where is it going? In this talk, I will discuss why and how scikit-learn became popular. I will argue that it is successful because of its vision: it fills an important slot in the rich ecosystem of data science. I will demonstrate how scikit-learn makes predictive analysis easy and yet versatile. I will shed some light on our development process: how do we, as a community, ensure the quality and the growth of scikit-learn?
Succeeding in academia despite doing good software (Gael Varoquaux)
Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Keynote I gave at the Scipyconf Argentina 2014 conference
The advancement of science is a noble cause, and academia a fierce battlefield for tenure. Software is seen as a mere technicality, not worth a line on an academic CV. I claim that, on the contrary, software is the new medium of the scientific method. I claim that succeeding in academia can be achieved not despite writing good software but via such an accomplishment. The key is to choose the right battles and to win them.
What is the emerging role of software in the scientific workflow? Which are the software challenges that can have impact? How to balance software quality assurance and the quick turn-around random-walk of research? What does "good design" mean for research software? What Python patterns can boost productivity and reuse in exploratory scientific computing?
I will try to answer these questions, based on my personal experience of growing up to become an academic Pythonista.
Building a cutting-edge data processing environment on a budget (Gael Varoquaux)
As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
Similarity encoding for learning on dirty categorical variablesGael Varoquaux
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
Machine learning for functional connectomesGael Varoquaux
A tutorial on using machine-learning for functional-connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
Towards psychoinformatics with machine learning and brain imagingGael Varoquaux
Informatics in the psychological sciences brings fascinating challenges as mental processes or pathologies have fuzzy definition and are hard to quantify. Brain imaging brings rich data on the neural substrate of these concepts, yet it is a non trivial link.
The goal of this presentation is to put forward basic ideas of "psychoinformatics", using advanced processing on brain images to quantify better the elements of psychology.
It discusses how machine learning can bridge brain images to behavior: to describe better mental processes involved in brain activity, or to extract biomarkers of pathologies, individual traits, or cognition.
Simple representations for learning: factorizations and similarities Gael Varoquaux
Real-life data seldom comes in the ideal form for statistical learning.
This talk focuses on high-dimensional problems for signals and
discrete entities: when dealing with many, correlated, signals or
entities, it is useful to extract representations that capture these
correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However they entail large computing costs on very high-dimensional data, databases with many products or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] Stochastic subsampling for factorizing huge matrices, A Mensch, J Mairal, B Thirion, G Varoquaux, IEEE Transactions on Signal Processing 66 (1), 113-128
[2] Similarity encoding for learning with dirty categorical variables. P Cerda, G Varoquaux, B Kégl Machine Learning (2018): 1-18
A tutorial on Machine Learning, with illustrations for MR imagingGael Varoquaux
Machine learning builds predictive models from the data. It is massive used on medical images these days, for a variety of applications ranging from segmentation to diagnosis.
This is an introductory tutorial to machine learning from giving intuitions on the statistical point of view. It introduce the methodology, the concepts behind the central models, the validation framework and some caveats to look for.
It also discusses some applications to drawing conclusions from brain imaging, and use these applications to highlight various technical aspects to running machine learning models on high-dimensional data such as medical imaging.
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingGael Varoquaux
This talk describe our efforts to bring easily usable machine learning to brain mapping. It covers both questions that machine learning can answer as well as two softwares developed to facilitate machine learning and it's application to neuroimaging.
Computational practices for reproducible scienceGael Varoquaux
Reconciling bleeding-edge scientific results and reproducible research may seem a conundrum in our fast-paced high-pressure academic world. I discuss the practices that I found useful in computational work. At a high level, it is important to navigate the space between rapid experimentation and industrial-grade software development. I advocate adopting more and more software-engineering best practices as a project matures. I will also discuss how to turn the computational work into libraries, and to ensure the quality of the resulting libraries. And I conclude on how those libraries need to fit in the larger picture of the exercise of research to give better science.
Slides for my keynote at Scipy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving forward a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as doing Python libraries for science. What are the ingredients that the scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
Talk given at the OHBM 2017 education course.
I present the challenges and techniques for estimating meaningful brain functional connectomes from fMRI: why sparsity in the inverse covariance leads to models that can be interpreted as interactions between regions.
Then I discuss the limitations of sparse estimators and introduce shrinkage as an alternative. Finally, I discuss how to compare multiple functional connectomes.
Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turn around and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, that enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
Scientist meets web dev: how Python became the language of dataGael Varoquaux
Python started as a scripting language, but now it is the new trend everywhere and in particular for data science, the latest rage of computing. It didn’t get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist.
In this talk I give a personal perspective on the progress of the scientific Python ecosystem, from numerical physics to data mining. What made Python suitable for science; Why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine; And where this richness might lead us.
The talk will discuss low-level and high-level technical aspects, such as how the Python world makes it easy to move large chunks of numbers across code. It will touch upon current technical details that make scikit-learn and joblib stand.
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
Machine learning is geared towards prediction. However, aside from diagnosis or prognosis in the clinic, cognitive neuroimaging strives to uncover insights from the data, rather than to minimize prediction error. I review various inferences on brain function that have been drawn using pattern recognition techniques, focusing on decoding. In particular, I discuss using generalization as a test for information, multivariate analysis to interpret overlapping activation patterns, and decoding for principled reverse inference. I give each time a statistical view and a cognitive imaging view.
Talk given at PRNI 2016 for the paper https://arxiv.org/pdf/1606.06439v1.pdf
Abstract: Spatially-sparse predictors are good models for brain decoding: they give accurate predictions and their weight maps are interpretable as they focus on a small number of regions. However, the state of the art, based on total variation or graph-net, is computationally costly. Here we introduce sparsity in the local neighborhood of each voxel with social-sparsity, a structured shrinkage operator. We find that, on brain imaging classification problems, social-sparsity performs almost as well as total-variation models and better than graph-net, for a fraction of the computational cost. It also very clearly outlines predictive regions. We give details of the model and the algorithm.
Personal point of view on scikit-learn: past, present, and future.
This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
We present an automated pipeline to learn predictive biomarkers from resting-state fMRI. We apply it to classifying autism on unseen sites, demonstrating the feasibility of biomarkers on weakly standardized functional imaging data.
We study the steps of the pipeline that are important to predict and can show that 1) the choice of atlas is the most important choice. Ideally the atlas should be made of functional regions learned from the data. 2) "tangent space" parametrization of the connectivity is the best performer.
We conclude on general recommendations for predictive biomarkers from resting-state fMRI
Brain maps from machine learning? Spatial regularizationsGael Varoquaux
Pattern Recognition for NeuroImaging (PR4NI)
We will show empirically how commonly used pattern recognition techniques, such as SVMs, provide low-quality brain maps, even though they give very good prediction accuracy. We will give an overview of recently developed techniques to impose priors on patterns particularly well suited to neuroimaging: selecting a small number of spatially-structured predictive brain regions. These tools reconcile machine learning with brain mapping by giving maps more useful to draw neuroscientific conclusions. In addition, they are more robust to cross-individual spatial variability and thus generalize well across subjects.
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
Scikit-learn is a popular machine learning tool. What can it do for you? Why would you want to use it? What can you do with it? Where is it going? In this talk, I will discuss why and how scikit-learn became popular. I will argue that it is successful because of its vision: it fills an important slot in the rich ecosystem of data science. I will demonstrate how scikit-learn makes predictive analysis easy and yet versatile. I will shed some light on our development process: how do we, as a community, ensure the quality and the growth of scikit-learn?
Succeeding in academia despite doing good_softwareGael Varoquaux
Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Keynote I gave at the Scipyconf Argentina 2014 conference
The advancement of science is a noble cause, and academia a fierce battlefield for tenure. Software is seen as a mere technicality, not worth a line on an academic CV. I claim that, on the contrary, software is the new medium of the scientific method. I claim that succeeding in academia can be achieved not despite writing good software but via such an accomplishment. The key is to choose the right battles and to win them.
What is the emerging role of software in the scientific workflow? Which are the software challenges that can have impact? How to balance software quality assurance and the quick turn-around random-walk of research? What does "good design" mean for research software? What Python patterns can boost productivity and reuse in exploratory scientific computing?
I will try to answer these questions, based on my personal experience of growing up to become an academic Pythonista.
Building a cutting-edge data processing environment on a budgetGael Varoquaux
As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NOSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
2. Limited-data settings
n to be compared to:
A measure of the signal-to-noise ratio
The dimensionality of the data, p
Deep learning does not work well in small-sample regimes
But we can borrow ideas
This talk: no silver bullet, many simple (shallow) tricks
G Varoquaux 1
3. Small-n problems are important
83% of data scientists¹ never have n > 1M
n is often small for applications such as medicine
Bigger is better (how to not use this talk)
Get more data (pool related datasets)
Find a related problem and try transfer
This talk: data that differs from common sources
¹ www.kaggle.com/laurae2/data-scientists-vs-size-of-datasets
4. Perils of deep learning with small n
Selecting architecture, learning rate...
A deep architecture is validated by its measured accuracy
⇒ overfitting the validation & test set
Sampling noise for n_test = 1000:
[Figure: binomial distribution of the error on test accuracy, x-axis from -10% to +10%, with an interval marked from -2% to +2%]
Optimizing test accuracy will explore the tails
cf online challenges
Need for guiding principles
[Varoquaux 2018]
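The size of the sampling noise on a measured test accuracy can be checked by simulation. The sketch below draws the number of correct predictions from a binomial distribution; the function name and the true accuracy of 0.75 are illustrative assumptions, not values from the slides.

```python
import numpy as np


def accuracy_noise_interval(true_accuracy, n_test, n_draws=100_000, seed=0):
    """Central 95% interval of (measured - true) accuracy on a test set."""
    rng = np.random.default_rng(seed)
    # Number of correct predictions on each simulated test set
    n_correct = rng.binomial(n_test, true_accuracy, size=n_draws)
    deviation = n_correct / n_test - true_accuracy
    low, high = np.percentile(deviation, [2.5, 97.5])
    return low, high


# true_accuracy=0.75 is an arbitrary illustrative value
low, high = accuracy_noise_interval(true_accuracy=0.75, n_test=1000)
print(f"n_test=1000: measured accuracy deviates by [{low:+.1%}, {high:+.1%}]")
```

Comparing many architectures on the same test set amounts to taking the maximum over many such draws, which is why the tails of this distribution matter.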
5. Outline
1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
2 Matrix factorization and its variants
For signals
For discrete objects
3 Fisher kernels
Kernels feature maps
From likelihoods to Kernels
G Varoquaux 4
6. 1 Representations for machine learning
Defining the notion of representations
Their use for supervised learning
7. 1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
8. Settings: supervised learning
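Given n pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$ drawn i.i.d.,
find a function $f : \mathcal{X} \to \mathcal{Y}$ such that $f(x) \approx y$
Notation: $\hat{y} \overset{\text{def}}{=} f(x)$
Empirical risk minimization
Loss function $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$
Estimation of f: $f^\star = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}\left[ l(\hat{y}, y) \right]$
This course: how to choose good function classes $\mathcal{F}$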
10. Example: finite-sample estimation of f
Data generated with 9th order polynomial + noise
Fit polynomials of various degrees
[Figure: noisy data with polynomial fits of degree 1, 2, 5, and 9, compared to the truth]
Model too simple: underfit
Model too complex: overfit
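A minimal sketch of this experiment, assuming a randomly drawn 9th-order polynomial, 20 training points, and a noise level of 0.5 (all illustrative choices, not the settings behind the original figure):

```python
import numpy as np

rng = np.random.default_rng(0)
true_coefs = rng.normal(size=10)  # hypothetical 9th-order polynomial


def g(x):
    """Noise-free data-generating function."""
    return np.polynomial.polynomial.polyval(x, true_coefs)


def make_data(n, noise=0.5):
    x = rng.uniform(-1, 1, size=n)
    return x, g(x) + noise * rng.normal(size=n)


x_train, y_train = make_data(20)
x_test, y_test = make_data(10_000)

errors = {}
for degree in (1, 2, 5, 9):
    # Least-squares polynomial fit of the given degree
    fit = np.polynomial.Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((fit(x_train) - y_train) ** 2)
    test_mse = np.mean((fit(x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training error can only go down as the degree grows, since the function classes are nested; the test error is the quantity that reveals underfit and overfit.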
15. Theory: the generalization error
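Generalization error of a prediction function f:
Notation: $\mathcal{E}(f) \overset{\text{def}}{=} \mathbb{E}\left[ l(y, f(x)) \right]$
Finite-sample regime
Ideally: $f^\star = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}\left[ l(f(x), y) \right]$
In practice: $\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} l(f(x_i), y_i)$
$\mathcal{E}(\hat{f}) \geq \mathcal{E}(f^\star)$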
16. Theory: decomposing the generalization error
Assuming $y = g(x) + e$, $e$ random with $\mathbb{E}[e] = 0$,
the generalization error of $\hat{f}$ is:
$$\mathcal{E}(\hat{f}) = \mathbb{E}\left[ l(g(x) + e, \hat{f}(x)) \right] = \underbrace{\mathcal{E}(g)}_{\text{Bayes rate}} + \underbrace{\mathcal{E}(f^\star) - \mathcal{E}(g)}_{\text{approximation error}} + \underbrace{\mathcal{E}(\hat{f}) - \mathcal{E}(f^\star)}_{\text{estimation error}}$$
Bayes rate: best possible prediction, $\mathbb{E}\left[ l(g(x) + e, g(x)) \right]$
Due to the noise e
Cannot be avoided
Approximation error: $g \notin \mathcal{F}$, our model is wrong
Decreases for larger $\mathcal{F}$
Empirical upper bound: train error
Estimation error: sampling noise on train data, $\hat{f} \neq f^\star$
Finite-sample problem
Decreases as n grows
Increases for larger $\mathcal{F}$
Guesstimate: difference between train and test error
20. Example: polynomial regression degree
$\hat{f} = \operatorname{argmin}_{f \in \mathcal{F}} \sum_i l(f(x_i), y_i)$
Degree 9, small n: no approximation error, large estimation error
Function class $\mathcal{F}$ not restrictive enough
Degree 1, large n: small estimation error, large approximation error
Function class $\mathcal{F}$ too restrictive
22. Gauging overfit vs underfit: learning curves
[Figure: learning curves showing generalization error and training error as a function of the number of samples (100 to 1000), for polynomials of degree 9 and degree 1; annotations: "Overfit region" and "Underfit? Or Bayes rate?"]
sklearn.model_selection.learning_curve
Estimation error ∼ gap between train and test error
Simpler models reach the asymptotic regime faster (smaller "sample complexity")
But can underfit
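A minimal sketch of computing such learning curves with scikit-learn. The synthetic data (a sine plus noise), the training sizes, and the pipeline are illustrative assumptions, not the setup behind the original figure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 1))
# Hypothetical non-linear signal with additive noise
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=1000)

curves = {}
for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    sizes, train_scores, test_scores = learning_curve(
        model, X, y,
        train_sizes=[20, 100, 800],
        cv=5,
        scoring="neg_mean_squared_error",
    )
    # Scores are negated MSE: flip the sign and average over folds
    curves[degree] = {
        "train": -train_scores.mean(axis=1),
        "test": -test_scores.mean(axis=1),
    }
    for n, tr, te in zip(sizes, curves[degree]["train"], curves[degree]["test"]):
        print(f"degree {degree}, n={n}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The gap between the two curves at a given n is a rough read-out of the estimation error; a plateau that both curves share points to underfit or to the Bayes rate.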
25. Gauging overfit vs underfit: validation curves
[Figure: validation curves showing generalization error and training error as a function of polynomial degree (5 to 15)]
sklearn.model_selection.validation_curve
Reveals underfits
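A minimal sketch of a validation curve over the polynomial degree, again on assumed synthetic data. The step name `polynomialfeatures` is the one `make_pipeline` generates automatically from the class name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
# Hypothetical non-linear signal with additive noise
y = np.sin(3 * X[:, 0]) + 0.3 * rng.normal(size=200)

degrees = [1, 3, 5, 9, 15]
model = make_pipeline(PolynomialFeatures(), LinearRegression())
train_scores, test_scores = validation_curve(
    model, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    cv=5,
    scoring="neg_mean_squared_error",
)
# Scores are negated MSE: flip the sign and average over folds
train_err = -train_scores.mean(axis=1)
test_err = -test_scores.mean(axis=1)
for d, tr, te in zip(degrees, train_err, test_err):
    print(f"degree {d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

A high error on both curves at low degree is the signature of underfit that a learning curve alone cannot distinguish from the Bayes rate.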
26. Linear models for limited-data settings
In high-dimensional limited-data settings,
linear models are often the best choice
For p-dimensional data, $x \in \mathbb{R}^p$,
they have p parameters
n ∼ 200 000
Inpatient mortality, AUROC (95% CI)    Hospital A          Hospital B
Deep learning                          0.95 (0.94-0.96)    0.93 (0.92-0.94)
Baseline (logistic regression)         0.93 (0.92-0.95)    0.91 (0.89-0.92)
27. Theory: Approximating with linear predictors
Linear predictor¹: $\hat{y} = x^T w$, $w \in \mathbb{R}^p$
Data model: $y = x^T w^\star + \delta(x) + e$, $\mathbb{E}[e] = 0$
$x^T w^\star$: best linear predictor
Ridge estimator:
$$\hat{w} = \operatorname{argmin}_w \|y_\text{train} - X_\text{train} w\|^2_\text{Fro} + \lambda \|w\|_2^2$$
Error compared to best linear predictor:
$$\mathbb{E}\left[ \|y - x^T \hat{w}\|_2^2 \right] = \mathbb{E}\left[ \|y - x^T w^\star\|_2^2 \right] + o\left( \sigma^2 p / n_\text{train} \right)$$
[Hsu... 2014, sec 2.5]
Random design analysis can characterize the generalization error without assuming a correct data-generating model (misspecified model) [Hsu... 2014, Rosset and Tibshirani 2018]
Approximation error: data not linearly generated ⇒ craft more features
Estimation error: curse of dimensionality ⇒ limit number of features
¹ Predictor, not model: we do not assume it is a data-generating process.
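The dependence of the estimation error on p / n_train can be checked numerically. The sketch below assumes a well-specified linear model (δ(x) = 0), Gaussian covariates, and arbitrary values for p, σ, and the ridge penalty; it measures the excess test error of a ridge fit over the best linear predictor.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
p, sigma = 50, 1.0                        # illustrative dimension and noise level
w_star = rng.normal(size=p) / np.sqrt(p)  # hypothetical best linear predictor


def sample(n):
    X = rng.normal(size=(n, p))
    y = X @ w_star + sigma * rng.normal(size=n)
    return X, y


X_test, y_test = sample(20_000)
best_linear_mse = np.mean((y_test - X_test @ w_star) ** 2)

excess = {}
for n_train in (100, 1000, 10_000):
    X_train, y_train = sample(n_train)
    ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X_train, y_train)
    mse = np.mean((y_test - ridge.predict(X_test)) ** 2)
    # Error on top of what the best linear predictor achieves
    excess[n_train] = mse - best_linear_mse
    print(f"n_train={n_train}: excess MSE {excess[n_train]:.4f}, "
          f"sigma^2 p / n = {sigma**2 * p / n_train:.4f}")
```

Adding features lowers the approximation term but raises p, so the two remedies listed above pull in opposite directions.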
29. Example: extrapolating sea level (tides)
Predict sea level as a function of time
Test outside of observed range¹
¹ Technically, this is not in our theory: the test set is not drawn from the same distribution as the train set.
36. Example: extrapolating sea level (tides)
[Figure: extrapolation for two choices of covariates, polynomial regression and a sines and cosines basis, each with dim=10, dim=100, and dim=1000]
Choice of covariates / basis / signal representation
⇒ huge difference on approximation error
⇒ huge difference on generalization error
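A minimal sketch of the effect of the basis on extrapolation. The tide-like signal, its periods, and the list of basis periods are all assumptions made for illustration; in particular the sines and cosines basis is assumed to contain the periods of the signal.

```python
import numpy as np

rng = np.random.default_rng(0)


def sea_level(t):
    # Hypothetical tide-like signal: two periodic components
    return np.sin(2 * np.pi * t / 12.4) + 0.5 * np.sin(2 * np.pi * t / 24.0)


t_train = np.sort(rng.uniform(0, 100, size=200))
t_test = np.linspace(100, 150, 200)  # outside of the observed range
y_train = sea_level(t_train) + 0.1 * rng.normal(size=t_train.size)
y_test = sea_level(t_test)


def polynomial_features(t, dim):
    # Monomials 1, t, ..., t^(dim-1), with time rescaled for conditioning
    return np.vander(t / 100.0, dim, increasing=True)


def fourier_features(t, periods):
    angles = 2 * np.pi * t[:, None] / np.asarray(periods)[None, :]
    return np.hstack([np.sin(angles), np.cos(angles)])


def fit_predict(F_train, y, F_test):
    # Ordinary least squares on the chosen covariates
    w, *_ = np.linalg.lstsq(F_train, y, rcond=None)
    return F_test @ w


periods = [6.2, 8.0, 12.4, 24.0, 48.0]  # 10 covariates in total
pred_poly = fit_predict(polynomial_features(t_train, 10), y_train,
                        polynomial_features(t_test, 10))
pred_fourier = fit_predict(fourier_features(t_train, periods), y_train,
                           fourier_features(t_test, periods))

poly_mse = np.mean((pred_poly - y_test) ** 2)
fourier_mse = np.mean((pred_fourier - y_test) ** 2)
print(f"polynomial basis: test MSE {poly_mse:.3g}")
print(f"sines and cosines basis: test MSE {fourier_mse:.3g}")
```

Both models are linear in their covariates and have the same dimension; only the representation of time differs.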
37. Summary
$\hat{y} = f(x)$, f chosen in $\mathcal{F}$
to minimize the observed error $\sum_{i \in \text{train}} l(f(x_i), y_i)$
Generalization error:
- approximation error ⇒ $\mathcal{F}$ adapted to the data
- estimation error ⇒ $\mathcal{F}$ small
Limited-data settings
Linear models are the best option when n is not large compared to p
A good choice of covariates is crucial
38. 1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
39. Representations to build F
Settings
$z = r(x)$: representation of the data, $z \in \mathbb{R}^k$
Predictor $f : x \mapsto \hat{y} = h_w(r(x))$
Function composition: "depth"
Benefits
For expressiveness: composition is more expressive than basis expansion
Composing L rectifying functions on intermediate representations of dimension k gives $O\left( (k/p)^{p(L-1)} k^p \right)$ linear regions.
Basis expansion + linear predictor gives $O(k)$
Exponential in depth, linear with dimension [Montufar... 2014]
For multi-tasks: sharing representations across tasks
y multidimensional
For limited data: $h_w(z) = w^T z$, a linear predictor
A good choice of z can decrease sample complexity
Transfer: r is learned on large data; a simple h used.
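A minimal sketch of the transfer setting: a representation r learned on a large unlabeled sample, followed by a linear predictor h fitted on few labeled points. PCA stands in for r purely for illustration, and the synthetic data (a 5-dimensional latent signal observed in 200 dimensions) is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
p, k = 200, 5
mixing = rng.normal(size=(k, p))  # hypothetical map from latent factors to x
beta = np.ones(k)                 # hypothetical link from latent factors to y


def sample(n):
    latent = rng.normal(size=(n, k))
    X = latent @ mixing + 0.5 * rng.normal(size=(n, p))
    y = latent @ beta + 0.1 * rng.normal(size=n)
    return X, y


X_large, _ = sample(5000)       # large sample, labels unused
X_small, y_small = sample(50)   # small labeled sample
X_test, y_test = sample(2000)

# r: representation learned on the large sample
r = PCA(n_components=k).fit(X_large)
# h_w: linear predictor fitted on the k-dimensional representation z = r(x)
h = Ridge(alpha=1.0).fit(r.transform(X_small), y_small)

r2 = h.score(r.transform(X_test), y_test)
print(f"R^2 of h(r(x)) with 50 labeled samples: {r2:.3f}")
```

Only k parameters are fitted on the labeled data, so the estimation error is governed by k rather than by p.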
44. Background: Information theory
Entropy = amount of information in x
$H(x) = \mathbb{E}_p[-\log p(x)]$
Equi-probable distribution = high entropy
Uneven distribution = low entropy
[Figure: bar plots of P over x = 0, ..., 5 for each of the two distributions]
Mutual information between x and y
$I(x; y) = H(x) + H(y) - H(x, y)$
$x \perp\!\!\!\perp y$ (independent) ⇔ $I(x; y) = 0$
Independence ⇔ $p(x, y) = p(x)\,p(y)$, in which case
$$H(x, y) = -\mathbb{E}_{(x,y)}[\log p(x, y)] = -\mathbb{E}_{(x,y)}[\log p(x) + \log p(y)] = -\mathbb{E}_x[\log p(x)] - \mathbb{E}_y[\log p(y)] = H(x) + H(y)$$
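These quantities are easy to compute for discrete distributions. The sketch below uses the standard definitions, with I(x; y) = H(x) + H(y) - H(x, y); the example distributions are made up for illustration.

```python
import numpy as np


def entropy(p):
    """Entropy (in nats) of a discrete distribution given as an array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0 log 0 is taken to be 0
    return -np.sum(p * np.log(p))


def mutual_information(joint):
    """I(x; y) = H(x) + H(y) - H(x, y) from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(joint)


uniform = np.full(6, 1 / 6)
uneven = np.array([0.9, 0.02, 0.02, 0.02, 0.02, 0.02])
print("entropy, equi-probable:", entropy(uniform))
print("entropy, uneven:", entropy(uneven))

independent = np.outer(uniform, uneven)  # p(x, y) = p(x) p(y)
dependent = np.diag(uniform)             # y is a copy of x
print("I(x; y), independent:", mutual_information(independent))
print("I(x; y), y = x:", mutual_information(dependent))
```

When y is a copy of x, the mutual information equals the entropy of x: knowing one variable removes all uncertainty about the other.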
45. Theory: information in representations
A representation z of x is sufficient for y if $y \perp\!\!\!\perp x \mid z$,
or equivalently if $I(z; y) = I(x; y)$
x, z, y form a Markov chain if $p(y \mid x, z) = p(y \mid z)$:
$x \to z \to y$
Data processing inequality: $I(x; y) \leq I(x; z)$
A sufficient representation z is minimal when
$I(x; z)$ is smallest among sufficient representations
[Achille and Soatto 2018]
46. Nuisances and invariances
A nuisance n: I(x, n) ≥ 0, but I(y, n) = 0
Representation z is invariant to the nuisance n
if z ⊥⊥ n, or I(z; n) = 0 ⇒ We want I(z; n) low
In a Markov chain x → z1 → z2 · · · → zL → y
If z is a sufficient representation for y,
I(z; n) ≤ I(z; x) − I(x; y)
Communication bottleneck: I(z1; z2) < I(z1; x)
⇒ I(z2; n) ≤ I(z1; z2) − I(x; y)
Stacking increases invariance
[Achille and Soatto 2018]
47. Invariant representations on a continuous space
Shift-invariant representation = Fourier basis
Fourier transform of a signal $s_t$: $F(s)_f = \sum_t e^{-i f t} s_t$ (with $i$ the imaginary unit)
Shifting the signal: $s_t \to s'_t = s_{t+k}$
$F(s')_f = \sum_t e^{-i f t} s_{t+k} = \sum_t e^{-i f (t-k)} s_t = e^{i k f} \sum_t e^{-i f t} s_t = e^{i k f} F(s)_f$
→ a shift only changes the phase
An orthonormal basis of shift-invariant vectors
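The phase-shift identity above can be verified with a discrete Fourier transform. This is a minimal sketch assuming a circular shift on a finite random signal (the signal length and shift `k` are arbitrary choices): the modulus of the transform is unchanged by the shift.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 5
s = rng.normal(size=n)

# Circular shift: s_shifted[t] = s[(t + k) mod n]
s_shifted = np.roll(s, -k)

F = np.fft.fft(s)
F_shifted = np.fft.fft(s_shifted)

# Frequencies used by the DFT: f = 2 pi m / n
f = 2 * np.pi * np.arange(n) / n
phase = np.exp(1j * k * f)

# The shift only multiplies each coefficient by a phase e^{i k f}
print(np.allclose(F_shifted, phase * F))
# Hence the modulus is shift-invariant
print(np.allclose(np.abs(F_shifted), np.abs(F)))
```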
48. Invariant representations on a continuous space
Shift invariance = Fourier basis
Local deformations = Wavelets
Locally equivalent to Fourier basis
But without the global extent
Decimated wavelets
Isometric transform of the signal
Higher scales lose shift invariance
Redundant wavelets
Increase the dimensionality
Good shift invariance
G Varoquaux 23
49. Representations invariant to rich deformations
Scaling
Rotations
Deformations
Ingredients
Modulus of wavelet / Fourier transform
⇒ non linearity & filter banks (convolutions)
+ stacking (repeating simple invariants)
Scattering transform
Derived from first principles
Building first-order invariants
Convolutional networks
Learned from data
Pooling across pixels (eg max)
[Mallat 2016]
50. Summary
Intermediate representations give
expressiveness to predictive models
Good representations keep predictive information
and lose nuisance information
Bottleneck and regularization to lose information
Limited-data settings
Given known invariants of the problem,
reusing existing representations helps
eg headless conv-net, wavelets... [Oyallon... 2017]
51. 1 Representations for machine learning
Non-asymptotic supervised learning
Learning with representations
Supervised learning of representations
52. The need for supervision
Maximizing I(z; y) (≤ I(x; y)) sufficient representations
⇒ supervised learning
while minimizing I(z; n) nuisance
⇒ sampling nuisance / invariants
data augmentation
Challenge: amount of labeled data
Pretext tasks
Other targets y that capture useful information
Finding them needs domain knowledge
G Varoquaux 27
53. Deep architectures
$\hat{y} = f^d_{W_d} \circ \dots \circ f^1_{W_1}(x)$
Typically $f^k_{W_k}(x) = g_k(W_k^T x)$, with $g_k$ an element-wise non-linearity
Thus $\hat{y} = g_d(W_d^T \dots g_1(W_1^T x))$
Stacked representations: $W_k$
$\{W_k\}$ optimized to minimize a prediction error
54. Shallow architectures for limited data
Keep one latent layer
Without non-linearity:
$\hat{y} = x^T W_1 W_2$, $y \in \mathbb{R}^k$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$:
factored / reduced-rank linear model
Multi-task / multi-output literature
⇒ structured loss (multiple soft-max’s)
Overparametrization sometimes useful: d > k
can be achieved with dropout
[Bzdok... 2015, Mensch... 2018]
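55. Simple case: square loss = reduced rank regression
$\hat{Y} = X W_1 W_2$, $Y \in \mathbb{R}^{n \times k}$, $W_1 \in \mathbb{R}^{p \times d}$, $W_2 \in \mathbb{R}^{d \times k}$
$\hat{W}_1, \hat{W}_2 = \operatorname{argmin}_{W_1, W_2} \|\hat{Y} - Y_{\text{train}}\|^2_{\text{Fro}}$
For squared loss the problem is convex
Full-rank solution¹ (X and Y on train set):
$\hat{W} = \hat{\Sigma}_X^{-1} X^T Y$, $\hat{Y} = X \hat{W} = X \hat{\Sigma}_X^{-1} X^T Y$
Rank-d solution: [Izenman 1975, Rahim... 2017b]
$\hat{R}_d \overset{\text{def}}{=} Y^T \hat{Y} \in \mathbb{R}^{k \times k}$, SVD: $\hat{R}_d \approx \hat{U}_d \hat{s}_d \hat{V}_d$, $\hat{U}_d \in \mathbb{R}^{k \times d}$
then $\hat{W}_1 = \hat{\Sigma}_X^{-1} X^T Y \hat{U}_d$ and $\hat{W}_2 = \hat{U}_d^T$
so that $\hat{W}_1 \hat{W}_2$ = (full-rank solution) × (rank-d projector $\hat{U}_d \hat{U}_d^T$)²
¹ No need for pesky SGDs
² The projector captures the variance explained on the multiple outputs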
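The closed-form rank-d solution above can be sketched in a few lines of NumPy. This is an illustration on simulated data, not the author's code: the function name `reduced_rank_regression` and the data sizes are assumptions, and the full-rank step uses a least-squares solver instead of forming the inverse covariance explicitly.

```python
import numpy as np

def reduced_rank_regression(X, Y, d):
    """Rank-d linear model Y ~ X W1 W2, following the closed form on the slide."""
    # Full-rank least-squares solution: W = (X^T X)^{-1} X^T Y
    W_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ W_full
    # k x k matrix R = Y^T Y_hat; its leading singular vectors span
    # the d output directions that are best explained
    U, s, Vt = np.linalg.svd(Y.T @ Y_hat)
    U_d = U[:, :d]
    W1 = W_full @ U_d   # p x d
    W2 = U_d.T          # d x k
    return W1, W2

# Simulated data with a true rank-2 coefficient matrix (hypothetical sizes)
rng = np.random.default_rng(0)
n, p, k, d = 200, 10, 6, 2
W_true = rng.normal(size=(p, d)) @ rng.normal(size=(d, k))
X = rng.normal(size=(n, p))
Y = X @ W_true + 0.1 * rng.normal(size=(n, k))

W1, W2 = reduced_rank_regression(X, Y, d)
print(W1.shape, W2.shape, np.linalg.matrix_rank(W1 @ W2))
```

With d = k the projector is the identity and the product `W1 @ W2` falls back to the ordinary least-squares solution.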
57. Model stacking
x →(f1) z →(f2) y
Learn f1 separately
Directly supervising z:
z = ŷ for a (simple) predictive model
Trick: “cross-fit” during training
obtain ŷ by splitting the training data
[Figure: the full data split into train set and test set]
(in sklearn: cross_val_predict)
Application: tackling dimensionality [Rahim... 2017a]
Some features are a high-dimensional signal
eg medical images
f1: linear to reduce signal features
f2: non-linear (eg trees) on all features
58. Model stacking to encode discrete items
Sex  Date Hired  Employee Position        Salary (to predict)
M    09/12/1988  Master Police Officer    69222.18
F    06/26/2006  Social Worker III        97392.47
M    07/16/2007  Police Officer III       104717.28
Difficulty: number of different positions
what invariants?
[Figure: distribution of y (employee salary, from 40000 to 140000) for each position: Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
Target encoding¹ [Micci-Barreca 2001]
position → $\mathbb{E}_{\text{train}}[\text{salary} \mid \text{position}]$
¹ To inject categories in $\mathbb{R}$, before a second level that combines all columns
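Target encoding combined with the cross-fit trick of the stacking slides can be sketched in plain Python. Everything here is illustrative: the helper names, the interleaved fold assignment, the fallback to the global training mean for unseen categories, and the toy salaries (in thousands) are assumptions, not taken from the course.

```python
def target_encode(train_cats, train_y, cats, default):
    """Map each category in `cats` to the mean target observed for it on the train set."""
    sums, counts = {}, {}
    for c, y in zip(train_cats, train_y):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    # Categories never seen in training fall back to `default`
    return [sums[c] / counts[c] if c in counts else default for c in cats]

def cross_fit_target_encode(cats, y, n_folds=3):
    """Encode each sample with statistics computed on the other folds only."""
    n = len(cats)
    encoded = [None] * n
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        train_y = [y[i] for i in train_idx]
        global_mean = sum(train_y) / len(train_y)
        enc = target_encode([cats[i] for i in train_idx], train_y,
                            [cats[i] for i in test_idx], global_mean)
        for i, e in zip(test_idx, enc):
            encoded[i] = e
    return encoded

# Hypothetical toy data (salaries in thousands)
positions = ["Police Officer III", "Social Worker III", "Police Officer III",
             "Manager II", "Social Worker III", "Police Officer III"]
salaries = [60.0, 50.0, 70.0, 120.0, 54.0, 80.0]

z_train = cross_fit_target_encode(positions, salaries)  # feature for the second-level model
z_new = target_encode(positions, salaries, ["Police Officer III"], default=0.0)
print(z_train, z_new)
```

Cross-fitting matters because encoding a training sample with its own salary would leak the target into the feature given to the second-level model.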
59. Summary
Supervision helps selecting
the relevant part of the signal
In limited-sample settings, simple
models can create representations
Simple latent-factor models
Multi-output models
Stacking: fit a first-level model
G Varoquaux 33
60. Summary of first section
For generalization: small family of functions $f_w$ that
approximate the signal well
Generalization of a linear predictor:
approximation error + $O(p / n_{\text{train}})$
Predictors by composition: ŷ = f2(z), z = f1(x)
x →(f1) z →(f2) y; ideally, f1 makes z invariant to nuisances
Reuse representations with the right invariances:
wavelets, fasttext, pretrained headless neural nets
Simple supervised models
can create representations:
stacking, multi-output, pretext tasks
61. 2 Matrix factorization and its
variants
Simple unsupervised representation learning
More unlabeled data than labeled data
Learn representations and transfer them
Here: Focus on simple models for limited n or low SNR settings
Particularly interesting regime: p large and n large.
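64. Principal Component Analysis
Find the directions of largest variance
Computation
$X \in \mathbb{R}^{n \times p}$, $\Sigma_X = X^T X \in \mathbb{R}^{p \times p}$
PCA projector: $P_{\text{PCA}} \in \mathbb{R}^{p \times k}$, from $\text{SVD}_k(X)$ or $\text{EVD}_k(\Sigma_X)$
Reduced X: $X P_{\text{PCA}} \in \mathbb{R}^{n \times k}$
Model: low-rank Gaussian latent factors
$X \approx U V + E$, $E \sim \mathcal{N}(0, I_p)$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$
$\hat{U}, \hat{V} = \operatorname{argmin}_{U, V} \|X - U V\|^2_{\text{Fro}}$
Rotationally invariant: $U' = U O$, $V' = O^T V$ also solution for $O$ s.t. $O^T O = I$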
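The two routes to the PCA projector mentioned above (truncated SVD of X, or eigendecomposition of $X^T X$) can be compared numerically. This is a minimal sketch on simulated data; centering X first is an assumption that the slide leaves implicit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 8, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)  # assumption: the data are centered

# Route 1: truncated SVD of X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:k].T                 # p x k projector
X_reduced = X @ P_pca            # n x k reduced data

# Route 2: eigendecomposition of Sigma_X = X^T X (eigh returns ascending order)
evals, evecs = np.linalg.eigh(X.T @ X)
P_evd = evecs[:, ::-1][:, :k]

# Both routes give the same directions, up to the sign of each column
print(np.round(np.abs(P_pca.T @ P_evd), 6))
```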
65. Principal Component Analysis
Find the directions of largest variance
In a learning pipeline
Useful for dimensionality reduction (eg p is large)
Eases statistics and computations
Generalization error of PCA + OLS
within a factor of 4 of ridge
[Dhillon... 2013]
G Varoquaux 37
66. Beyond variance: Independent Component Analysis
Separate out signals U observed mixed1
True sources, signals U
Observations (mixed signal)
ICA recovered signals
¹ Classic ICA has no noise model: it does not do dimension reduction
67. Beyond variance: Independent Component Analysis
Separate out signals U observed mixed¹
Model: $X = U V$, $V \in \mathbb{R}^{p \times p}$, $V^T V = I_p$
If U is Gaussian, the model is not identifiable
Seek low mutual information across $\{u_j\}$
⇒ Maximally non-Gaussian marginals [Cardoso 2003]
[Figure: latent signals U; observed data U V]
Computation: FastICA [Hyvärinen and Oja 2000]
Power iterations on V
Each time:
- apply a smooth increasing non-linearity on $\{u_j\}$
- decorrelate
Preprocessing: whiten the data eg with PCA
¹ Classic ICA has no noise model: it does not do dimension reduction
69. ICA to learn representations
Across patches of natural images:
Gabor-like filters
Similar to wavelets
and first layer of convnets
[Hyvärinen and Oja 2000]
70. Dictionary learning
Find vectors V that represent the signal well
with sparse combinations U
Model: $X = U V$ s.t. U is sparse
$U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$
k can be > p (overcomplete dictionary)
Estimation: $\hat{U}, \hat{V} = \operatorname{argmin}_{U, V \text{ s.t. } \|v_i\|_2^2 \leq 1} \|X - U V\|^2_{\text{Fro}} + \lambda \|U\|_1$
Combining squared loss and $\ell_1$ penalty creates sparsity
Constraint on $\|v_i\|_2^2$ required to avoid cancelling out the penalty with
$V \to \infty$ and $U \to 0$
71. Dictionary learning
Find vectors V that represent the signal well
with sparse combinations U
Model: $X = U V$ s.t. U is sparse
$U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times p}$
k can be > p (overcomplete dictionary)
Estimation: $\hat{U}, \hat{V} = \operatorname{argmin}_{U, V \text{ s.t. } V \in C} \|X - U V\|^2_{\text{Fro}} + \lambda \Omega(U)$
Constraint set and penalty can be varied¹
Typically, $\ell_2$, $\ell_1$, and positivity² on U or V.
¹ Fast when C and Ω lead to simple projections and penalized regression.
² Recovers a form of NMF (non-negative matrix factorization)
72. Sparse dictionary learning to learn representations
Across patches of natural images:
Also learns Gabor-like filters1
Good for sparse models,
eg for denoising
¹ As do ICA, K-Means, etc. on image patches
[Mairal... 2014]
73. Large n large p: brain imaging
Brain activity at rest
1000 subjects with
∼ 100–10 000 samples
Images of dimensionality
> 100 000
Dense matrix, large both ways
[Figure: X = U · V + E, with X a time × voxels matrix]
74. Large n large p: recommender systems
[Figure: sparse users × products matrix of ratings]
Product ratings
Millions of entries
Hundreds of thousands of
products and users
Large sparse matrix
[Figure: X = U · V + E, with X a users × products matrix]
75. Online estimation: stochastic optimization
$\min_w \sum_i l(x_i^T w)$
Many samples: $\min_w \mathbb{E}[l(y, x^T w)]$
Gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t \nabla_w l$
Stochastic gradient descent: $w_{t+1} \leftarrow w_t - \alpha_t \mathbb{E}[\nabla_w l]$
Use a cheap estimate of $\mathbb{E}[\nabla_w l]$ (e.g. subsampling)
$\alpha_t$ must decrease
“suitably” with t.
Those pesky learning rates
76. Online estimation for matrix factorization
Large matrices
= terabytes of data
$\operatorname{argmin}_{U, V} \|X - U V\|^2_{\text{Fro}} + \lambda \Omega(U)$
[Figure: alternating minimization on the data matrix: data access, code computation, dictionary update; stream columns]
[Mairal... 2010]
77. Online estimation for matrix factorization
Large matrices
= terabytes of data
$\operatorname{argmin}_{U, V} \|X - U V\|^2_{\text{Fro}} + \lambda \Omega(U)$
Rewrite as an expectation:
$\operatorname{argmin}_V \sum_i \min_u \|X_i - V u\|^2_{\text{Fro}} + \lambda \Omega(u)$, that is $\operatorname{argmin}_V \mathbb{E}[f(V)]$
⇒ Optimize on approximations (sub-samples)
[Mairal... 2010]
78. Online estimation for matrix factorization
[Figure: alternating minimization compared with online matrix factorization, which streams columns of the data matrix (seen at t, seen at t+1, unseen at t); steps: data access, code computation, dictionary update]
[Mairal... 2010]
79. Online estimation for matrix factorization
[Figure: same comparison, adding subsampled & online factorization, which streams columns and subsamples rows]
[Mensch... 2017]
80. Online matrix factorization algorithm [Mairal... 2010]
Stream samples $x_t$:
1. Compute code
$u_t = \operatorname{argmin}_{u \in \mathbb{R}^k} \|x_t - V_{t-1} u\|_2^2 + \lambda \Omega(u)$
2. Update the surrogate function
$g_t(V) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - V u_i\|_2^2 = \operatorname{tr}\left(\frac{1}{2} V^T V A_t - V^T B_t\right)$ (up to scaling and an additive constant)
$g_t(V)$ is a surrogate of $\sum_x l(x, V)$: $u_i$ is used, and not the minimizing $u$
$A_t \overset{\text{def}}{=} (1 - \frac{1}{t}) A_{t-1} + \frac{1}{t} u_t u_t^T$, $B_t \overset{\text{def}}{=} (1 - \frac{1}{t}) B_{t-1} + \frac{1}{t} x_t u_t^T$
$A_t$ and $B_t$ are sufficient statistics of the loss
accumulated over the data
3. Minimize surrogate
$V_t = \operatorname{argmin}_{V \in C} g_t(V)$, $\nabla g_t = V A_t - B_t$
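The three steps of the online algorithm can be sketched in NumPy. This is an illustration under stated assumptions, not the reference implementation: the penalty Ω is taken to be a ridge penalty so that the code has a closed form, the constraint set C is the unit ball on the columns of V, and the surrogate is minimized by one pass of block coordinate descent on the columns. The function names and the simulated data are hypothetical.

```python
import numpy as np

def online_mf(X, V_init, lam=0.1):
    """Online matrix factorization sketch. X: (n, p) samples; V: (p, k) dictionary."""
    V = V_init.copy()
    p, k = V.shape
    A = np.zeros((k, k))
    B = np.zeros((p, k))
    for t, x in enumerate(X, start=1):
        # 1. Compute the code of the new sample (assumed ridge penalty: closed form)
        u = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ x)
        # 2. Update the sufficient statistics of the surrogate
        A = (1 - 1 / t) * A + np.outer(u, u) / t
        B = (1 - 1 / t) * B + np.outer(x, u) / t
        # 3. Minimize the surrogate: one pass of block coordinate descent,
        #    using the gradient V A - B, then projection on the unit ball
        for j in range(k):
            if A[j, j] > 1e-12:
                V[:, j] += (B[:, j] - V @ A[:, j]) / A[j, j]
                V[:, j] /= max(1.0, np.linalg.norm(V[:, j]))
    return V

def reconstruction_error(X, V, lam=0.1):
    """Mean squared residual after recomputing ridge codes for a fixed dictionary V."""
    k = V.shape[1]
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ X.T).T
    return np.mean((X - U @ V.T) ** 2)

# Simulated exactly rank-3 data (hypothetical sizes)
rng = np.random.default_rng(0)
n, p, k = 500, 20, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))

V_init = rng.normal(size=(p, k))
V_init /= np.linalg.norm(V_init, axis=0)
V = online_mf(X, V_init)
print(reconstruction_error(X, V_init), reconstruction_error(X, V))
```

Note that no learning rate appears: each sample only updates the running averages `A` and `B`, and the dictionary update reuses them.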
84. Stochastic Majorization-Minimization [Mairal 2013]
$V = \operatorname{argmin}_{V \in C} \sum_x l(x, V)$ where $l(x, V) = \min_u f(x, V, u)$
Algorithm:
$g_t(V)$ is a majorant of $\sum_x l(x, V)$: $u_i$ is used, and not the minimizing $u$
⇒ Majorization-Minimization scheme¹
SMM: surrogate computation and full minimization
⇒ 2nd order information, no learning rate
¹ SOMF uses an approximate majorant and minimization [Mensch... 2017]
85. Experimental convergence: large images
[Figure: test objective value as a function of time, for OMF (r = 1), SOMF (r = 4, 6, 8, 12, 24) and best step-size SGD. Panels: ADHD, sparse dictionary, 2 GB; Aviris, NMF, 103 GB; Aviris, dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB]
SOMF = Subsampled Online Matrix Factorization
87. Summary
Versatile matrix-factorization formulation¹
$\operatorname{argmin}_{U \in \mathbb{R}^{n \times k}, V \in C} \|X - U V\|^2_{\text{Fro}} + \lambda \Omega(U)$
Estimation
Stochastic majorization-minimization²
⇒ an online alternated optimization
Example use of learned representations
Biomarkers of autism on brain images:
p ∼ 100 000, n ∼ 1 000 [Abraham... 2017]
¹ 1-layer linear autoencoder
² Common case algorithm readily usable in scikit-learn:
MiniBatchDictionaryLearning
89. Gamma-Poisson for factorizing counts [Canny 2004]
When X is a matrix of counts
- Recommender systems [Gopalan... 2014]
- Database string entries [Cerda and Varoquaux 2019]
⇒ Poisson loss, instead of squared loss
$p(x_j | u, V) = \text{Poisson}((u V)_j) = \frac{1}{x_j!} (u V)_j^{x_j} e^{-(u V)_j}$
u are loadings, modeled as random with a
Gamma prior³
$p(u_i) = \frac{u_i^{\alpha_i - 1} e^{-u_i / \beta_i}}{\beta_i^{\alpha_i} \Gamma(\alpha_i)}$
Maximum a posteriori estimation:
$\hat{U}, \hat{V} = \operatorname{argmin}_{U, V} -\left(\sum_j \log p(x_j | u, V) + \sum_i \log p(u_i)\right)$
³ Because it is the conjugate prior of the Poisson, it imposes soft sparsity,
and it lifts the rotational invariance
91. Gamma-Poisson estimation
Full log-likelihood expression:
$\log L = \sum_{j=1}^{p} \left[x_j \log((u V)_j) - (u V)_j - \log(x_j!)\right] + \sum_{i=1}^{k} \left[(\alpha_i - 1) \log(u_i) - \frac{u_i}{\beta_i} - \alpha_i \log \beta_i - \log \Gamma(\alpha_i)\right]$
Gradients:
$\frac{\partial}{\partial V_{ij}} \log L = \frac{x_j}{(u V)_j} u_i - u_i$
$\frac{\partial}{\partial u_i} \log L = \sum_{j=1}^{p} \left[\frac{x_j}{(u V)_j} V_{ij} - V_{ij}\right] + \frac{\alpha_i - 1}{u_i} - \frac{1}{\beta_i}$
Equivalent to some NMF formulation: multiplicative updates¹
$V_{ij} \leftarrow V_{ij} \left(\sum_{\ell=1}^{n} \frac{x_{\ell j}}{(U V)_{\ell j}} u_{\ell i}\right) \left(\sum_{\ell=1}^{n} u_{\ell i}\right)^{-1}$
$u_{\ell i} \leftarrow u_{\ell i} \left(\sum_{j=1}^{p} \frac{x_{\ell j}}{(U V)_{\ell j}} V_{ij} + \frac{\alpha_i - 1}{u_{\ell i}}\right) \left(\sum_{j=1}^{p} V_{ij} + \beta_i^{-1}\right)^{-1}$
¹ Efficient implementation with sparse matrices: the summations can be
done only on non-zero entries of X.
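93. Adapt the majorization minimization algorithm
(.* and ./ denote element-wise product and division, 1 a vector of ones)
while $\|V^{(t)} - V^{(t-1)}\|_F > \eta$ do
    draw $x_t$ from the training set.
    while $\|u_t - u_t^{\text{old}}\|_2 > \epsilon$ do
        $u_t \leftarrow u_t \,.\!*\, \left[\left(\frac{x_t}{u_t V^{(t)}}\right) V^{(t)T} + \frac{a - 1}{u_t}\right] ./ \left[1\, V^{(t)T} + b^{-1}\right]$
    $\tilde{A}_t \leftarrow V^{(t)} \,.\!*\, \left[u_t^T \left(\frac{x_t}{u_t V^{(t)}}\right)\right]$
    $\tilde{B}_t \leftarrow u_t^T 1$
    $A^{(t)} \leftarrow \rho A^{(t-1)} + \tilde{A}_t$
    $B^{(t)} \leftarrow \rho B^{(t-1)} + \tilde{B}_t$
    $V^{(t)} \leftarrow A^{(t)} ./ B^{(t)}$
    $t \leftarrow t + 1$
[Lefevre... 2011, Cerda and Varoquaux 2019]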
94. Application: sub-string representation
Problem: representing non-normalized categories
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
G Varoquaux 55[Cerda and Varoquaux 2019]
95. Application: sub-string representation
Gamma-Poisson factorization on sub-strings counts
Models strings as a linear combination of substrings
[Figure: matrix of 3-gram counts (3-grams: pol, lic, ice, ce_, _of, off, fic, cer, er_; entries: police, officer, pol off, polis, policeman, policier), factorized into two matrices: what latent categories are in an entry, and what substrings are in a latent category]
[Cerda and Varoquaux 2019]
97. Application: sub-string representation
Representations that extract latent categories
[Figure: activations of the latent categories for the entries Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]
98. Application: sub-string representation
Inferring plausible feature names
[Figure: same activations, with inferred feature names built from frequent words of each latent category, ending in: library, operator, specialist, warehouse, manager, community, rescue, officer]
[Cerda and Varoquaux 2019]
99. Natural language processing: topic-modeling history
Topic modeling: embedding documents¹
[Figure: documents × terms count matrix (terms: the, Python, performance, profiling, module, is, code, can, a), factorized into a documents × topics matrix (what documents are in a topic) and a topics × terms matrix (what terms are in a topic)]
LSA (Latent Semantic Analysis) [Landauer... 1998]
SVD² of the terms×documents matrix
¹ Typically for information retrieval purposes, aka search engines
² Later: refinements for more complex losses: LDA (Latent Dirichlet Allocation)
[Blei... 2003] and Gamma-Poisson [Canny 2004].
100. Word embeddings
Distributional semantics: meaning of words
“You shall know a word by the company it keeps”
Firth, 1957
Example: A glass of red ___, please
Could be wine, maybe juice?
wine and juice have related meanings
Factorization of the word×context matrix
What choice of context?
What loss?
word2vec [Mikolov... 2013a] glove [Pennington... 2014]
G Varoquaux 59
101. Word2vec: skip-gram sampling [Mikolov... 2013b]
$\{\hat{u}_w, \hat{v}_c\} = \operatorname{argmax}_{\{u_w, v_c\}} \sum_{(w, c)} \log \text{softmax}(V u_w^T)_c$
where the sum runs over pairs of words (w, c) in the same window¹
$\text{softmax}(z)_i = \frac{\exp z_i}{\sum_j \exp z_j}$
$u_w \in \mathbb{R}^k$: embedding of word w
$V \in \mathbb{R}^{\text{card}(voc) \times k}$: $[v_c, c \in voc]$, all context words
Big sum on contexts
⇒ solved by SGD²
[Figure: U, word embedding of the center word, and V, context embedding of the context word, for the words salad, meat, juice, wine, glass, green, red]
Other view:
Language models
Prediction of words
¹ These windows are called skip-grams
² Efficient: never build the matrix, stream directly from text.
102. Word2vec: negative sampling [Mikolov... 2013a]
Costly loss: $\log \text{softmax}(z)_i = \log \frac{\exp z_i}{\sum_j \exp z_j}$
Approximate¹: huge sum in softmax (all vocabulary)
Downsample it by drawing the positive (numerator)
and a few negative examples (denominator)
Negative sampling loss² [Goldberg and Levy 2014]:
$\log \sigma(v_c u_w^T) + \sum_{w'} \log \sigma(-v_c u_{w'}^T)$
where the sum runs over $n_{\text{neg}}$ words w' not in the window
σ: sigmoid ($\log \sigma(z) = -\log(1 + \exp(-z))$)
¹ Related to noise-contrastive estimation, which avoids computing costly
normalizations in likelihoods [Gutmann and Hyvärinen 2010]
² Related to a matrix factorization of mutual information in word occurrence
[Levy and Goldberg 2014]
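The negative-sampling term for one (word, context) pair can be written in a few lines of NumPy. This is a sketch under assumptions: the function names are hypothetical, the embeddings are random, and the quantity is written as the objective to maximize, with a numerically stable log-sigmoid.

```python
import numpy as np

def log_sigmoid(z):
    """log sigma(z) = -log(1 + exp(-z)), computed stably with logaddexp."""
    return -np.logaddexp(0.0, -z)

def negative_sampling_objective(u_w, v_c, U_neg):
    """Objective for one positive pair (w, c) and a few negative words.

    u_w: (k,) embedding of the word in the window of context c
    v_c: (k,) embedding of the context
    U_neg: (n_neg, k) embeddings of words drawn outside the window
    """
    positive = log_sigmoid(v_c @ u_w)
    negative = np.sum(log_sigmoid(-(U_neg @ v_c)))
    return positive + negative

rng = np.random.default_rng(0)
k, n_neg = 4, 3
value = negative_sampling_objective(rng.normal(size=k), rng.normal(size=k),
                                    rng.normal(size=(n_neg, k)))
print(value)
```

The cost per pair is proportional to the number of negative words rather than to the vocabulary size, which is the point of the approximation.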
103. Beyond natural language: metric learning
Triplet loss
For a an “anchor”, b close to a, c far from a:
$\log \sigma(v_a^T u_b) - \log \sigma(v_a^T u_c)$
Quadruplet loss [Chen... 2017]
For a and b close by, c and d far apart:
$\log \sigma(v_a^T u_b) - \log \sigma(v_c^T u_d)$
In practice: draw¹ randomly (a, b, c) or (a, b, c, d)
Metric learning: [Bellet... 2013]
Learning embeddings with weak supervision
¹ Many strategies, eg “hard negative mining”; choosing one requires a good test set and
metric, as with SGD hyperparameters.
105. Embedding entities in knowledge graphs
Structured (graph) representation of human knowledge
eg dbpedia, Yago
Learning embeddings of entities $\{e_i\}$ and relations $\{r_j\}$:
$e_a \sim e_b + r_c$
a model of the relation
Then triplet / quadruplet loss
Reuse existing: conceptnet.io
[Bordes... 2013, Wang... 2017]
106. The value of simple models
Risk of invisible overfit during search for hyperparameters
and models
Complex models call for a clear utility measure with low
measurement error
Many reliable labels
Matrix factorization models¹: 2 hyper-parameters:
Dimensionality k, regularization λ
Set them to optimize representations for supervised problems
¹ Using majorization-minimization approaches to avoid learning rates
108. Summary
Discrete entities lead to counting occurrences
⇒ Poisson and logistic loss (ugly logs in equations)
Word & entity embeddings
Factorization of co-occurrences in a notion of context
more generally: metric learning
Limited-data settings:
Avoid negative-sampling models (hyper-parameters)
Try to reuse representations (fasttext, conceptnet.io)
109. 3 Fisher kernels
What if the objects studied do not naturally
live in a vector space?
eg graphs of varying number of nodes
111. Learning with Kernels [Schölkopf and Smola 2001]
Kernels
A kernel K is a function: $X \times X \to \mathbb{R}^+$
positive symmetric
It captures similarity between observations
Building functions with kernels
on the training data:
$K_i \overset{\text{def}}{=} K(x_i, \cdot)$, $i \in \text{train}$
prediction function²:
$f(x) = \sum_{i \in \text{train}} w_i K_i(x)$
² Benefits of this formulation: i) non-linear predictor trained with linear
problem; ii) expressiveness that increases with amount of training data
112. Feature maps [Schölkopf and Smola 2001]
Drawbacks of kernels
Compute cost $O(n^2)$
Representations not explicit
$f(x) = \sum_{i \in \text{train}} w_i K_i(x)$
As K is symmetric positive¹,
$\exists \phi: X \to \mathbb{R}^d$, such that $\forall x, x'$: $K(x, x') = \phi(x)^T \phi(x')$
φ is a “feature map”
f(x) is a linear function of φ(x)
but d can be ∞
⇒ Approximate φ
¹ Think of it as a generalization of the Cholesky decomposition
113. Nyström approximate feature maps
[Drineas and Mahoney 2005]
On a random subset of the training data:
$G \overset{\text{def}}{=} \begin{pmatrix} K(x_1, x_1) & \dots & K(x_1, x_m) \\ \vdots & \ddots & \vdots \\ K(x_m, x_1) & \dots & K(x_m, x_m) \end{pmatrix} \in \mathbb{R}^{m \times m}$
Let $L \in \mathbb{R}^{k \times m}$ be such that $L^T L$ is a rank-k approximation of $G^{-1}$: $L^T L \approx G^{-1}$
Feature map¹: $\phi_{\text{Nystrom}}(x) = [K(x_1, x), \dots, K(x_m, x)]\, L^T$
sklearn.kernel_approximation.Nystroem
See also: Random features [Rahimi and Recht 2008]
sklearn.kernel_approximation.RBFSampler
¹ Exercise: check that $\phi_{\text{Nystrom}}(x)^T \phi_{\text{Nystrom}}(x') \approx \phi(x)^T \phi(x')$ for x, x' in our subset.
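The Nyström construction, and the exercise in the footnote, can be sketched directly in NumPy. This is an illustration, not the scikit-learn implementation: the RBF kernel, its `gamma`, the subset size, and the helper names are assumptions. Taking k = m makes the identity on the subset exact up to rounding; a smaller k keeps only the leading eigenvectors.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma ||a - b||^2) for all rows of A and B."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
m = 15
subset = X[rng.choice(len(X), size=m, replace=False)]

# Gram matrix on the random subset, and its eigendecomposition
G = rbf_kernel(subset, subset)
evals, evecs = np.linalg.eigh(G)

# L such that L^T L is the rank-k approximation of G^{-1} (here k = m: exact inverse)
k = m
top = np.argsort(evals)[::-1][:k]
L = (evecs[:, top] / np.sqrt(evals[top])).T   # k x m

def nystrom_features(X_new):
    """phi(x) = [K(x_1, x), ..., K(x_m, x)] L^T, one row per sample."""
    return rbf_kernel(X_new, subset) @ L.T

Phi = nystrom_features(X)            # explicit features for all the data
Phi_subset = nystrom_features(subset)
# On the subset, inner products of features recover the kernel
print(np.allclose(Phi_subset @ Phi_subset.T, G, atol=1e-6))
```

A linear model fitted on `Phi` then approximates the kernel predictor at a cost linear in the number of samples.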
116. Parametric generative model
Consider a model of x parametrized by $w \in \mathbb{R}^k$:
$p(x) = P_w(x)$, log-likelihood $L_P \overset{\text{def}}{=} \log P_w$
Maximum likelihood estimates: $\hat{w} = \operatorname{argmax}_w L_P(x)$
Kullback-Leibler divergence
Natural distance¹ to another distribution
$KL(P|Q) = \mathbb{E}_P[L_P - L_Q]$
Goal:
Benefit from our model to build a representation
All models are wrong but some are useful
¹ Not a distance, technically, as not symmetric.
117. Local behavior of parametric models
Fisher information matrix
Negative expectation of the Hessian of L at w:
$I(w) \overset{\text{def}}{=} -\mathbb{E}\left[\frac{\partial^2}{\partial w^2} L(x|w)\right] \in \mathbb{R}^{k \times k}$
Order-2 approximation of KL divergence:
$KL(P_w | P_{w+\delta}) \approx \frac{1}{2} \delta^T I_w \delta$
$I_w$ also scales the covariance of the estimation
error on maximum-likelihood estimates of w
(Cramér-Rao bounds)
118. Fisher-Rao manifold (information geometry)
Order-2 approximation of $KL(P_w | P_{w+\delta}) \approx \frac{1}{2} \delta^T I_w \delta$
[Figure: KL close to w1; KL close to w2]
Non constant across the family of distributions
$\{P_w, w \in \mathbb{R}^k\}$
$\{P_w, w \in \mathbb{R}^k\}$ form a Riemannian
manifold, with I as the metric tensor
[Rao 1945]
121. Riemannian manifolds
Continuous geometry on curved spaces (eg the Earth)
Locally, but not globally, Euclidean
A Riemannian manifold $\mathcal{M}$ is a
differentiable space endowed
with a metric d that is locally
equivalent to a Euclidean vector
space:
for M and M' ∈ $\mathcal{M}$, if d(M, M') → 0, M and M' can
be mapped to elements of a vector space m, m'
such that $d(M, M') \sim m^T m'$
[Figure: tangent space $T_M$ at M, with the maps $\text{Log}_M$ and $\text{Exp}_M$ between the manifold and the tangent space]
Global structure: geodesic distance
122. Fisher Kernel [Jaakkola and Haussler 1999]
A kernel locally equivalent to the KL divergence
Builds upon the Fisher matrix
Create a feature map
⇒ Vector space where Euclidean distance ≈ KL
In practice:
1. Fit model $P_w$ on train data:
$\hat{w} \leftarrow \operatorname{argmax}_w \sum_{i \in \text{train}} L(x_i, w)$
2. Compute gradient on w of likelihood for $\hat{w}$:
$z_{\text{Fisher}}(x) = \nabla_w L(x, \hat{w}) \in \mathbb{R}^k$
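The two steps can be illustrated on the simplest parametric model. The sketch below assumes a univariate Gaussian with parameters w = (mean, variance), chosen only because its gradient has a closed form; the function names and simulated data are hypothetical. It uses the raw gradient as on the slide, without any rescaling by the Fisher matrix.

```python
import numpy as np

def fit_gaussian(x):
    """Step 1: maximum-likelihood estimates of the mean and variance."""
    return x.mean(), x.var()   # var with ddof=0 is the MLE

def fisher_features(x, mu, var):
    """Step 2: gradient of the log-likelihood with respect to (mu, var), per sample."""
    d_mu = (x - mu) / var
    d_var = ((x - mu) ** 2 - var) / (2 * var ** 2)
    return np.column_stack([d_mu, d_var])

rng = np.random.default_rng(0)
x_train = rng.normal(loc=2.0, scale=1.5, size=500)

mu_hat, var_hat = fit_gaussian(x_train)
Z = fisher_features(x_train, mu_hat, var_hat)   # one k = 2 dimensional vector per sample
print(Z.shape, Z.mean(axis=0))
```

At the maximum-likelihood estimate the features average to zero over the training data, since the gradient of the total log-likelihood vanishes there; each sample is described by how it would pull the parameters.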
124. Fisher Kernel applications
Text: TF-IDF [Elkan 2005]
Multinomial model of word appearance
Genomics [Jaakkola and Haussler 1999]
Hidden Markov model of DNA sequences
(variable-length sequences ⇒ encoding difficult)
Tree-structured data [Nicotra... 2004]
A transition model on the tree
Brain connectivity [Varoquaux... 2010]
Multivariate Gaussian model (covariances)
G Varoquaux 77
125. Summary
Kernels build prediction functions on similarities
Feature maps / kernel approximation capture the
corresponding representation
Fisher Kernels can go from likelihood to vector space
Very useful for non numeric objects
G Varoquaux 78
126. Limited-data settings
Reminder: your validation measure is intrinsically unreliable
(sampling noise)
Get more data
For instance acquiring data
on a related task, to learn
representations
Use simple models
Do not spend too much
time tweaking
[Figure: distribution of errors under a binomial law, by number of available samples: 1000: −2% to +2%; 300: −4% to +4%; 200: −5% to +5%; 100: −7% to +7%; 30: −15% to +12%]
[Varoquaux 2018]
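The order of magnitude of this sampling noise can be reproduced with a back-of-the-envelope computation. The sketch below uses the normal approximation to the binomial law, with an assumed true accuracy of 0.75 and a 95% level; both are illustrative choices, so the numbers will not match the figure exactly.

```python
import math

def accuracy_half_width(n_samples, accuracy=0.75, z=1.96):
    """Half-width of the approximate 95% interval on a measured accuracy.

    Normal approximation to the binomial law: z * sqrt(p (1 - p) / n).
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / n_samples)

for n in (30, 100, 200, 300, 1000):
    print(n, round(100 * accuracy_half_width(n), 1), "%")
```

The uncertainty shrinks only as the square root of the number of test samples, so differences of a few percent between models are within the noise for test sets of a few hundred samples.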
127. References I
A. Abraham, M. P. Milham, A. Di Martino, R. C. Craddock, D. Samaras,
B. Thirion, and G. Varoquaux. Deriving reproducible biomarkers
from multi-site resting-state data: an autism-based example.
NeuroImage, 147:736, 2017.
A. Achille and S. Soatto. Emergence of invariance and
disentanglement in deep representations. The Journal of Machine
Learning Research, 19(1):1947–1980, 2018.
A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for
feature vectors and structured data. arXiv preprint
arXiv:1306.6709, 2013.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation.
Journal of machine Learning research, 3(Jan):993–1022, 2003.
A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko.
Translating embeddings for modeling multi-relational data. In
Advances in Neural Information Processing Systems, pages
2787–2795, 2013.
D. Bzdok, M. Eickenberg, O. Grisel, B. Thirion, and G. Varoquaux.
Semi-supervised factored logistic regression for
high-dimensional neuroimaging data. In Advances in Neural
Information Processing Systems, page 3348, 2015.
J. Canny. Gap: A factor model for discrete data. In SIGIR, page 122,
2004.
J.-F. Cardoso. Dependence, correlation and gaussianity in
independent component analysis. Journal of Machine Learning
Research, 4:1177, 2003.
P. Cerda and G. Varoquaux. Encoding high-cardinality string
categorical variables. arXiv:1907.01860, 2019.
W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep
quadruplet network for person re-identification. In Proceedings
of the IEEE Conference on Computer Vision and Pattern
Recognition, page 403, 2017.
P. S. Dhillon, D. P. Foster, S. M. Kakade, and L. H. Ungar. A risk
comparison of ordinary least squares vs ridge regression. The
Journal of Machine Learning Research, 14:1505, 2013.
P. Drineas and M. W. Mahoney. On the Nyström method for
approximating a Gram matrix for improved kernel-based learning.
Journal of Machine Learning Research, 6:2153, 2005.
C. Elkan. Deriving tf-idf as a fisher kernel. In International
Symposium on String Processing and Information Retrieval, page
295, 2005.
Y. Goldberg and O. Levy. word2vec explained: Deriving mikolov et
al.’s negative-sampling word-embedding method. arXiv:1402.3722,
2014.
P. K. Gopalan, L. Charlin, and D. Blei. Content-based
recommendations with poisson factorization. In Advances in
Neural Information Processing Systems, page 3176, 2014.
M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new
estimation principle for unnormalized statistical models. In
Proceedings of the International Conference on Artificial
Intelligence and Statistics, page 297, 2010.
D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge
regression. Foundations of Computational Mathematics, 14, 2014.
A. Hyvärinen and E. Oja. Independent component analysis:
algorithms and applications. Neural networks, 13(4):411, 2000.
A. J. Izenman. Reduced-rank regression for the multivariate linear
model. Journal of multivariate analysis, 5:248, 1975.
T. Jaakkola and D. Haussler. Exploiting generative models in
discriminative classifiers. In Advances in neural information
processing systems, pages 487–493, 1999.
T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent
semantic analysis. Discourse processes, 25:259, 1998.
A. Lefevre, F. Bach, and C. Févotte. Online algorithms for
nonnegative matrix factorization with the itakura-saito
divergence. In Applications of Signal Processing to Audio and
Acoustics (WASPAA), page 313. IEEE, 2011.
O. Levy and Y. Goldberg. Neural word embedding as implicit matrix
factorization. In Advances in neural information processing
systems, page 2177, 2014.
J. Mairal. Stochastic majorization-minimization algorithms for
large-scale optimization. In Advances in Neural Information
Processing Systems, 2013.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix
factorization and sparse coding. Journal of Machine Learning
Research, 11:19–60, 2010.
J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and
vision processing. Foundations and Trends® in Computer
Graphics and Vision, 8(2-3):85–283, 2014.
S. Mallat. Understanding deep convolutional networks.
Philosophical Transactions of the Royal Society A, 374:20150203,
2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic
subsampling for factorizing huge matrices. IEEE Transactions on
Signal Processing, 66:113, 2017.
133. References VII
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Extracting
universal representations of cognition across brain-imaging
studies. arXiv preprint arXiv:1809.06035, 2018.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction problems.
ACM SIGKDD Explorations Newsletter, 3:27, 2001.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of
word representations in vector space. In ICLR Workshop Papers.
2013a.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems, page 3111, 2013b.
134. References VIII
G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of
linear regions of deep neural networks. In Advances in neural
information processing systems, page 2924, 2014.
L. Nicotra, A. Micheli, and A. Starita. Fisher kernel for tree structured
data. In 2004 IEEE International Joint Conference on Neural
Networks (IEEE Cat. No. 04CH37541), volume 3, pages 1917–1922.
IEEE, 2004.
E. Oyallon, E. Belilovsky, and S. Zagoruyko. Scaling the scattering
transform: Deep hybrid networks. In Proceedings of the IEEE
international conference on computer vision, page 5618, 2017.
J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for
word representation. In Proceedings of the 2014 conference on
empirical methods in natural language processing (EMNLP), page
1532, 2014.
135. References IX
M. Rahim, B. Thirion, D. Bzdok, I. Buvat, and G. Varoquaux. Joint
prediction of multiple scores captures better individual traits
from brain images. Neuroimage, 158:145–154, 2017a.
M. Rahim, B. Thirion, and G. Varoquaux. Multi-output predictions
from neuroimaging: assessing reduced-rank linear models. In
2017 International Workshop on Pattern Recognition in
Neuroimaging (PRNI), pages 1–4. IEEE, 2017b.
A. Rahimi and B. Recht. Random features for large-scale kernel
machines. In Advances in neural information processing systems,
pages 1177–1184, 2008.
C. Rao. Information and accuracy attainable in the estimation of
statistical parameters. Bull. Calcutta Math. Soc., 37:81, 1945.
136. References X
S. Rosset and R. J. Tibshirani. From fixed-x to random-x regression:
Bias-variance decompositions, covariance penalties, and
prediction error estimation. Journal of the American Statistical
Association, pages 1–14, 2018.
B. Schölkopf and A. J. Smola. Learning with kernels: support vector
machines, regularization, optimization, and beyond. MIT press,
2001.
G. Varoquaux. Cross-validation failure: small sample sizes lead to
large error bars. Neuroimage, 180:68–77, 2018.
G. Varoquaux, F. Baronnet, A. Kleinschmidt, P. Fillard, and B. Thirion.
Detection of brain functional-connectivity difference in
post-stroke patients using group-level covariance modeling. In
International Conference on Medical Image Computing and
Computer-Assisted Intervention, pages 200–208. Springer, 2010.
137. References XI
Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding:
A survey of approaches and applications. IEEE Transactions on
Knowledge and Data Engineering, 29(12):2724–2743, 2017.