These slides are a one-hour course on machine learning with non-curated data.
According to industry surveys, the number one hassle of data scientists is cleaning the data before analyzing it. Here, I survey the kinds of "dirtiness" that force time-consuming cleaning. We then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem will lead us to revisit classic statistical results in the setting of supervised learning.
Representation learning in limited-data settings - Gael Varoquaux
A 4-hour long didactic course on simple notions of representations and how to use them in limited-data settings:
- A supervised learning point of view, giving intuitions and math on what representations are and why they matter
- Building simple unsupervised learning models to extract representations: from matrix decompositions for signals to embeddings of entities
- Evaluating models in limited-data settings, often a bottleneck
This slide-deck was given as a course at the 2021 DeepLearn summer school.
MLSEV. Cluster Analysis and Anomaly Detection - BigML, Inc
Unsupervised Learning (Part I), by BigML:
- Cluster Analysis: Finding Similarities
- Anomaly Detection: Finding the Unusual
MLSEV 2019: 1st edition of the Machine Learning School in Seville, Spain.
Similarity encoding for learning on dirty categorical variables - Gael Varoquaux
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
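As a minimal sketch of the idea (the paper evaluates several string similarities; Jaccard overlap of 3-gram sets is one simple choice, and the prototype list here is made up for illustration):

```python
def ngrams(s, n=3):
    # character n-grams of a padded, lower-cased string
    s = " " + s.lower() + " "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard overlap of the two n-gram sets, in [0, 1]
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def similarity_encode(values, prototypes, n=3):
    # Each entry becomes its vector of similarities to prototype categories,
    # so near-duplicate spellings get near-identical feature vectors.
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]

prototypes = ["senior engineer", "accountant"]
rows = ["senior engineer", "senior engneer", "accountant"]  # note the typo
encoded = similarity_encode(rows, prototypes)
```

Unlike one-hot encoding, the misspelled "senior engneer" lands close to "senior engineer" in feature space rather than getting a column of its own.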
Evaluating machine learning models and their diagnostic value - Gael Varoquaux
Model evaluation is, in my opinion, the most overlooked step of the machine-learning pipeline. Reliably estimating a model's performance for a given purpose is crucial and difficult. In this talk, I first discuss choosing a metric informative for the application, stressing the importance of class prevalence in classification settings. I then discuss procedures to estimate generalization performance, drawing a distinction between evaluating a learning procedure and evaluating a prediction rule, and how to give confidence intervals on the performance estimates.
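One common way to attach a confidence interval to a performance estimate, sketched here for accuracy of a fixed prediction rule on a held-out test set (percentile bootstrap over test points; evaluating the whole learning procedure would instead need repeated splits, and the data below is made up):

```python
import random

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    # Percentile-bootstrap confidence interval on test-set accuracy:
    # resample the test points with replacement and recompute the score.
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in sample) / n)
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up test labels and predictions (accuracy 0.8 on 100 points)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 10
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1] * 10
lo, hi = bootstrap_ci(y_true, y_pred)
```

The width of the resulting interval makes plain how uncertain an accuracy figure on a small test set really is.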
Support Vector Machine - How Support Vector Machine works | SVM in Machine Le... - Simplilearn
This Support Vector Machine (SVM) presentation will help you understand the Support Vector Machine algorithm, a supervised machine learning algorithm that can be used for both classification and regression problems. This SVM presentation will help you learn where and when to use the SVM algorithm, how the algorithm works, what hyperplanes and support vectors are in SVM, how the distance margin helps in optimizing the hyperplane, kernel functions in SVM for data transformation, and the advantages of the SVM algorithm. At the end, we will also implement the Support Vector Machine algorithm in Python to differentiate crocodiles from alligators for a given dataset.
Below topics are explained in this Support Vector Machine presentation:
1. What is Machine Learning?
2. Why support vector machine?
3. What is support vector machine?
4. Understanding support vector machine
5. Advantages of support vector machine
6. Use case in Python
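The deck's Python use case relies on a library implementation; as a self-contained sketch of the underlying idea, here is a minimal linear SVM trained by stochastic sub-gradient descent on the hinge loss (Pegasos-style; the two clusters standing in for crocodiles and alligators are made up):

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    # Pegasos-style stochastic sub-gradient descent on the hinge loss.
    # A constant feature is appended so the bias is learned as part of w.
    # Labels y must be in {-1, +1}.
    rng = random.Random(seed)
    X = [x + [1.0] for x in X]
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]  # shrinkage: widens the margin
            if margin < 1:                          # hinge-loss violation
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    score = sum(wj * xj for wj, xj in zip(w, x + [1.0]))
    return 1 if score >= 0 else -1

# Two well-separated 2-feature clusters
X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]]
y = [-1, -1, -1, 1, 1, 1]
w = train_linear_svm(X, y)
```

The shrinkage step is what pushes toward the maximum-margin hyperplane; the hinge update only fires for points inside the margin, i.e. the support vectors.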
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that there is a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, Naive Bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems.
- - - - - - -
Expert Session delivered during the Workshop on Image Processing and Machine Learning for Pattern Recognition on 11th July 2016 at University Institute of Engineering and Technology, Chandigarh
Statistics And Probability Tutorial | Statistics And Probability for Data Sci... - Edureka!
YouTube Link: https://youtu.be/XcLO4f1i4Yo
** Data Science Certification using R: https://www.edureka.co/data-science **
This session on Statistics And Probability will cover all the fundamentals of stats and probability along with a practical demonstration in the R language.
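The session's demonstration is in R; the same fundamentals can be sketched with Python's standard library (the sample data is made up; NormalDist fits a simple probability model to it):

```python
from statistics import NormalDist, mean, stdev

# A made-up sample (e.g. delivery times in minutes)
data = [52, 55, 48, 61, 58, 50, 53, 57, 49, 56]
mu, sigma = mean(data), stdev(data)  # descriptive statistics

# Fit a normal model and query a probability: how likely is a value above 60?
model = NormalDist(mu, sigma)
p_above_60 = 1 - model.cdf(60)
```

This is the basic stats-to-probability move the session covers: summarize a sample, assume a distribution, then reason about unseen values under that assumption.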
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ... - Simplilearn
This Random Forest Algorithm presentation will explain how the Random Forest algorithm works in Machine Learning. By the end of this video, you will be able to understand what Machine Learning is, what a classification problem is, applications of Random Forest, why we need Random Forest, how it works with simple examples, and how to implement the Random Forest algorithm in Python.
Below are the topics covered in this Machine Learning Presentation:
1. What is Machine Learning?
2. Applications of Random Forest
3. What is Classification?
4. Why Random Forest?
5. Random Forest and Decision Tree
6. Comparing Random Forest and Regression
7. Use case - Iris Flower Analysis
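A hedged sketch of the core random forest idea behind the iris use case (bagging plus per-tree feature randomness, then a majority vote), using depth-1 trees and made-up iris-like measurements rather than a library implementation:

```python
import random

def fit_stump(X, y, rng):
    # One "tree" of the forest: a depth-1 split on a random feature of a
    # bootstrap sample. Bagging + feature randomness decorrelate the trees.
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
    feat = rng.randrange(len(X[0]))              # random feature
    labels = [y[i] for i in idx]
    majority = max(set(labels), key=labels.count)
    best = (n + 1, feat, X[idx[0]][feat], majority, majority)
    for thr in sorted({X[i][feat] for i in idx}):
        left = [y[i] for i in idx if X[i][feat] <= thr]
        right = [y[i] for i in idx if X[i][feat] > thr]
        if not left or not right:
            continue
        lmaj = max(set(left), key=left.count)
        rmaj = max(set(right), key=right.count)
        err = sum(v != lmaj for v in left) + sum(v != rmaj for v in right)
        if err < best[0]:
            best = (err, feat, thr, lmaj, rmaj)
    return best[1:]

def predict_forest(stumps, x):
    # majority vote across the ensemble
    votes = [lmaj if x[feat] <= thr else rmaj for feat, thr, lmaj, rmaj in stumps]
    return max(set(votes), key=votes.count)

# Made-up 2-feature measurements standing in for petal length/width
X = [[1.4, 0.2], [1.3, 0.2], [1.5, 0.3], [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]
y = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]
rng = random.Random(0)
forest = [fit_stump(X, y, rng) for _ in range(25)]
```

Any single stump is a weak, noisy classifier; averaging many decorrelated ones is what gives the forest its robustness.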
Introduction to Statistical Machine Learning - mahutte
This course provides a broad introduction to the methods and practice of statistical machine learning, which is concerned with the development of algorithms and techniques that learn from observed data by constructing stochastic models that can be used for making predictions and decisions. Topics covered include Bayesian inference and maximum likelihood modeling; regression, classification, density estimation, clustering, principal component analysis; parametric, semi-parametric, and non-parametric models; basis functions, neural networks, kernel methods, and graphical models; deterministic and stochastic optimization; overfitting, regularization, and validation.
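One of those topics, the interplay of overfitting, regularization, and validation, can be sketched with a 1-D nearest-neighbour classifier where the neighbourhood size k acts as the regularizer and a held-out validation set selects it (toy data with one deliberately mislabelled training point):

```python
def knn_predict(X_train, y_train, x, k):
    # classify x by majority vote among its k nearest training points
    nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))[:k]
    votes = [y_train[i] for i in nearest]
    return max(set(votes), key=votes.count)

def error_rate(X_train, y_train, X_val, y_val, k):
    wrong = sum(knn_predict(X_train, y_train, x, k) != t
                for x, t in zip(X_val, y_val))
    return wrong / len(X_val)

# 1-D training data with label noise: the point at x = 2 is mislabelled
X_train = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y_train = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
X_val = [0.4, 1.4, 2.4, 3.4, 5.6, 6.6, 7.6, 8.6]
y_val = [0, 0, 0, 0, 1, 1, 1, 1]

err_k1 = error_rate(X_train, y_train, X_val, y_val, k=1)  # memorizes the noise
err_k3 = error_rate(X_train, y_train, X_val, y_val, k=3)  # smoother rule
```

k = 1 reproduces the training data perfectly, noise included, and pays for it on the validation set; the smoother k = 3 rule generalizes better, which is exactly what validation is there to detect.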
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3: Preprocessing - Salah Amean
The chapter contains:
Data Preprocessing: An Overview,
Data Quality,
Major Tasks in Data Preprocessing,
Data Cleaning,
Data Integration,
Data Reduction,
Data Transformation and Data Discretization,
Summary.
YouTube Link: https://youtu.be/aGu0fbkHhek
** Data Science Master Program: https://www.edureka.co/masters-program/data-scientist-certification **
This Edureka PPT on "Data Science Full Course" provides end-to-end, detailed and comprehensive knowledge of Data Science. This Data Science PPT starts with the basics of Statistics and Probability, then moves to Machine Learning, and finally ends the journey with Deep Learning and AI. For the datasets and code discussed in this PPT, drop a comment.
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI... - Simplilearn
This Machine Learning Vs Deep Learning Vs Artificial Intelligence presentation will help you understand the differences between Machine Learning, Deep Learning and Artificial Intelligence, and how they are related to each other. The presentation will also cover what Machine Learning, Deep Learning, and Artificial Intelligence entail, how they work with the help of examples, and whether they really are all that different.
This Machine Learning Vs Deep Learning Vs Artificial Intelligence presentation will explain the topics listed below:
1. Artificial Intelligence example
2. Machine Learning example
3. Deep Learning example
4. Human Vs Artificial Intelligence
5. How Machine Learning works
6. How Deep Learning works
7. AI Vs Machine Learning Vs Deep Learning
8. AI with Machine Learning and Deep Learning
9. Real-life examples
10. Types of Artificial Intelligence
11. Types of Machine Learning
12. Comparing Machine Learning and Deep Learning
13. A glimpse into the future
- - - - - - - -
About Simplilearn Artificial Intelligence Engineer course:
What are the learning objectives of this Artificial Intelligence Course?
By the end of this Artificial Intelligence Course, you will be able to accomplish the following:
1. Design intelligent agents to solve real-world problems in search, games, machine learning, logic, constraint satisfaction problems, knowledge-based systems, probabilistic models, and agent decision making
2. Master TensorFlow by understanding the concepts of TensorFlow, the main functions, operations and the execution pipeline
3. Acquire a deep intuition of Machine Learning models by mastering the mathematical and heuristic aspects of Machine Learning
4. Implement Deep Learning algorithms, understand neural networks and traverse the layers of data abstraction which will empower you to understand data like never before
5. Comprehend and correlate between theoretical concepts and practical aspects of Machine Learning
6. Master and comprehend advanced topics like convolutional neural networks, recurrent neural networks, training deep networks, high-level interfaces
- - - - - -
Why be an Artificial Intelligence Engineer?
1. The average salary for a professional with an AI certification is $110k a year in the USA, according to Indeed.com. The need for AI specialists exists in just about every field as companies seek to give computers the ability to think, learn, and adapt
2. In India, an engineer with an AI certification and minimal experience in the field commands a salary of Rs. 17 lakhs - Rs. 25 lakhs, while it can go up to Rs. 50 lakhs - Rs. 1 crore per annum for a professional with 8-10 years of experience
3. The scarcity of people with artificial intelligence training is such that one report says there are only around 10,000 such experts, and companies like Google and Facebook are paying salaries of over $500,000 per annum
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science... - Edureka!
This Edureka Random Forest tutorial will help you understand all the basics of the Random Forest machine learning algorithm. This tutorial is ideal for both beginners and professionals who want to learn or brush up on their Data Science concepts, and to learn random forest analysis along with examples. Below are the topics covered in this tutorial:
1) Introduction to Classification
2) Why Random Forest?
3) What is Random Forest?
4) Random Forest Use Cases
5) How Random Forest Works?
6) Demo in R: Diabetes Prevention Use Case
You can also take a complete structured training, check out the details here: https://goo.gl/AfxwBc
A tutorial on machine learning to build prediction models with missing values.
The slides cover both theoretical results (statistical learning) and practical advice, with a focus on implementation in Python with scikit-learn.
This lecture was presented at the Remote Sensing, Uncertainty Quantification and a Theory of Data Systems Workshop, Cahill Center, California Institute of Technology.
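A minimal sketch of one strategy such tutorials discuss, imputation plus a missingness indicator, in plain Python (scikit-learn users would typically reach for SimpleImputer with add_indicator=True instead; the numbers here are made up):

```python
def impute_with_indicator(rows):
    # Columnwise mean imputation plus a missingness-indicator column per
    # feature: for supervised learning, *which* entries were missing can
    # itself be predictive, so the mask is kept as extra features.
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    out = []
    for r in rows:
        filled = [means[j] if r[j] is None else r[j] for j in range(n_cols)]
        mask = [1.0 if r[j] is None else 0.0 for j in range(n_cols)]
        out.append(filled + mask)
    return out

rows = [[1.0, 10.0], [None, 14.0], [3.0, None]]
encoded = impute_with_indicator(rows)
```

The resulting table is fully numeric and can be fed to any supervised learner, while the indicator columns let the model exploit informative missingness.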
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging - Gael Varoquaux
This talk describes our efforts to bring easily usable machine learning to brain mapping. It covers both the questions that machine learning can answer and two software packages developed to facilitate machine learning and its application to neuroimaging.
Data Science: definition, modeling challenges and multidisciplinary applications - luizcelsojr
The talk describes the field of Data Science and gives examples of several multi-model (tables, text and graphs) and multidisciplinary (biology, nursing, education) applications.
Null-values imputation using different modification random forest algorithm - IAESIJAI
Today, the world lives in the era of information and data. It has therefore become vital to collect and keep data in databases to perform a set of processes and obtain essential details. The null-value problem appears through these processes; it significantly influences the behaviour of processes such as analysis and prediction and gives inaccurate outcomes. In this concern, the authors utilise a modified random forest technique to calculate the null values in datasets obtained from the University of California Irvine (UCI) machine learning repository. The datasets in this scenario are connectionist bench, phishing websites, breast cancer, ionosphere, and COVID-19. The modified random forest algorithm is based on three matters and handles up to three null values. The samples chosen are founded on the proposed less-redundancy bootstrap. Each tree has distinctive features depending on hybrid feature selection. The final effect is considered based on ranked voting for classification. This scenario found that the modified random forest algorithm achieved more suitable accuracy than the traditional algorithm, as it relied on four parameters and obtained sufficient accuracy in imputing the null value, improving by 9.5%, 6.5%, and 5.25% for one, two, and three null values in the same row of the datasets, respectively.
The consumer product landscape, particularly among e-commerce firms, includes a bevy of subscription-based business models. Internet and mobile phone subscriptions are now commonplace and joining the ranks are dietary supplements, meals, clothing, cosmetics and personal grooming products.
Standard metrics to diagnose a healthy consumer-brand relationship typically include customer purchase frequency and ultimately, retention of the customer demonstrated by regular purchases. If a brand notices that a customer isn’t purchasing, it may consider targeting the customer with discount offers or deploying a tailored messaging campaign in the hope that the customer will return and not “churn”.The churn diagnosis, however, becomes more complicated for subscription-based products, many of which offer multiple delivery frequencies and the ability to pause a subscription. Brands with subscription-based products need to have some reliable measure of churn propensity so they can further isolate the factors that lead to churn and preemptively identify at-risk customers.
Measuring mental health with machine learning and brain imagingGael Varoquaux
The study of mental health relies vastly on behavior testing and questionnaires. I discuss how
machine learning on large brain-imaging cohorts can open new alleys for markers of mental health. My
claims are that challenges are the amount of diagnosed conditions rather than heterogeneity of the
conditions and that we should turn to proxy labels. I discuss another fundamental challenge to this
agenda: the external and construct validity of brain-imaging based markers.
Better neuroimaging data processing: driven by evidence, open communities, an...Gael Varoquaux
My current thoughts about methods validity and design in brain imaging.
Data processing is a significant part of a neuroimaging study. The choice of corresponding methods and tools is crucial. I will give an opinionated view how on a path to building better data processing for neuroimaging. I will take examples on endeavors that I contributed to: defining standards for functional-connectivity analysis, the nilearn neuroimaging tool, the scikit-learn machine-learning toolbox -an industry standard with a million regular users. I will cover not only the technical process -statistics, signal processing, software engineering- but also the epistemology of methods development. Methods govern our results, they are more than a technical detail.
Functional-connectome biomarkers to meet clinical needs?Gael Varoquaux
Extracting Functional-Connectome Biomarkers with Machine Learning: a talk in the symposium on how do current predictive connectivity models meet clinician’s needs?
This talk is a bit provocative and first sets visions, before bringing a few technical suggestions
Atlases of cognition with large-scale human brain mappingGael Varoquaux
Cognitive neuroscience uses neuroimaging to identify brain systems engaged in specific cognitive tasks. However, linking unequivocally brain systems with cognitive functions is difficult: each task probes only a small number of facets of cognition, while brain systems are often engaged in many tasks. We develop a new approach to generate a functional atlas of cognition, demonstrating brain systems selectively associated with specific cognitive functions. This approach relies upon an ontology that defines specific cognitive functions and the relations between them, along with an analysis scheme tailored to this ontology. Using a database of thirty neuroimaging studies, we show that this approach provides a highly-specific atlas of mental functions, and that it can decode the mental processes engaged in new tasks.
Machine learning for functional connectomesGael Varoquaux
A tutorial on using machine-learning for functional-connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
Towards psychoinformatics with machine learning and brain imagingGael Varoquaux
Informatics in the psychological sciences brings fascinating challenges as mental processes or pathologies have fuzzy definition and are hard to quantify. Brain imaging brings rich data on the neural substrate of these concepts, yet it is a non trivial link.
The goal of this presentation is to put forward basic ideas of "psychoinformatics", using advanced processing on brain images to quantify better the elements of psychology.
It discusses how machine learning can bridge brain images to behavior: to describe better mental processes involved in brain activity, or to extract biomarkers of pathologies, individual traits, or cognition.
Simple representations for learning: factorizations and similarities Gael Varoquaux
Real-life data seldom comes in the ideal form for statistical learning.
This talk focuses on high-dimensional problems for signals and
discrete entities: when dealing with many, correlated, signals or
entities, it is useful to extract representations that capture these
correlations.
Matrix factorization models provide simple but powerful representations. They are used for recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However they entail large computing costs on very high-dimensional data, databases with many products or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing.
[1] Stochastic subsampling for factorizing huge matrices, A Mensch, J Mairal, B Thirion, G Varoquaux, IEEE Transactions on Signal Processing 66 (1), 113-128
[2] Similarity encoding for learning with dirty categorical variables. P Cerda, G Varoquaux, B Kégl Machine Learning (2018): 1-18
A tutorial on Machine Learning, with illustrations for MR imagingGael Varoquaux
Machine learning builds predictive models from the data. It is massive used on medical images these days, for a variety of applications ranging from segmentation to diagnosis.
This is an introductory tutorial to machine learning from giving intuitions on the statistical point of view. It introduce the methodology, the concepts behind the central models, the validation framework and some caveats to look for.
It also discusses some applications to drawing conclusions from brain imaging, and use these applications to highlight various technical aspects to running machine learning models on high-dimensional data such as medical imaging.
Computational practices for reproducible scienceGael Varoquaux
Reconciling bleeding-edge scientific results and reproducible research may seem a conundrum in our fast-paced high-pressure academic world. I discuss the practices that I found useful in computational work. At a high level, it is important to navigate the space between rapid experimentation and industrial-grade software development. I advocate adopting more and more software-engineering best practices as a project matures. I will also discuss how to turn the computational work into libraries, and to ensure the quality of the resulting libraries. And I conclude on how those libraries need to fit in the larger picture of the exercise of research to give better science.
Slides for my keynote at Scipy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving forward a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as doing Python libraries for science. What are the ingredients that the scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
Talk given at the OHBM 2017 education course.
I present the challenges and techniques to estimating meaningful brain functional connectomes from fMRI: why sparsity in inverse covariance leads to models that can interpreted as interactions between regions.
Then I discuss the limitations of sparse estimators and introduce shrinkage as an alternative. Finally, I discuss how to compare multiple functional connectomes.
Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turn around and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, that enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
Scientist meets web dev: how Python became the language of dataGael Varoquaux
Python started as a scripting language, but now it is the new trend everywhere and in particular for data science, the latest rage of computing. It didn’t get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist.
In this talk I give a personal perspective on the progress of the scientific Python ecosystem, from numerical physics to data mining. What made Python suitable for science; Why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine; And where this richness might lead us.
The talk will discuss low-level and high-level technical aspects, such as how the Python world makes it easy to move large chunks of number across code. It will touch upon current technical details that make scikit-learn and joblib stand.
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
Machine learning is geared towards prediction. However, aside diagnosis or prognosis in the clinics, cognitive neuroimaging strives for uncovering insights from the data, rather than minimizing prediction error. I review various inferences on brain function that have been drawn using pattern recognition techniques, focusing on decoding. In particular, I discuss using generalization as a test for information, multivariate analysis to interpret overlapping activation patterns, and decoding for principled reverse inference. I give each time a statistical view and a cognitive imaging view.
Talk giving at PRNI 2016 for the paper https://arxiv.org/pdf/1606.06439v1.pdf
Abstract — Spatially-sparse predictors are good models for
brain decoding: they give accurate predictions and their weight
maps are interpretable as they focus on a small number of
regions. However, the state of the art, based on total variation or
graph-net, is computationally costly. Here we introduce sparsity
in the local neighborhood of each voxel with social-sparsity, a
structured shrinkage operator. We find that, on brain imaging
classification problems, social-sparsity performs almost as well as
total-variation models and better than graph-net, for a fraction
of the computational cost. It also very clearly outlines predictive
regions. We give details of the model and the algorithm
Personal point of view on scikit-learn: past, present, and future.
This talks gives a bit of history, mentions exciting development, and a personal vision on the future.
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
We present an automated pipeline to learn predictive biomarkers from resting-state fMRI. We apply it to classifying autism on unseen sites, demonstrating the feasibility of biomarkers on weakly standardized functional imaging data.
We study the steps of the pipeline that are important to predict and can show that 1) the choice of atlas is the most important choice. Ideally the atlas should be made of functional regions learned from the data. 2) "tangent space" parametrization of the connectivity is the best performer.
We conclude on general recommendations for predictive biomarkers from resting-state fMRI
Brain maps from machine learning? Spatial regularizationsGael Varoquaux
Pattern Recognition for NeuroImaging (PR4NI)
We will show empirically how the pattern recognition techniques-commonly used, such as SVMs, provide low-quality brain maps, eventhough they give very good prediction accuracy. We will give an overview of recently developed techniques to impose priors on patterns particularly well suited to neuroimaging: selecting a small number of spatially-structured predictive brain regions. These tools reconcile machine learning with
brain mapping by giving maps more useful to draw neuroscientific conclusions. In addition, they are more robust to cross-individuals spatial variability and thus generalize well across subjects.
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
Scikit-learn is a popular machine learning tool. What can it do for you?Why you you want to use it? What can you do with it? Where is it going?In this talk, I will discuss why and how scikit-learn became popular. Iwill argue that it is successful because of its vision: it fills an important slot in the rich ecosystem of data science. I will demonstrate how scikit-learn makes predictive analysis easy and yet versatile.I will shed some light on our development process: how do we, as a community, ensure the quality and the growth of scikit-learn?
Succeeding in academia despite doing good_softwareGael Varoquaux
Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Keynote I gave at the Scipyconf Argentina 2014 conference
The advancement of science is a noble cause, and academia a fierce battlefield for tenure. Software is seen as a mere technicality, not worth a line on an academic CV. I claim that, on the opposite software, is the new medium of scientific method. I claim that succeeding in academia can be achieved not despite writing good software but via such an accomplishment. The key is to choose the right battles and to win them.
What is the emerging role of software in the scientific workflow? Which are the software challenges that can have impact? How to balance software quality assurance and the quick turn-around random-walk of research? What does "good design" mean for research software? What Python patterns can boost productivity and reuse in exploratory scientific computing?
I will try to answer these questions, based on my personal experience of growing up to become an academic Pythonista.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...ssuser7dcef0
Power plants release a large amount of water vapor into the
atmosphere through the stack. The flue gas can be a potential
source for obtaining much needed cooling water for a power
plant. If a power plant could recover and reuse a portion of this
moisture, it could reduce its total cooling water intake
requirement. One of the most practical way to recover water
from flue gas is to use a condensing heat exchanger. The power
plant could also recover latent heat due to condensation as well
as sensible heat due to lowering the flue gas exit temperature.
Additionally, harmful acids released from the stack can be
reduced in a condensing heat exchanger by acid condensation. reduced in a condensing heat exchanger by acid condensation.
Condensation of vapors in flue gas is a complicated
phenomenon since heat and mass transfer of water vapor and
various acids simultaneously occur in the presence of noncondensable
gases such as nitrogen and oxygen. Design of a
condenser depends on the knowledge and understanding of the
heat and mass transfer processes. A computer program for
numerical simulations of water (H2O) and sulfuric acid (H2SO4)
condensation in a flue gas condensing heat exchanger was
developed using MATLAB. Governing equations based on
mass and energy balances for the system were derived to
predict variables such as flue gas exit temperature, cooling
water outlet temperature, mole fraction and condensation rates
of water and sulfuric acid vapors. The equations were solved
using an iterative solution technique with calculations of heat
and mass transfer coefficients and physical properties.
4. Industry challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
On some dirty-data problems,
progress in machine learning
can ease the pain
5. Talk outline
1 What models cannot fit
2 Learning with missing values
3 Machine learning on dirty categories
G Varoquaux 3
6. 1 What models cannot fit
Outside of statistics’ comfort zone (X ∈ R^(n×p))
7. 1 The full life-cycle of a data-science project
Framing the domain question
Finding and understanding the data
Assembling and reshaping it
Designing an AI / statistical model?
Evaluating model performance
Inspecting the model for unwanted behavior
Bringing the model to stakeholders / production
?: what we think is cool
8. 1 Understanding the data, between human and machine
Age: 60, 26, 38, 139, 52, 86, 17, 48
Just numbers
9. 1 Understanding the data, between human and machine
Age: 60, 26, 38, ?? 139, 52, 86, 17, 48
Numbers with a meaning
A numerical column expresses a quantity, with a corresponding scale...
10. 1 Understanding the data, between human and machine
Age  Name
60   Bono
26   Justin Bieber
38   Giselle Knowles-Carter?
139  Pablo Picasso
52   Céline Dion
86   Léonard Cohen
17   Greta Thunberg
48   Justin Trudeau
?    Beyonce
A numerical column expresses a quantity, with a corresponding scale...
Recognized entries shed light on the numbers
11-12. 1 Understanding the data, between human and machine
Age  Name                     Born in  Activity
60   Bono                     Ireland  Singer
26   Justin Bieber            Canada   Singer
38   Giselle Knowles-Carter?  USA      Singer
139  Pablo Picasso            Spain    Painter
52   Céline Dion              Canada   Singer
86   Léonard Cohen            Canada   Singer
17   Greta Thunberg           Sweden   Activist
48   Justin Trudeau           Sweden   Politician
?    Beyonce
A numerical column expresses a quantity, with a corresponding scale...
Recognized entries shed light on the numbers
They can be used to bring in additional information (features)
And find errors
Knowledge representation, relational algebra
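The "find errors" step above can be as simple as a plausibility rule; a minimal pandas sketch on the slide's toy table (the 0–120-year range is an illustrative choice):

```python
import pandas as pd

# Toy table from the slide
people = pd.DataFrame({
    "Age": [60, 26, 38, 139, 52, 86, 17, 48],
    "Name": ["Bono", "Justin Bieber", "Giselle Knowles-Carter",
             "Pablo Picasso", "Céline Dion", "Léonard Cohen",
             "Greta Thunberg", "Justin Trudeau"],
})

# Ages outside a plausible human range are candidate errors
suspicious = people[(people["Age"] < 0) | (people["Age"] > 120)]
print(suspicious)
```

Real error detection goes further (cross-checking entries against external knowledge bases), but range checks already catch the 139-year-old painter.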
13. 1 Assembling data, of different natures and sources
Age  Name          Position
60   John Doe      Electrician
48   Jane Austen   Senior Professor
52   Jack Daniels  Professor
Position          Salary
Electrician       35 lizards
Professor         13 horses
Senior Professor  1 dragon
To model the link between age and salary, a join is necessary
Databases: to maintain consistency and minimize storage, data are normalized: multiple tables are used to minimize redundancy.
Statistics: needs samples and features, i.e. multiple observations of the same kind
⇒ data is denormalized in 1 table
Age  Name         Position          Salary      Coffees/day
60   John Doe     Electrician       35 lizards  2
48   Jane Austen  Senior Professor  1 dragon    128
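The join described above can be sketched with pandas (a minimal example reusing the slide's toy tables):

```python
import pandas as pd

# Two normalized tables, as a database would store them
employees = pd.DataFrame({
    "Age": [60, 48, 52],
    "Name": ["John Doe", "Jane Austen", "Jack Daniels"],
    "Position": ["Electrician", "Senior Professor", "Professor"],
})
salaries = pd.DataFrame({
    "Position": ["Electrician", "Professor", "Senior Professor"],
    "Salary": ["35 lizards", "13 horses", "1 dragon"],
})

# Denormalize into a single table of observations, as statistics needs
table = employees.merge(salaries, on="Position", how="left")
print(table)
```

`how="left"` keeps every employee even when a Position has no salary entry, surfacing join mismatches as NAs rather than silently dropping rows.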
14. 1 Aggregations – long vs wide tables
Person ID  Measure type    Value
12345      Blood Pressure  139
45673      Sugar Level     113
12345      Heart Rate      71
45673      Blood Pressure  84
Long table: flexible data representation
Person ID  Blood Pressure  Sugar Level  Heart Rate
12345      139             NA           71
45673      84              113          NA
Wide table: amenable to statistics on Person
Long to wide in Pandas: unstack, pivot
Also: count coffees per day per person from coffee-machine logs
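A minimal pandas sketch of the long-to-wide reshaping above, using `pivot` on the slide's toy table:

```python
import pandas as pd

# Long table: one row per (person, measurement)
long = pd.DataFrame({
    "Person ID": [12345, 45673, 12345, 45673],
    "Measure type": ["Blood Pressure", "Sugar Level", "Heart Rate", "Blood Pressure"],
    "Value": [139, 113, 71, 84],
})

# Wide table: one row per person, NaN where a measure was never taken
wide = long.pivot(index="Person ID", columns="Measure type", values="Value")
print(wide)
```

Note how the missing values of the wide table are created mechanically by the reshaping: nobody "lost" those measurements, they simply were never made.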
15-16. 1 Data wrangling: assembling unfamiliar sources
Relational algebra: joins, aggregations (# coffees a day), selections (finding the data)
Challenges: understanding the data store and domain logic; errors in the data (correspondences in names)
Age  Name            Country  Position        Coffees/day
48   Justin Trudeau  Canada   Prime minister  3000
NA   Gaël Varoquaux  NA       NA              NA
In health: assembling information across large electronic health records systems
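The aggregation example ("# coffees a day") can be sketched with a pandas groupby; the log contents here are invented for illustration:

```python
import pandas as pd

# Hypothetical coffee-machine event log: one row per coffee served
logs = pd.DataFrame({
    "name": ["Justin Trudeau", "Justin Trudeau", "Gaël Varoquaux", "Justin Trudeau"],
    "timestamp": pd.to_datetime([
        "2019-05-01 08:00", "2019-05-01 14:30",
        "2019-05-01 09:15", "2019-05-02 08:05",
    ]),
})

# Aggregation: number of coffees per person per day
per_day = logs.groupby(["name", logs["timestamp"].dt.date]).size()
print(per_day)
```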
17-20. 1 Systematic errors: data require external checks
Measurement biases:
Volunteer bias: more women volunteer in medical studies
Selection bias: healthy people seldom go to the hospital (causal inference)
Survival bias: data loss related to the process under study (survival models)
Partly addressed by machine-learning models for dataset shift (transfer learning), if you know the bias.
Brings us back to understanding the data
21. Data-science is much more than fitting a statistical model
Data require assembling information
Different data sources = different conventions
Measurements come with errors and biases
These challenges require domain knowledge and data wrangling
22. 2 Learning with missing values
[Josse... 2019]
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA          Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I
23-24. Why doesn’t the #$@! machine learning toolkit work?!
Machine learning models need entries in a vector space (or at least a metric space).
NA ∉ R
More than an implementation problem
Categorical entries are discrete anyhow
For missing values in categorical variables, create a special category “missing”.
Rest of talk: NA in numerical variables
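The special-category trick for categorical variables can be sketched with scikit-learn's SimpleImputer (a minimal example on an invented Position column):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

positions = pd.DataFrame({
    "Position": ["Electrician", np.nan, "Professor", np.nan],
})

# Treat missingness itself as one more discrete category
imputer = SimpleImputer(strategy="constant", fill_value="missing")
filled = imputer.fit_transform(positions)
print(filled.ravel())
```

The new "missing" category is then one-hot encoded like any other, so the learner can use the missingness pattern as a signal.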
25-28. 2 Classic statistics points of view
Model a) a distribution fθ for the complete data x
Model b) a random process gφ occluding entries (mask m)
Missing At Random situation (MAR)
For non-observed values, the probability of missingness does not depend on the non-observed value. Proper definition in [Josse... 2019]:
observed(x0, mi) = observed(xi, mi) ⇒ gφ(mi | x0) = gφ(mi | xi)
Theorem [Rubin 1976]: under MAR, maximizing the likelihood of the observed data while ignoring (marginalizing) the unobserved values gives the maximum likelihood of model a).
Missing Completely At Random situation (MCAR): missingness is independent from the data
Missing Not At Random situation (MNAR): missingness is not ignorable
[Scatter plots: Complete data, MCAR, and MNAR missingness patterns]
But
There isn’t always an unobserved value
Age of spouse of singles?
Machine-learning’s goal is not to maximize likelihoods
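A small numpy simulation in the spirit of the scatter plots above, contrasting MCAR and MNAR (the masking rules and parameters are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.multivariate_normal([0, 0], [[1, .7], [.7, 1]], size=1000)

# MCAR: entries of the 2nd column are masked independently of everything
mcar_mask = rng.rand(1000) < 0.3

# MNAR: the 2nd column is masked *because* of its own (unobserved) value
mnar_mask = x[:, 1] > 0.5

x_mcar = x.copy(); x_mcar[mcar_mask, 1] = np.nan
x_mnar = x.copy(); x_mnar[mnar_mask, 1] = np.nan

# Under MCAR the observed mean is unbiased; under MNAR it is not
print(np.nanmean(x_mcar[:, 1]), np.nanmean(x_mnar[:, 1]))
```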
29. 2 Imputation
Fill in information:
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA → 2000   Social Worker IV
M       07/16/2007  Police Officer III
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA → 2012   Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA → 2014   Library Assistant I
Large statistical literature
Procedures and results focused on in-sample settings
How about completing the test set with the train set?
What to do with the prediction target y?
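"Completing the test set with the train set" can be sketched with scikit-learn: the imputer is fit on the train set only, so test-set NAs are filled with train statistics:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [5.0]])

# Fit on train only: the imputation value is a *train* statistic
imputer = SimpleImputer(strategy="mean").fit(X_train)

# The test NA is filled with mean([1, 3]) = 2, not a test-set quantity
X_test_imputed = imputer.transform(X_test)
print(X_test_imputed)
```

This fit/transform split is what makes an imputation procedure usable out of sample, which the classic in-sample literature does not address.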
30. 2 Imputation and prediction with test-time missing values
Setting: y = f(x) + ε
Theorem [Josse et al. 2019]
f: trained predictor achieving the Bayes risk on fully-observed data
Conditional multiple imputation achieves the Bayes risk on a test set with missing data (in MAR settings):
f⋆_mult-imput(x̃) = E_{Xm | Xo = xo}[ f(Xm, xo) ]
Notations: x̃ ∈ (R ∪ {NA})^p: the data at hand; xo: observed values; xm: unobserved values
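A toy numerical check of this formula (not from the slides; the predictor f and the conditional distribution of the missing value are made up for illustration): the multiple-imputation prediction is approximated by averaging f over draws from Xm | Xo = xo.

```python
import random

def f(x1, x2):
    # Toy predictor, standing in for a model trained on complete data.
    return x1 + x2

def mult_imput_predict(x_o, n_draws=10_000, seed=0):
    """Approximate E_{Xm|Xo=xo}[f(xo, Xm)], assuming (for illustration)
    that the missing feature follows Xm | Xo = xo ~ N(0.5 * xo, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x_m = rng.gauss(0.5 * x_o, 1.0)  # draw a plausible missing value
        total += f(x_o, x_m)             # predict on the completed sample
    return total / n_draws

# For this linear f the exact value is x_o + 0.5 * x_o = 1.5 * x_o:
print(mult_imput_predict(2.0))  # ≈ 3.0
```

With a non-linear f the average over draws differs from plugging in a single imputed value, which is why *multiple* imputation is needed.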
33. 2 Imputation procedures that work out of sample
Mean imputation: a special case of univariate imputation
Replace NA by the mean of the feature
sklearn.impute.SimpleImputer
Conditional imputation
Model one feature as a function of the others
Possible implementation: iteratively predict each feature as a function of the others
Classic implementations in R: MICE, missForest
sklearn.impute.IterativeImputer
bad computational scalability
Classic statistics point of view
Mean imputation is disastrous, because it distorts the distribution
"Congeniality" conditions: a good imputation must preserve the data properties used by later analysis steps
[Figure: mean-imputed values all pile up at the feature mean, distorting the distribution]
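A minimal pure-Python sketch of the out-of-sample logic (names are illustrative; sklearn's SimpleImputer does this through the fit/transform API): the imputation statistic is learned on the train set only and reused unchanged on the test set.

```python
# Univariate mean imputation that works out of sample: the mean is
# computed from the training column and applied to any column later.
NA = None

def fit_mean(column):
    # Learn the imputation statistic from observed training values only.
    observed = [v for v in column if v is not NA]
    return sum(observed) / len(observed)

def transform(column, mean):
    # Fill NAs with the (train-set) mean; leave observed values alone.
    return [mean if v is NA else v for v in column]

train = [1.0, NA, 3.0, 5.0]
test = [NA, 2.0]

mean = fit_mean(train)          # learned on train only: (1+3+5)/3 = 3.0
print(transform(train, mean))   # [1.0, 3.0, 3.0, 5.0]
print(transform(test, mean))    # [3.0, 2.0]  (test NA filled with train mean)
```

Recomputing the mean on the test set would leak test statistics and change the encoding between train and test.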
35. 2 Constant imputation for supervised learning
Theorem [Josse et al. 2019]
For a powerful learner (universally consistent), imputing both train and test with the mean of the train set is consistent,
i.e. it converges to the best possible prediction
Intuition
The learner "recognizes" imputed entries and compensates at test time
Constant imputation breaks simple models (e.g. linear models) [Morvan et al. 2020]
36. 2 Imputation for supervised learning
Simulation: MCAR + Gradient boosting
[Figure: R² score vs. sample size (10² to 10⁴) for Mean and Iterative imputation; panels: convergence and small sample size]
Notebook: github – @nprost / supervised missing
Conclusions: IterativeImputer is useful for small sample sizes
38. 2 Imputation is not enough: predictive missingness
Pathological case [Josse et al. 2019]
y depends only on whether data is missing or not
e.g. tax-fraud detection
Theory: MNAR = "Missing Not At Random"
Imputing makes prediction impossible
Solution
Add a missingness indicator: an extra feature to predict from
sklearn.impute.SimpleImputer(add_indicator=True)
sklearn.impute.IterativeImputer(add_indicator=True)
Simulation: y depends indirectly on missingness (censoring)
[Figure: R² score vs. sample size for Mean, Mean+indicator, Iterative, Iterative+indicator; panels: convergence and small sample size]
Notebook: github – @nprost / supervised missing
Adding a mask is crucial
Iterative imputation can be detrimental
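The indicator can be sketched in pure Python (an illustrative stand-in for the columns that add_indicator=True appends to the imputed matrix):

```python
# Mean imputation plus a missingness indicator: each imputed value is
# paired with a 0/1 mask, so missingness itself becomes a feature the
# learner can use when it is predictive.
NA = None

def impute_with_indicator(column, mean):
    imputed = [mean if v is NA else v for v in column]
    indicator = [1 if v is NA else 0 for v in column]
    return list(zip(imputed, indicator))

print(impute_with_indicator([1.0, NA, 3.0], mean=2.0))
# [(1.0, 0), (2.0, 1), (3.0, 0)]
```

Without the mask, the imputed 2.0 is indistinguishable from an observed 2.0, so missingness-driven targets cannot be predicted.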
39. 2 Tree models with missing values
MIA (Missing Incorporated in Attribute) [Josse et al. 2019]
[Tree diagram: each split sends missing values to one of its sides, with "Yes/Missing" or "No/Missing" branches on tests such as x10 < −1.5, x2 < 2, x7 < 0.3, x1 < 0.5, ending in leaves such as "Predict +1.3"]
sklearn.ensemble.HistGradientBoostingClassifier
The learner readily handles missing values
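The MIA split rule can be sketched as a single regression stump (pure Python, illustrative; real implementations grow full trees greedily): each candidate split is evaluated twice, sending the missing values left or right, and the assignment with the lowest error is kept.

```python
# MIA-style decision stump: missing values (None) are routed to the
# side of the split that minimizes the squared error.
NA = None

def squared_error(ys):
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_mia_stump(xs, ys):
    """Return (threshold, send_missing_left, error) of the best split."""
    best = None
    thresholds = sorted({x for x in xs if x is not NA})
    for t in thresholds:
        for missing_left in (True, False):
            left, right = [], []
            for x, y in zip(xs, ys):
                go_left = (x is NA and missing_left) or (x is not NA and x <= t)
                (left if go_left else right).append(y)
            err = squared_error(left) + squared_error(right)
            if best is None or err < best[2]:
                best = (t, missing_left, err)
    return best

# Here missingness is informative: y is high exactly when x is missing.
xs = [1.0, 2.0, NA, 3.0, NA]
ys = [0.0, 0.0, 1.0, 0.0, 1.0]
print(best_mia_stump(xs, ys))  # (3.0, False, 0.0): missing goes right, perfect split
```

The stump isolates the missing entries on one side, so the tree predicts from missingness directly without any imputation.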
40. 2 Tree models with missing values (MCAR)
Simulation: MCAR + Gradient boosting
[Figure: R² score vs. sample size for Inside-trees, Mean, and Iterative; panels: convergence and small sample size]
Notebook: github – @nprost / supervised missing
41. 2 Tree models with missing values (censored)
Simulation: censoring + Gradient boosting
[Figure: R² score vs. sample size for Inside-trees, Mean, Iterative, Mean+indicator, Iterative+indicator; panels: convergence and small sample size]
Notebook: github – @nprost / supervised missing
42. 2 Neural networks with missing values
Gradient-based optimization of continuous models
Difficulty: half-discrete input space (R ∪ {NA})
Y = β⋆1 X1 + β⋆2 X2 + β⋆0, with cor(X1, X2) = 0.5.
If X2 is missing, the coefficient of X1 should compensate for the missingness of X2.
[Figure: up to 2^d sets of slopes for d features; the effect of X2 is either lost or accounted for by X1]
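To make the compensation concrete (a standard Gaussian conditional-expectation computation, not on the slide): for centered, unit-variance jointly Gaussian (X1, X2) with correlation ρ, E[X2 | X1] = ρ X1, so when X2 is missing (independently of the values) the best predictor becomes

```latex
\mathbb{E}[Y \mid X_1,\ X_2\ \text{missing}]
  = \beta^\star_0 + \beta^\star_1 X_1 + \beta^\star_2\, \mathbb{E}[X_2 \mid X_1]
  = \beta^\star_0 + (\beta^\star_1 + \rho\, \beta^\star_2)\, X_1 .
```

With ρ = 0.5 here, the slope on X1 shifts by 0.5 β⋆2; each of the 2^d missingness patterns induces its own set of slopes, which is the difficulty a continuous model must absorb.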
45. 2 NeuMiss network: an adapted neural architecture [Le Morvan et al. 2020]
Neural networks that approximate optimal predictors (functions of Σ⁻¹).
Tailored architecture that learns all slopes jointly
[Figure: test-set R² minus the Bayes rate vs. number of parameters, for deep MLPs (depth 1 to 9), wide MLPs (width 1d to 50d), and NeuMiss networks]
NeuMiss needs less data
Also suitable for MNAR settings
46. Learning with missing values
Imputation is motivated only in MAR settings
Rather than a sophisticated imputation, use a powerful supervised learner
sklearn's HistGradientBoostingClassifier readily models missing values
Can work in MNAR settings
A different regime from standard statistics
47. 3 Machine learning on dirty categories
[Cerda et al. 2018, Cerda and Varoquaux 2020]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
48. 3 Categorical entries in a statistical model
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
                        Master Police Officer   Social Worker IV   Police Officer II   ...
Master Police Officer   1                       0                  0
Social Worker IV        0                       1                  0
Police Officer III      0                       0                  0
Police Aide             0                       0                  0
...

One-hot encoding: X ∈ R^{n×p}
49. 3 Non-normalized categorical entries in a statistical model
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Breaks OneHotEncoder:
Overlapping categories
"Master Police Officer", "Police Officer III", "Police Officer II"...
High cardinality
400 unique entries in 10 000 rows
Rare categories
Only 1 "Architect III"
New categories in the test set
51. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]
High-cardinality categories
Represent each category by the average target y
Police Officer II → average salary of Police Officer II
[Figure: average employee salary (40 000 to 140 000) per position title, from Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I up to Manager III, Manager I, Manager II]
Embedding close-by categories with the same y can help build a simple decision function.
52. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]
High-cardinality categories
Represent each category by the average target y
Police Officer II → average salary of Police Officer II
dirty_cat: dirty-category software: http://dirty-cat.github.io

from dirty_cat import TargetEncoder
target_encoder = TargetEncoder()
transformed_values = target_encoder.fit_transform(df, y)
53. 3 Data curation Database normalization
Feature engineering
Employee Position Title   ⇒   Position | Rank
Master Police Officer     ⇒   Police Officer | Master
Social Worker III         ⇒   Social Worker | III
Police Officer II         ⇒   Police Officer | II
Social Worker II          ⇒   Social Worker | II
Police Officer III        ⇒   Police Officer | III
54. 3 Data curation Database normalization
Feature engineering
Employee Position Title   ⇒   Position | Rank
Master Police Officer     ⇒   Police Officer | Master
Social Worker III         ⇒   Social Worker | III
...
Merging entities Deduplication & record linkage
Output a “clean” database Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult without supervision
Potentially suboptimal:
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
55. 3 Data curation Database normalization
Hard to make automatic and turn-key
Harder than supervised learning
56. Our goal: supervised learning on dirty categories
The statistical question should inform curation:
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
57. 3 Adding similarities to one-hot encoding
One-hot encoding, X ∈ R^{n×p}:
          London   Londres   Paris
Londres   0        1         0
London    1        0         0
Paris     0        0         1
New categories? Link categories?

Similarity encoding [Cerda et al. 2018]:
          London   Londres   Paris
Londres   0.3      1.0       0.0
London    1.0      0.3       0.0
Paris     0.0      0.0       1.0
0.3 = string distance(Londres, London)
58. 3 Some string similarities
Levenshtein
Number of edits on one string needed to match the other
Jaro-Winkler
d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m − t)/(3m)
m: number of matching characters
t: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters, e.g. "London" → 3-grams "Lon", "ond", "ndo", "don"
similarity = (# n-grams in common) / (# n-grams in total)
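A minimal implementation of the n-gram similarity (pure Python; reading "# n-grams in common / # n-grams in total" as a Jaccard ratio on 3-gram sets, one common variant):

```python
# 3-gram string similarity: ratio of shared 3-grams to all 3-grams.
def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / len(g1 | g2)

print(sorted(ngrams("London")))  # ['Lon', 'don', 'ndo', 'ond']
print(round(ngram_similarity("Londres", "London"), 2))  # 0.29
```

"Londres" and "London" share the 3-grams "Lon" and "ond" out of 7 in total, giving 2/7 ≈ 0.29, close to the 0.3 entry in the similarity-encoding table above.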
59. 3 Python implementation: DirtyCat
dirty_cat: dirty-category software: http://dirty-cat.github.io

from dirty_cat import SimilarityEncoder
similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)
62. 3 Dirty categories blow up dimension
New words in natural language
X ∈ R^{n×p}, p is large
Statistical problems
Computational problems
64. 3 Tackling the high cardinality
Similarity encoding, one-hot encoding = prototype methods
How to choose a small number of prototypes?
The whole training set ⇒ huge dimensionality
The most frequent? Maybe the right prototypes ∉ training set:
"big cat"  "fat cat"
"big dog"  "fat dog"
Estimate prototypes
65. 3 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
66. 3 Modeling substrings [Cerda and Varoquaux 2020]
Model on sub-strings (GaP: Gamma-Poisson factorization)
Models strings as a combination of substrings
[Figure: binary membership matrix; rows: "police", "officer", "pol off", "polis", "policeman", "policier"; columns: character 3-grams "pol", "oli", "lic", "ice", ..., "cer"]
sklearn.feature_extraction.text:
CountVectorizer (analyzer: 'word', 'char', 'char_wb')
HashingVectorizer: fast, stateless
TfidfVectorizer: normalizes counts
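What a character-3-gram CountVectorizer builds can be sketched in pure Python (illustrative; the real class also handles vocabularies, sparse output, and the 'char_wb' word-boundary variant):

```python
# Build a (strings x 3-grams) count matrix from character 3-grams.
from collections import Counter

def char_ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

entries = ["police", "polis", "officer"]
counts = [Counter(char_ngrams(e)) for e in entries]
vocabulary = sorted(set(g for c in counts for g in c))
matrix = [[c[g] for g in vocabulary] for c in counts]
print(vocabulary)
print(matrix)
```

"police" and "polis" share the 3-grams "pol" and "oli", so their rows overlap even though the strings differ: this is the redundancy the latent-category model exploits.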
67. 3 Latent category model [Cerda and Varoquaux 2020]
Topic model on sub-strings (GaP: Gamma-Poisson factorization)
Models strings as a linear combination of substrings
[Figure: the (entries × 3-grams) count matrix for "police", "officer", "pol off", "polis", "policeman", "policier" factorizes into two matrices: entries × latent categories (what latent categories are in an entry) and latent categories × 3-grams (what substrings are in a latent category)]
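A minimal stand-in for this factorization (pure Python sketch of non-negative matrix factorization with multiplicative updates; the actual model is a Gamma-Poisson factorization, implemented in dirty_cat as GapEncoder):

```python
# Factorize a non-negative count matrix V ~ W @ H, with
# W = "latent categories per entry", H = "3-grams per latent category".
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    WH = matmul(W, H)
    return sum((v - w) ** 2 for rv, rw in zip(V, WH) for v, w in zip(rv, rw))

def nmf(V, k, n_iter=300, seed=0, eps=1e-9):
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(n_iter):  # classic multiplicative updates
        WtV = matmul(transpose(W), V)
        WtWH = matmul(transpose(W), matmul(W, H))
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(k)]
        VHt = matmul(V, transpose(H))
        WHHt = matmul(matmul(W, H), transpose(H))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H

# Toy "entries x 3-grams" counts with two obvious latent categories
# (rows 0-1 share one 3-gram pattern, rows 2-3 another).
V = [[2, 2, 0, 0], [3, 3, 0, 0], [0, 0, 1, 1], [0, 0, 2, 2]]
W, H = nmf(V, k=2)
print(round(frob_error(V, W, H), 3))  # small residual: V is close to W @ H
```

The non-negativity keeps both factors interpretable: each latent category is a bag of 3-grams, each entry a mixture of latent categories.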
68. 3 String models of latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
[Figure: loadings of job titles (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) on latent categories labeled by substrings such as "...brary", "...rator", "...alist", "...house", "...nager", "...unity", "...escue", "...ficer"]
69. 3 String models of latent categories [Cerda and Varoquaux 2020]
Inferring plausible feature names
[Figure: the same job titles, now with inferred feature names built from salient substrings, e.g. "...stant, library", "...ment, operator", "...on, specialist", "...ker, warehouse", "...ogram, manager", "...nic, community", "...escuer, rescue", "...ction, officer"]
70. 3 Data science with dirty categories
[Figure: permutation importances (0.0 to 0.2) of inferred feature names:
Information, Technology, Technologist
Officer, Office, Police
Liquor, Clerk, Store
School, Health, Room
Environmental, Telephone, Capital
Lieutenant, Captain, Chief
Income, Assistance, Compliance
Manager, Management, Property]
71. Learning does not require clean entities
Model continuous similarities across entries
Sub-string models can capture these
Requires a powerful statistical model (gradient-boosted trees)
Explainable machine-learning techniques give insight
72. @GaelVaroquaux
Machine learning with dirty data
What models cannot fit
Dirty categories
Missing values
Understanding and formatting data is unavoidable
Master these aspects
Powerful machine-learning models can cope with dirtiness
- If it is well represented (representing similarities and missingness)
- If they have supervision information
73. 4 References I
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty
categorical variables. Machine Learning, 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of
supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. Neumiss
networks: differential programming for supervised learning with missing values.
In Advances in Neural Information Processing Systems 33, 2020.
D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical
attributes in classification and prediction problems. ACM SIGKDD
Explorations Newsletter, 3(1):27–32, 2001.
74. 4 References II
M. L. Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor
on linearly-generated data with missing values: non consistency and solutions.
AISTATS, 2020.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.