Leveraging Machine Learning and AI to detect credit card fraud and suspicious transactions. The aim of this presentation is to help you improve your knowledge of Machine Learning and to start developing multiple families of algorithms in Python.
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019 - Rebecca Bilbro
Machine learning is ultimately a search for the best combination of features, algorithm, and hyperparameters that result in the best performing model. Oftentimes, this leads us to stay in our algorithmic comfort zones, or to resort to automated processes such as grid searches and random walks. Whether we stick to what we know or try many combinations, we are sometimes left wondering if we have actually succeeded.
By enhancing model selection with visual diagnostics, data scientists can inject human guidance to steer the search process. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance gives us a peek into the high-dimensional realm in which our models operate. As we continue to tune our models, trying to minimize both bias and variance, these glimpses allow us to be more strategic in our choices. The result is more effective modeling, speedier results, and greater understanding of underlying processes.
Visualization is an integral part of the data science workflow, but visual diagnostics are directly tied to machine learning transformers and models. The Yellowbrick library extends the scikit-learn API, providing a Visualizer object: an estimator that learns from data and produces a visualization as a result. In this tutorial, we will explore feature visualizers; visualizers for classification, clustering, and regression; and model analysis visualizers. We'll work through several examples and show how visual diagnostics steer model selection, making machine learning more informed and more effective.
Learning machine learning with Yellowbrick - Rebecca Bilbro
Yellowbrick is an open source Python library that provides visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. For teachers and students of machine learning, Yellowbrick can be used as a framework for teaching and understanding a large variety of algorithms and methods.
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018 - Codemotion
In machine learning, training large models on a massive amount of data usually improves results. Our customers report, however, that training such models and deploying them is either operationally prohibitive or outright impossible for them. We created a collection of machine learning algorithms that scale to any amount of data, including k-means clustering for data segmentation, factorization machines for recommendations, time-series forecasting, linear regression, topic modeling, and image classification. This talk will discuss those algorithms and explain where and how they can be used.
This presentation is aimed at fitting a Simple Linear Regression model in a Python program. The IDE used is Spyder. Screenshots from a working example are used for demonstration.
Data Science - Part III - EDA & Model Selection - Derek Kane
This lecture introduces EDA and the practice of understanding and working with data for machine learning and predictive analysis. It is designed for anyone who wants to understand how to work with data and does not get into the mathematics. We will discuss how to utilize summary statistics, diagnostic plots, data transformations, and variable selection techniques including principal component analysis, and finally get into the concept of model selection.
A Fairness-aware Machine Learning Interface for End-to-end Discrimination Dis... - wajrcs
A Fairness-aware Machine Learning Interface for End-to-end Discrimination Discovery and Mitigation
Niels Bantilan, New York, NY, https://arxiv.org/abs/1710.06921 (2017)
Author: Waqar Alamgir
https://github.com/waqar-alamgir/Fairness-aware-Machine-Learning
This material summarizes the Counterfactual Explanation session held as part of the "Explainable AI Planning!" (설명가능한 인공지능 기획!) track during the 18th round of 풀잎스쿨.
It was compiled from the paper, YouTube material, and the following resource:
https://christophm.github.io/interpretable-ml-book/
Data Science - Part V - Decision Trees & Random Forests - Derek Kane
This lecture provides an overview of decision tree machine learning algorithms and random forest ensemble techniques. The practical example includes diagnosing Type II diabetes and evaluating customer churn in the telecommunication industry.
Enhanced ID3 algorithm based on the weightage of the Attribute - AM Publications
The ID3 algorithm, a decision tree classification algorithm, is very popular due to its speed and simplicity of construction, but it has its own snags: it tends to choose attributes with many values, and practical complexities arise because of this. To solve this problem, the proposed algorithm exploits the importance of the attributes and classifies accordingly to produce effective rules. It uses attribute weightage, calculates the information gain for attributes with few values, and performs considerably better than the classical ID3 algorithm. The proposed algorithm is applied to real data: the selection of employees in a firm for appraisal based on a few important attributes.
Repurposing Classification & Regression Trees for Causal Research with High-D... - Galit Shmueli
Keynote at WOMBAT 2019 (Monash University) https://www.monash.edu/business/wombat2019
Abstract:
Studying causal effects and structures is central to research in management, social science, economics, and other areas, yet typical analysis methods are designed for low-dimensional data. Classification & Regression Trees ("trees") and their variants are popular predictive tools used in many machine learning applications and predictive research, as they are powerful in high-dimensional predictive scenarios. Yet trees are not commonly used in causal-explanatory research. In this talk I will describe adaptations of trees that we developed for tackling two causal-explanatory issues: self-selection and confounder detection. For self-selection, we developed a novel tree-based approach adjusting for observable self-selection bias in intervention studies, thereby creating a useful tool for analysis of observational impact studies as well as post-analysis of experimental data, one which scales for big data. For tackling confounders, we repurpose trees for automated detection of potential Simpson's paradoxes in data with few or many potential confounding variables, even with very large samples. I'll also show insights revealed when applying these trees to applications in eGov, labor economics, and healthcare.
Machine Learning and Real-World Applications - MachinePulse
This presentation was created by Ajay, Machine Learning Scientist at MachinePulse, to present at a Meetup on Jan. 30, 2015. These slides provide an overview of widely used machine learning algorithms. The slides conclude with examples of real world applications.
Ajay Ramaseshan is a Machine Learning Scientist at MachinePulse. He holds a Bachelor's degree in Computer Science from NITK Surathkal and a Master's in Machine Learning and Data Mining from Aalto University School of Science, Finland. He has extensive experience in the machine learning domain and has dealt with various real-world problems.
THE IMPLICATION OF STATISTICAL ANALYSIS AND FEATURE ENGINEERING FOR MODEL BUI... - ijcseit
Prediction ("presage") is central in the era of advanced statistics, where accuracy matters most. Matching algorithms with proper statistical implementation provides better outcomes in terms of accurate prediction from data sets. Prolific usage of algorithms leads to simpler mathematical models that require fewer manual calculations. Prediction is the essence of data science and machine learning applications, imparting control over situations. Any implementation requires proper feature extraction, which helps in proper model building and in turn assists precision. This paper is predominantly based on different statistical analyses, including correlation significance and proper categorical data distribution using feature engineering techniques, which unravel the accuracy of different machine learning models.
GENETIC ALGORITHM FOR FUNCTION APPROXIMATION: AN EXPERIMENTAL INVESTIGATION - ijaia
Function approximation is a popular engineering problem used in system identification and equation optimization. Due to the complex search space it requires, AI techniques have been used extensively to spot the best curves that match the real behavior of the system. Genetic algorithms are known for their fast convergence and their ability to find an optimal structure for the solution. We propose using a genetic algorithm as a function approximator, focusing on the polynomial form of the approximation. After implementing the algorithm, we report our results and compare them with the real function output.
This presentation is a case study including a diagnostic of the current Digital Marketing implementation at DuPont de Nemours Personal Protection Europe. It also includes opportunities and technical solutions to implement and monitor a Web 2.0 strategy in the future.
RFC's impact on project using Kolmogorov model and Python - Jean-Luc Caut
The aim of this PowerPoint is to use a Kolmogorov model to describe the impact of Requests For Change (RFCs) on the project life cycle.
The hidden goal is to ignite a spark of interest in Project Managers to go further than just learning the basic knowledge provided by PMI with its PMBoK.
Modeling Ebola Hemorrhagic Fever propagation in a modern city - Jean-Luc Caut
The study of epidemic diseases has always been a topic where biological issues mix with social ones.
The aim of this presentation was to model, in Python, the propagation of Ebola Hemorrhagic Fever in a modern city, using an SIR model based on a system of Ordinary Differential Equations, and also to produce an amazing Cellular Automaton.
The aim of this presentation is to help Digital Marketing managers implement an efficient e-marketing strategy in the particular and constrained environment of the pharmaceutical industry. This presentation can also be a good opportunity for Operational Marketing professionals jammed with the traditional 4Ps to realise that implementing a 360° marketing strategy is not only aligning Web and Marketing (or vice versa).
I took the opportunity of the success of my previous release to enhance and complete some slides in this V2.0. You will discover how a biopharmaceutical company (Celgene) took a good start after my advice in 2010 and how they implemented an e-marketing strategy through the evolution of their Internet portals and their connections to medical portals.
It seems to me that you can significantly improve your knowledge of e-marketing tactics with free tools, in order to audit and monitor your consumers' behaviour in the digital space, by reading my other presentation: Digital Marketing Management.
Performance Comparison of Machine Learning Algorithms - Dinusha Dilanka
This paper compares the performance of two classification algorithms. It is useful to differentiate algorithms based on computational performance rather than classification accuracy alone: although classification accuracy between the algorithms is similar, computational performance can differ significantly and can affect the final results. The objective of this paper is therefore to perform a comparative analysis of two machine learning algorithms, namely K-Nearest Neighbor classification and Logistic Regression. A large dataset of 7981 data points and 112 features is considered, and the performance of the above-mentioned algorithms is examined. The processing time and accuracy of the different machine learning techniques are estimated on the collected data set, using 60% for training and the remaining 40% for testing. The paper is organized as follows. Section I contains the introduction and background analysis of the research; Section II states the problem. Section III briefly describes our application, the data analysis process, the testing environment, and the methodology of our analysis. Section IV comprises the results of the two algorithms. Finally, the paper concludes with a discussion of future research directions that would eliminate the problems in the current methodology.
Is Machine Learning… a piece of cake? 10 minutes to give you a first taste of Machine Learning.
BeeBryte - Energy Intelligence & Automation
www.beebryte.com
This is an elaborate presentation on how to predict employee attrition using various machine learning models. This presentation will take you through the process of statistical model building using Python.
AI professionals use top machine learning algorithms to automate models that analyze larger and more complex data than was possible with older machine learning algorithms.
This article was published in the Software Developer's Journal's February edition.
It describes the use of the MapReduce paradigm to design clustering algorithms and explains three such algorithms:
- K-Means Clustering
- Canopy Clustering
- MinHash Clustering
This brief work is aimed at the basics of data science and model building, with a focus on implementation on a fairly sizable dataset. It covers cleaning the data, visualization, EDA, feature scaling, feature normalization, k-nearest neighbors, logistic regression, random forests, and cross-validation, without delving too deep into any of them, giving a new learner a start.
In Machine Learning in Credit Risk Modeling, we provide an explanation of the main Machine Learning models used in James so that Efficiency does not come at the expense of Explainability.
(Contact Yvan De Munck for more info or to receive other and future updates on the subject @yvandemunck or yvan@james.finance)
Deep dive into the mathematics and algorithms of neural nets. Covers the sigmoid activation function, cross-entropy loss function, gradient descent and the derivatives used in back propagation.
Quantitative Data Analysis: Reliability Analysis (Cronbach Alpha), Common Method... - 2023240532
Quantitative Data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to grow and the supply landscape to evolve, facilitated through institutional investment rotation out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
4. Machine Learning is a subfield of Computer Science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.
In 1959, Arthur Samuel defined machine learning as a:
"Field of study that gives computers the ability to learn without being explicitly programmed."
Let's go further and have a look at what is hidden behind the scenes.
5. Machine Learning is often used to build predictive models by extracting patterns from large datasets. These models are used in predictive data analytics applications including price prediction, risk assessment, predicting customer behavior, and document classification.
This presentation offers a detailed and focused treatment of one of the most important machine learning approaches used in predictive data analytics, covering both theoretical concepts and practical applications. Technical and mathematical material is augmented with explanatory worked examples developed in Python, in order to illustrate the application of these models in the financial business context.
6. Machine Learning tasks are typically classified into three broad categories, depending on the nature of the learning "signal" or "feedback" available to a learning system. These are:
Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
Reinforcement learning: A computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a vehicle), without a teacher explicitly telling it whether it has come close to that goal. Another example is learning to play a game by playing against an opponent.
7. Visualizing the important characteristics of a dataset
Exploratory Data Analysis (EDA) is an important and recommended first step prior to the training of a machine learning model.
First, we will create a scatterplot matrix that allows us to visualize the pair-wise correlations between the different features of this dataset in one place.
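As a sketch of how such a scatterplot matrix might be produced (the column names are assumptions taken from the UCI Housing documentation, since the slide does not list them):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Column names assumed from the UCI Housing documentation.
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
        'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'housing/housing.data', header=None, sep=r'\s+', names=cols)

# Pair-wise scatterplots for an illustrative subset of the features.
sns.pairplot(df[['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']], height=2.0)
plt.show()
```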
8. Correlation Matrix
To quantify the linear relationship between the features, we will now create a correlation matrix. The correlation matrix is a square matrix that contains the Pearson product-moment correlation coefficients (often abbreviated as Pearson's r), which measure the linear dependence between pairs of features.
For example, we can see that there is a linear relationship between RM (the number of rooms) and the housing prices MEDV, or between NOX emissions and the proportion of industrial land (INDUS).
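A minimal way to compute and plot this correlation matrix, reusing the df loaded above (the five-feature subset is illustrative):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson's r between each pair of the selected features.
features = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
cm = np.corrcoef(df[features].values.T)

sns.heatmap(cm, annot=True, fmt='.2f', square=True,
            xticklabels=features, yticklabels=features)
plt.show()
```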
10. Supervised learning is the machine learning task of inferring a function from labeled training data. The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.
11. Learning Process
- Determine the type of training examples.
- Gather a training set.
- Determine the input feature representation of the learned function.
- Determine the structure of the learned function and the corresponding learning algorithm.
- Run the learning algorithm on the gathered training set.
- Evaluate the accuracy of the learned function.
12. A potential use of a supervised learning model is classification.
The Iris dataset is a classic example in the field of machine learning. It contains the measurements of 150 iris flowers from three different species: Setosa, Versicolor, and Virginica. Here, each flower sample represents one row in our data set, and the flower measurements in centimeters are stored as columns, which we also call the features of the dataset.
A quick look at our dataset allows us to notice that petal length and petal width could be good candidates for our classification. This step is called dimensionality reduction of our feature space; its main advantage is that the learning algorithm will run much faster.
Data set is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
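A short sketch of loading the dataset and keeping only the two petal features singled out above (here via scikit-learn's bundled copy rather than the URL):

```python
from sklearn import datasets

# Keep only petal length and petal width (columns 2 and 3).
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]   # petal length, petal width (cm)
y = iris.target            # 0 = Setosa, 1 = Versicolor, 2 = Virginica
print(X.shape, sorted(set(y)))
```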
13. Many different machine learning algorithms have been developed to solve different problem tasks.
An important point that can be summarized from David Wolpert's famous No Free Lunch Theorems is that we can't get learning "for free" (The Lack of A Priori Distinctions Between Learning Algorithms, D.H. Wolpert, 1996; No Free Lunch Theorems for Optimization, D.H. Wolpert and W.G. Macready, 1997). For example, each classification algorithm has its inherent biases, and no single classification model enjoys superiority if we don't make any assumptions about the task. In practice, it is therefore essential to compare at least a handful of different algorithms in order to train and select the best performing model.
"Would you tell me, please, which way I ought to go from here?" said Alice.
"That depends a good deal on where you want to get to," said the Cat.
(Alice in Wonderland, Lewis Carroll)
14. Linear classification model: Logistic Regression and conditional probabilities
Logistic regression is one of the most widely used algorithms for classification in industry. It is very easy to implement and performs very well on linearly separable classes.
To explain the idea behind logistic regression as a probabilistic model, let's first introduce the odds ratio, the odds in favor of a particular event: p / (1 - p). The term positive event refers to the event that we want to predict, e.g. the probability that a patient has a certain disease. We can then further define the logit function, which is simply the logarithm of the odds ratio, where p stands for the probability of the positive event:
logit(p) = log[ p / (1 - p) ]
The logit function takes input values in the range 0 to 1 and transforms them to values over the entire real number range, which we can use to express a linear relationship between feature values and the log-odds:
logit(P(y = 1 | x)) = w0·x0 + w1·x1 + ... + wm·xm = wᵀx
15. We are then interested in predicting the probability that a certain sample belongs to a particular class, which is the inverse form of the logit function. It is also called the logistic function, sometimes simply abbreviated as the sigmoid function due to its characteristic S-shape:
φ(z) = 1 / (1 + e^(-z))
Here, z is the net input, that is, the linear combination of weights and sample features, and can be calculated as z = wᵀx.
The output of the sigmoid function is then interpreted as the probability of a particular sample belonging to class 1, given its features x parameterized by the weights w.
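A small sketch of the sigmoid, matching the formulas above:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # Logistic (sigmoid) function: maps the net input z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-7, 7, 0.1)
plt.plot(z, sigmoid(z))
plt.axvline(0.0, color='k')   # phi(0) = 0.5
plt.xlabel('z')
plt.ylabel('phi(z)')
plt.show()
```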
16. If we compute φ(z) = 0.8 for a particular flower sample, it means that the chance that this sample is an Iris-Versicolor flower is 80 percent. Similarly, the probability that this flower is an Iris-Setosa flower can be calculated as 1 - 0.8 = 0.2, or 20 percent.
The predicted probability can then simply be converted into a binary outcome via a quantizer.
18. What is a good classifier?
Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level. A well calibrated (binary) classifier should classify the samples such that, among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class.
LogisticRegression returns well calibrated predictions, as it directly optimizes log-loss.
GaussianNaiveBayes tends to push probabilities to 0 or 1. This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset, which contains 2 redundant features.
RandomForestClassifier shows the opposite behavior: errors caused by variance tend to be one-sided near zero and one. We observe this effect most strongly with random forests because the base-level trees have relatively high variance due to feature subsetting.
Support Vector Classification (SVC) shows an even more sigmoid curve than the RandomForestClassifier, which is typical for maximum-margin methods that focus on hard samples close to the decision boundary (the support vectors).
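One way to check calibration along these lines, sketched on a synthetic problem (the dataset and its 2 redundant features are stand-ins for the one used on the slide):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem with 2 redundant features.
X_cal, y_cal = make_classification(n_samples=2000, n_features=20,
                                   n_redundant=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_cal, y_cal, test_size=0.5,
                                          random_state=42)

prob = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Fraction of actual positives per predicted-probability bin; a well
# calibrated classifier stays close to the diagonal.
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f'predicted ~{mp:.2f} -> observed {fp:.2f}')
```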
20. Example with a Logistic Regression classifier:
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification, or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
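A minimal sketch of fitting such a classifier to the Iris petal features from earlier (the hyperparameters are illustrative, not from the slide):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X, y: petal length/width and species labels from the Iris example.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

sc = StandardScaler().fit(X_train)
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(sc.transform(X_train), y_train)

# Class-membership probabilities for the first test sample.
print(lr.predict_proba(sc.transform(X_test[:1])))
print('test accuracy: %.2f' % lr.score(sc.transform(X_test), y_test))
```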
23. Simple Least Squares model
To see our Linear Regression model in action, let's use the RM (number of rooms) variable from the Housing Data Set as the explanatory variable to train a model that can predict MEDV (the housing prices).
As we can see in the resulting plot, the linear regression line reflects the general trend that house prices tend to increase with the number of rooms.
Data set is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
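A sketch of that fit, reusing the Housing DataFrame df loaded earlier:

```python
from sklearn.linear_model import LinearRegression

# RM (number of rooms) as the explanatory variable for MEDV.
X_rm = df[['RM']].values
y_medv = df['MEDV'].values

slr = LinearRegression().fit(X_rm, y_medv)
print('slope:     %.3f' % slr.coef_[0])
print('intercept: %.3f' % slr.intercept_)
```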
24. Regression model wrapped in the RANSAC algorithm
Linear regression models can be heavily impacted by the presence of outliers. In certain situations, a very small subset of our data can have a big effect on the estimated model coefficients.
As an alternative to throwing out outliers, we will look at a robust method of regression using the RANdom SAmple Consensus (RANSAC) algorithm, which fits a regression model to a subset of the data, the so-called inliers.
Using RANSAC, we don't know if this approach has a positive effect on the predictive performance for unseen data. Thus, in the next section we will discuss how to evaluate a model for different approaches.
Data set is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
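A hedged sketch of wrapping the same linear model in RANSAC (the thresholds below are illustrative, not the slide's values):

```python
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Fit a line only to the inliers that RANSAC identifies.
ransac = RANSACRegressor(LinearRegression(),
                         max_trials=100,
                         min_samples=50,
                         residual_threshold=5.0,
                         random_state=0)
ransac.fit(X_rm, y_medv)

inliers = ransac.inlier_mask_
print('inliers: %d / %d' % (inliers.sum(), len(inliers)))
print('slope:   %.3f' % ransac.estimator_.coef_[0])
```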
26. Non-linear classification model using a Kernel SVM
SVMs enjoy high popularity among machine learning practitioners because they can be easily kernelized to solve nonlinear classification problems.
The basic idea behind kernel methods for dealing with such linearly inseparable data is to create nonlinear combinations of the original features and project them onto a higher-dimensional space, via a mapping function φ(), where the data becomes linearly separable.
To solve a nonlinear problem using an SVM, we transform the training data into a higher-dimensional feature space via the mapping function φ() and train a linear SVM model to classify the data in this new feature space. Then we can use the same mapping function φ() to transform new, unseen data and classify it using the linear SVM model.
27. As we can see in the resulting plot, the kernel SVM separates the data relatively well.
The γ parameter, which we set to gamma=0.1, can be understood as a cut-off parameter for the Gaussian sphere. If we increase the value of γ, we reduce the influence or reach of the individual training samples, which leads to a tighter decision boundary.
To get a better intuition for γ, let's apply the RBF kernel SVM to our Iris flower dataset:
28. To get a better intuition for the γ parameter, let's apply the RBF kernel SVM to our Iris flower dataset.
In the resulting plot, we can now see that the decision boundary around classes 0 and 1 is much tighter using a relatively large value of γ (100.0).
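A small sketch comparing the two gamma settings mentioned above, reusing the Iris train/test split from earlier (C=1.0 is an assumption):

```python
from sklearn.svm import SVC

for gamma in (0.1, 100.0):
    svm = SVC(kernel='rbf', C=1.0, gamma=gamma)
    svm.fit(X_train, y_train)
    print('gamma=%-6s train acc: %.2f  test acc: %.2f'
          % (gamma, svm.score(X_train, y_train), svm.score(X_test, y_test)))
```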
31. Decision Trees and non-linear relationships
To use a decision tree for regression, we will replace entropy as the impurity measure of a node t by the MSE. In the context of decision tree regression, the MSE is often also referred to as within-node variance, which is why the splitting criterion is also better known as variance reduction.
To see what the line fit of a decision tree looks like, let's use the DecisionTreeRegressor implemented in scikit-learn to model the nonlinear relationship between the MEDV and LSTAT variables.
Data set is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
32. Python code for a decision tree for regression (see the sketch below).
Data set is available at: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data
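Since the code screenshot itself is not in this transcript, here is a minimal reconstruction consistent with the slide (max_depth=3 is an assumption):

```python
from sklearn.tree import DecisionTreeRegressor

# Nonlinear relationship between LSTAT and MEDV, from the df above.
X_lstat = df[['LSTAT']].values
y_medv = df['MEDV'].values

tree = DecisionTreeRegressor(max_depth=3)
tree.fit(X_lstat, y_medv)
print(tree.predict([[10.0], [25.0]]))  # predicted MEDV at two LSTAT values
```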
34. When it comes to selecting among different machine learning algorithms, a recommended approach is nested cross-validation. Varma and Simon concluded that the true error of the estimate is almost unbiased relative to the test set when nested cross-validation is used (S. Varma and R. Simon, Bias in Error Estimation When Using Cross-validation for Model Selection, BMC Bioinformatics, 2006).
[Figure: description of the cross-validation process]
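A sketch of nested cross-validation in scikit-learn (the estimator, parameter grid, and fold counts are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Inner loop: grid search tunes C/gamma; outer loop: 5-fold estimate
# of the tuned model's generalization error.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
grid = {'svc__C': [0.1, 1.0, 10.0], 'svc__gamma': ['scale', 0.01]}
gs = GridSearchCV(pipe, grid, scoring='accuracy', cv=2)
scores = cross_val_score(gs, X_bc, y_bc, scoring='accuracy', cv=5)
print('nested CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```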
35. In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm.
Example with a cross-validation training model: assuming that class 1 (malignant) is the positive class in this example, our model correctly classified 71 of the samples that belong to class 0 (True Negatives) and 40 samples that belong to class 1 (True Positives), respectively. However, our model also incorrectly classified 2 samples from class 0 as class 1 (False Positives), which is a false alarm, and it predicted that 1 sample is benign although it is a malignant tumor (False Negative).
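A sketch of producing such a matrix (note that in scikit-learn's copy of the Breast Cancer Wisconsin data, label 0 is malignant, so the slide's class/label mapping is an assumption here):

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Reuse X_bc, y_bc loaded above.
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.2,
                                          random_state=1, stratify=y_bc)
y_pred = SVC(gamma='scale').fit(X_tr, y_tr).predict(X_te)

# Rows = true classes, columns = predicted classes.
print(confusion_matrix(y_true=y_te, y_pred=y_pred))
```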
36. The error can be understood as the sum of all false predictions divided by the number of total predictions:
ERR = (FP + FN) / (FP + FN + TP + TN)
The accuracy is calculated as the sum of correct predictions divided by the number of total predictions:
ACC = (TP + TN) / (FP + FN + TP + TN) = 1 - ERR
The true positive rate (TPR), false positive rate (FPR), and precision (PRE) are performance metrics that are especially useful for imbalanced class problems:
TPR = TP / (FN + TP),  FPR = FP / (FP + TN),  PRE = TP / (TP + FP)
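These definitions translate directly into code, taking TP, FP, TN, FN from the confusion matrix above:

```python
# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()

err = (fp + fn) / (tp + fp + tn + fn)   # error
acc = (tp + tn) / (tp + fp + tn + fn)   # accuracy = 1 - err
tpr = tp / (fn + tp)                    # true positive rate (recall)
fpr = fp / (fp + tn)                    # false positive rate
pre = tp / (tp + fp)                    # precision
print('ERR=%.3f ACC=%.3f TPR=%.3f FPR=%.3f PRE=%.3f'
      % (err, acc, tpr, fpr, pre))
```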
37. Receiver operator characteristic (ROC) graphs are useful tools for selecting classification models based on their performance with respect to the false positive and true positive rates, which are computed by shifting the decision threshold of the classifier.
The diagonal of an ROC graph can be interpreted as random guessing, and classification models that fall below the diagonal are considered worse than random guessing. A perfect classifier would fall into the top-left corner of the graph, with a true positive rate of 1 and a false positive rate of 0.
The next slide is a plot of a ROC curve for a classifier that only uses two features from the Breast Cancer Wisconsin dataset to predict whether a tumor is benign or malignant. Based on the ROC curve, we can also compute the area under the curve (AUC) to characterize the performance of a classification model.
38. The resulting ROC curve indicates that there is a certain degree of variance between the different folds, and the average ROC AUC (0.75) falls between a perfect score (1.0) and random guessing (0.5).
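A sketch of computing a single ROC curve and its AUC (which two features are used is an assumption; the slide does not say):

```python
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.svm import SVC

# Train on just the first two features of the Breast Cancer data.
clf = SVC(probability=True, gamma='scale').fit(X_tr[:, :2], y_tr)
score_pos = clf.predict_proba(X_te[:, :2])[:, 1]

fpr_c, tpr_c, thresholds = roc_curve(y_te, score_pos)
print('ROC AUC: %.2f' % roc_auc_score(y_te, score_pos))
```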
39. In this example we are going to use a Decision Tree, then a Random Forest model, in order to detect fraudulent use of a credit card.
A nonlinear model will better solve our problem: we assume that the effect of the amount is not linear, because the impact of the amount could depend on another variable, such as card use within 24h; or perhaps small and large charges are more likely to be fraudulent than charges with moderate amounts.
Let us import a .csv file with 89,393 transactions.
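A sketch of that import (the file name and the 'fraud' label column are placeholders; the slide only mentions a .csv with 89,393 transactions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df_tx = pd.read_csv('transactions.csv')   # hypothetical file name
X_tx = df_tx.drop(columns=['fraud'])      # hypothetical label column
y_tx = df_tx['fraud']

X_tr_tx, X_te_tx, y_tr_tx, y_te_tx = train_test_split(
    X_tx, y_tx, test_size=0.3, random_state=0, stratify=y_tx)
```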
40. In the following example, we have trained a Decision Tree with a sample of the training data, starting with a node and picking the split that maximizes the decrease in Gini impurity: 2·p·(1 - p).
41. Random forests have gained huge popularity in applications of machine learning during the last decade due to their good classification performance, scalability, and ease of use. Intuitively, a random forest can be considered an ensemble of decision trees. The idea behind ensemble learning is to combine weak learners to build a more robust model, a strong learner, that has a better generalization error and is less susceptible to overfitting. The random forest algorithm can be summarized in four simple steps:
1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement).
2. Grow a decision tree from the bootstrap sample. At each node: randomly select d features without replacement, then split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain.
3. Repeat steps 1 and 2 k times.
4. Aggregate the predictions of the trees to assign the class label by majority vote.
42. In the following example we have trained N trees, each on a (bootstrapped) sample of the training data.
At each split, we only consider a subset of the available features (on the order of the square root of the total number of features), thus reducing correlation among the trees. The final score is the average of the scores produced by each tree.
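A sketch of those steps with scikit-learn, reusing the hypothetical transaction split from above (100 trees is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features='sqrt' considers roughly sqrt(total # of features) at
# each split, which decorrelates the trees.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0, n_jobs=-1)
forest.fit(X_tr_tx, y_tr_tx)
print('test accuracy: %.3f' % forest.score(X_te_tx, y_te_tx))
```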
45. In this part we will discuss one of the most popular clustering algorithms, k-means, which is widely used in academia as well as in industry.
Clustering (or cluster analysis) is a technique that allows us to find groups of similar objects, objects that are more related to each other than to objects in other groups. Examples of business-oriented applications of clustering include the grouping of documents, music, and movies by different topics, or finding customers that share similar interests based on common purchase behaviors as a basis for recommendation engines.
In the following scatterplot, we can see that k-means placed the three centroids at the center of each sphere, which looks like a reasonable grouping given this dataset.
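A sketch of k-means on three synthetic spherical groups like those in the scatterplot (the blob parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5,
                        random_state=0)
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels = km.fit_predict(X_blobs)
print(km.cluster_centers_)   # one centroid per spherical group
```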
47. Hard clustering describes a family of algorithms where each sample in a dataset is assigned to exactly one cluster, as in the k-means algorithm that we discussed in the previous slide. In contrast, algorithms for soft clustering (sometimes also called fuzzy clustering) assign a sample to one or more clusters. A popular example of soft clustering is the fuzzy C-means (FCM) algorithm (also called soft k-means or fuzzy k-means).
As we can see in the following scatterplot, one of the centroids falls between two of the three spherical groupings of the sample points. Although the clustering does not look completely terrible, it is suboptimal.
48. Although we can't cover the vast number of different clustering algorithms here, let's at least introduce one more approach to clustering: Density-based Spatial Clustering of Applications with Noise (DBSCAN). The notion of density in DBSCAN is defined as the number of points within a specified radius ε.
In DBSCAN, a special label is assigned to each sample (point) using the following criteria:
- A point is considered a core point if at least a specified number (MinPts) of neighboring points fall within the specified radius ε.
- A border point is a point that has fewer neighbors than MinPts within ε, but lies within the ε radius of a core point.
- All other points that are neither core nor border points are considered noise points.
49. For a more illustrative example, let's create a new dataset of half-moon-shaped structures to compare k-means clustering, hierarchical clustering, and DBSCAN. We will start by using the k-means algorithm and complete linkage clustering to see whether one of those previously discussed clustering algorithms can successfully identify the half-moon shapes as separate clusters.
Based on the visualized clustering results, we can see that the k-means algorithm is unable to separate the two clusters, and the hierarchical clustering algorithm was challenged by those complex shapes.
50. The DBSCAN algorithm can successfully detect the half-moon shapes, which highlights one of the strengths of DBSCAN: clustering data of arbitrary shapes.
However, we should also note some of the disadvantages of DBSCAN. With an increasing number of features in the dataset, given a fixed-size training set, the negative effect of the curse of dimensionality increases. This is especially a problem if we are using the Euclidean distance metric. The curse of dimensionality is not unique to DBSCAN, however; it also affects other clustering algorithms that use the Euclidean distance metric, for example the k-means and hierarchical clustering algorithms.
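A sketch of the half-moon comparison described in the last two slides (eps and min_samples are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_moons)
ac = AgglomerativeClustering(n_clusters=2, linkage='complete').fit(X_moons)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)

# k-means and complete linkage cut across the moons; DBSCAN, being
# density-based, can follow their arbitrary shapes.
for name, model in (('k-means', km), ('complete linkage', ac),
                    ('DBSCAN', db)):
    print(name, sorted(set(model.labels_)))
```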