Choosing a
ML technique
to solve your need
RALUCA APOSTOL
Nestor
Co-founder & Chief Product Officer
Passion for data - science, analysis & interpretation
Our chat today :)
• What is Machine Learning?
• What are some popular approaches?
• How to think?
• How to build your model?
• What are the resources?
• Let’s get to action.
Artificial Intelligence
• Anything which is not natural and created by humans is artificial.
• Intelligence means ability to understand, reason, plan, adapt, etc.
• So any code, tech or algorithm that enable machine to mimic,
develop or demonstrate the human cognition or behavior is AI
Applications for AI
AI vs. ML
AI vs. ML
Connecting the dots …
Connecting the dots …
So what is Machine Learning?
• Machine learning is the field of study
that gives computers the ability to
learn without being explicitly
programmed.
• In simple term, Machine Learning
means making prediction based on
data.
The ML Mindset
“ML changes the way you think about a problem. The focus shifts
from a mathematical science to a natural science, running
experiments and using statistics, not logic, to analyze its results."
– Peter Norvig
Google Research Director
The ML Mindset
The ML Mindset
F(a,b) = a + b
a = 1; b = 2;
3
The ML Mindset
F(a,b) = a + b
a = 1; b = 2;
3
F(a,b) = x*a + y*b
a = 1; b = 2;
3
x = ?; y = ?
The steps
• Machine learning is part art and part science. :)
• There is no one solution or one approach that fits all.
• There are several factors that can affect your decision to choose a
machine learning algorithm.
The steps
Know your data
Before you start looking at different ML algorithms, you need to have a
clear picture of your data, your problem and your constraints.
Know your data
Before you start looking at different ML algorithms, you need to have a
clear picture of your data, your problem and your constraints.
Know your data
1. Look at Summary statistics and visualizations
• Percentages can help identify the range for most of the data
• Averages and medians can describe central tendency
• Correlations can indicate strong relationships
2. Visualize the data
• Box plots can identify outliers
• Density plots and histograms show the spread of data
• Scatter plots can describe bivariate relationships
Since the collected data may be in an undesired format, unorganized,
or extremely large, further steps are needed to enhance its quality. The
three common steps for preprocessing data are formatting, cleaning,
and sampling.
Clean your data
1. How do I deal with missing value?
• Missing data affects some models more than others.
• Even for models that handle missing data, they can be sensitive to it (missing data
for certain variables can result in poor predictions)
2. Does the data needs to be aggregated?
Clean your data
Clean your data
Clean your data
3. What do I do with outliers?
• Outliers can be very common in multidimensional data.
• Some models are less sensitive to outliers than others.
Usually tree models are less sensitive to the presence of
outliers. However regression models, or any model that
tries to use equations, could definitely be effected by
outliers
• Outliers can be the result of bad data collection, or they can
be legitimate extreme values.
Clean your data
Sampling
Augment your data
1. Feature engineering is the process of going from raw
data to data that is ready for modeling. It can serve
multiple purposes:
• Make the models easier to interpret (e.g. binning)
• Capture more complex relationships (e.g. NNs)
• Reduce data redundancy and dimensionality (e.g. PCA)
• Rescale variables (e.g. standardizing or normalizing)
2. Different models may have different feature engineering
requirements.
Categorize the problem
1. Categorize by input:
• If you have labelled data, it’s a supervised learning problem.
• If you have unlabelled data and want to find structure, it’s an unsupervised
learning problem.
• If you want to optimize an objective function by interacting with an environment,
it’s a reinforcement learning problem.
2. Categorize by output
• If the output of your model is a number, it’s a regression problem.
• If the output of your model is a class, it’s a classification problem.
• If the output of your model is a set of input groups, it’s a clustering problem.
• Do you want to detect an anomaly ? That’s anomaly detection.
Categorize the problem
Understand your constraints
• What is your data storage capacity? Depending on the storage capacity of your
system, you might not be able to store gigabytes of classification/regression
models or gigabytes of data to clusterize. This is the case, for instance, for
embedded systems.
• Does the prediction have to be fast? In real time applications, it is obviously very
important to have a prediction as fast as possible. For instance, in autonomous
driving, it’s important that the classification of road signs be as fast as possible to
avoid accidents.
• Does the learning have to be fast? In some circumstances, training models quickly
is necessary: sometimes, you need to rapidly update, on the fly, your model with a
different dataset.
Choose the algorithm
Identify the algorithms that are applicable and practical to implement using the tools
at your disposal.
Some of the factors affecting the choice of a model are:
• Whether the model meets the business goals
• How much preprocessing the model needs
• How accurate the model is
• How explainable the model is
• How fast the model is: How long does it take to build a model, and how long does
the model take to make predictions.
• How scalable the model is
Take care at complexity
Making the same algorithm more complex increases the chance of overfitting.
A model is more complex when:
• It relies on more features to learn and predict (e.g. using two features vs ten
features to predict a target)
• It relies on more complex feature engineering (e.g. using polynomial terms,
interactions, or principal components)
• It has more computational overhead (e.g. a single decision tree vs. a random
forest of 100 trees).
ML Algorithms
ML Algorithms
Linear Regression
Regression algorithms can be used for example, when you
want to compute some continuous value as compared to
Classification where the output is categoric.
• Time to go one location to another
• Predicting sales of particular product next month
• Impact of blood alcohol content on coordination
• Predict monthly gift card sales and improve yearly
revenue projections
Logistic Regression
Logistic regression performs binary classification, so the
label outputs are binary. It takes linear combination of
features and applies non-linear function (sigmoid) to it, so
it’s a very small instance of neural network.
• Predicting the Customer Churn
• Credit Scoring & Fraud Detection
• Measuring the effectiveness of marketing
campaigns
Decision trees
Single trees are used very rarely, but in composition with
many others they build very efficient algorithms such as
Random Forest or Gradient Tree Boosting.
• Investment decisions
• Customer churn
• Banks loan defaulters
• Build vs Buy decisions
• Sales lead qualifications
K-means
Sometimes you don’t know any labels and your goal is to
assign labels according to the features of objects. This is
called clusterization task.
If there are questions like how is this organized or
grouping something or concentrating on particular groups
etc. in your problem statement then you should go with
Clustering.
• When there is a large group of users and you want to
divide them into particular groups based on some
common attributes.
Principal component analysis
Principal component analysis provides dimensionality
reduction. Sometimes you have a wide range of features,
probably highly correlated between each other, and
models can easily overfit on a huge amount of data. Then,
you can apply PCA.
Support Vector Machines
Support Vector Machine (SVM) is a supervised
machine learning technique that is widely used in
pattern recognition and classification problems — 
when your data has exactly two classes.
• detecting persons with common diseases
such as diabetes
• hand-written character recognition
• text categorization — news articles by topics
• stock market price prediction
Neural networks
Neural Networks take in the weights of connections
between neurons.
Extremely complex models can be trained and they
can be utilized as a kind of black box, without playing
out an unpredictable complex feature engineering
before training the model.
Object recognition has been as of late enormously
enhanced utilizing Deep Neural Networks. Applied to
unsupervised learning tasks, such as feature
extraction, deep learning also extracts features from
raw images or speech with much less human
intervention.
Training & Evaluation
Training & Evaluation
Parameter Tuning
Prediction
What you need to know ?
It is mandatory to learn a programming language, preferably Python, along with the
required analytical and mathematical knowledge. You can touch on:
• Linear algebra for data analysis: Scalars, Vectors, Matrices, and Tensors
• Mathematical Analysis: Derivatives and Gradients
• Probability theory and statistics
• Multivariate Calculus
• Algorithms and Complex Optimizations
Programming Languages
Python is hands down the best programming language for
Machine Learning applications.
Other programming languages that could to use for Machine
Learning Applications are R, C++, JavaScript, Java, C#, Julia,
Shell, TypeScript, and Scala.
Cloud Support
Test
Interactive:
• http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
Test with Amazon SageMaker steps:
1. Create an AWS account
2. Launch a SageMaker Instance
3. Created a S3 bucket
4. Launch the Jupyter Notebook
5. Explore examples & tutorials
Libraries
• https://towardsdatascience.com/best-python-libraries-for-machine-
learning-and-deep-learning-b0bd40c7e8c (Python)
• https://blog.bitsrc.io/11-javascript-machine-learning-libraries-to-use-
in-your-app-c49772cca46c (Javascript)
• https://www.baeldung.com/java-ai (Java)
• https://www.geeksforgeeks.org/machine-learning-in-c/ (C++)
Datasets
• https://towardsai.net/p/machine-learning/best-free-datasets-for-
machine-learning-and-data-science/stanfordai/3451/
• https://towardsdatascience.com/free-data-sets-for-machine-
learning-73e74554cc21
• https://www.ubuntupit.com/best-machine-learning-datasets-for-
practicing-applied-ml/
Nestor
Co-founder & Chief Product Officer
Raluca Apostol
raluca@nestorup.com

Choosing a Machine Learning technique to solve your need

  • 1.
    Choosing a ML technique tosolve your need RALUCA APOSTOL
  • 2.
    Nestor Co-founder & ChiefProduct Officer
  • 3.
    Passion for data- science, analysis & interpretation
  • 4.
    Our chat today:) • What is Machine Learning? • What are some popular approaches? • How to think? • How to build your model? • What are the resources? • Let’s get to action.
  • 5.
    Artificial Intelligence • Anythingwhich is not natural and created by humans is artificial. • Intelligence means ability to understand, reason, plan, adapt, etc. • So any code, tech or algorithm that enable machine to mimic, develop or demonstrate the human cognition or behavior is AI
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
    So what isMachine Learning? • Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. • In simple term, Machine Learning means making prediction based on data.
  • 12.
    The ML Mindset “MLchanges the way you think about a problem. The focus shifts from a mathematical science to a natural science, running experiments and using statistics, not logic, to analyze its results." – Peter Norvig Google Research Director
  • 13.
  • 14.
    The ML Mindset F(a,b)= a + b a = 1; b = 2; 3
  • 15.
    The ML Mindset F(a,b)= a + b a = 1; b = 2; 3 F(a,b) = x*a + y*b a = 1; b = 2; 3 x = ?; y = ?
  • 16.
    The steps • Machinelearning is part art and part science. :) • There is no one solution or one approach that fits all. • There are several factors that can affect your decision to choose a machine learning algorithm.
  • 17.
  • 18.
    Know your data Beforeyou start looking at different ML algorithms, you need to have a clear picture of your data, your problem and your constraints.
  • 19.
    Know your data Beforeyou start looking at different ML algorithms, you need to have a clear picture of your data, your problem and your constraints.
  • 20.
    Know your data 1.Look at Summary statistics and visualizations • Percentages can help identify the range for most of the data • Averages and medians can describe central tendency • Correlations can indicate strong relationships 2. Visualize the data • Box plots can identify outliers • Density plots and histograms show the spread of data • Scatter plots can describe bivariate relationships
  • 21.
    Since the collecteddata may be in an undesired format, unorganized, or extremely large, further steps are needed to enhance its quality. The three common steps for preprocessing data are formatting, cleaning, and sampling. Clean your data
  • 22.
    1. How doI deal with missing value? • Missing data affects some models more than others. • Even for models that handle missing data, they can be sensitive to it (missing data for certain variables can result in poor predictions) 2. Does the data needs to be aggregated? Clean your data
  • 23.
  • 24.
  • 25.
    3. What doI do with outliers? • Outliers can be very common in multidimensional data. • Some models are less sensitive to outliers than others. Usually tree models are less sensitive to the presence of outliers. However regression models, or any model that tries to use equations, could definitely be effected by outliers • Outliers can be the result of bad data collection, or they can be legitimate extreme values. Clean your data
  • 26.
  • 27.
    Augment your data 1.Feature engineering is the process of going from raw data to data that is ready for modeling. It can serve multiple purposes: • Make the models easier to interpret (e.g. binning) • Capture more complex relationships (e.g. NNs) • Reduce data redundancy and dimensionality (e.g. PCA) • Rescale variables (e.g. standardizing or normalizing) 2. Different models may have different feature engineering requirements.
  • 28.
    Categorize the problem 1. Categorizeby input: • If you have labelled data, it’s a supervised learning problem. • If you have unlabelled data and want to find structure, it’s an unsupervised learning problem. • If you want to optimize an objective function by interacting with an environment, it’s a reinforcement learning problem. 2. Categorize by output • If the output of your model is a number, it’s a regression problem. • If the output of your model is a class, it’s a classification problem. • If the output of your model is a set of input groups, it’s a clustering problem. • Do you want to detect an anomaly ? That’s anomaly detection.
  • 29.
  • 30.
    Understand your constraints •What is your data storage capacity? Depending on the storage capacity of your system, you might not be able to store gigabytes of classification/regression models or gigabytes of data to clusterize. This is the case, for instance, for embedded systems. • Does the prediction have to be fast? In real time applications, it is obviously very important to have a prediction as fast as possible. For instance, in autonomous driving, it’s important that the classification of road signs be as fast as possible to avoid accidents. • Does the learning have to be fast? In some circumstances, training models quickly is necessary: sometimes, you need to rapidly update, on the fly, your model with a different dataset.
  • 31.
    Choose the algorithm Identifythe algorithms that are applicable and practical to implement using the tools at your disposal. Some of the factors affecting the choice of a model are: • Whether the model meets the business goals • How much preprocessing the model needs • How accurate the model is • How explainable the model is • How fast the model is: How long does it take to build a model, and how long does the model take to make predictions. • How scalable the model is
  • 32.
    Take care atcomplexity Making the same algorithm more complex increases the chance of overfitting. A model is more complex when: • It relies on more features to learn and predict (e.g. using two features vs ten features to predict a target) • It relies on more complex feature engineering (e.g. using polynomial terms, interactions, or principal components) • It has more computational overhead (e.g. a single decision tree vs. a random forest of 100 trees).
  • 33.
  • 34.
  • 35.
    Linear Regression Regression algorithmscan be used for example, when you want to compute some continuous value as compared to Classification where the output is categoric. • Time to go one location to another • Predicting sales of particular product next month • Impact of blood alcohol content on coordination • Predict monthly gift card sales and improve yearly revenue projections
  • 36.
    Logistic Regression Logistic regressionperforms binary classification, so the label outputs are binary. It takes linear combination of features and applies non-linear function (sigmoid) to it, so it’s a very small instance of neural network. • Predicting the Customer Churn • Credit Scoring & Fraud Detection • Measuring the effectiveness of marketing campaigns
  • 37.
    Decision trees Single treesare used very rarely, but in composition with many others they build very efficient algorithms such as Random Forest or Gradient Tree Boosting. • Investment decisions • Customer churn • Banks loan defaulters • Build vs Buy decisions • Sales lead qualifications
  • 38.
    K-means Sometimes you don’tknow any labels and your goal is to assign labels according to the features of objects. This is called clusterization task. If there are questions like how is this organized or grouping something or concentrating on particular groups etc. in your problem statement then you should go with Clustering. • When there is a large group of users and you want to divide them into particular groups based on some common attributes.
  • 39.
    Principal component analysis Principalcomponent analysis provides dimensionality reduction. Sometimes you have a wide range of features, probably highly correlated between each other, and models can easily overfit on a huge amount of data. Then, you can apply PCA.
  • 40.
    Support Vector Machines Support VectorMachine (SVM) is a supervised machine learning technique that is widely used in pattern recognition and classification problems —  when your data has exactly two classes. • detecting persons with common diseases such as diabetes • hand-written character recognition • text categorization — news articles by topics • stock market price prediction
  • 41.
    Neural networks Neural Networkstake in the weights of connections between neurons. Extremely complex models can be trained and they can be utilized as a kind of black box, without playing out an unpredictable complex feature engineering before training the model. Object recognition has been as of late enormously enhanced utilizing Deep Neural Networks. Applied to unsupervised learning tasks, such as feature extraction, deep learning also extracts features from raw images or speech with much less human intervention.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
    What you needto know ? It is mandatory to learn a programming language, preferably Python, along with the required analytical and mathematical knowledge. You can touch on: • Linear algebra for data analysis: Scalars, Vectors, Matrices, and Tensors • Mathematical Analysis: Derivatives and Gradients • Probability theory and statistics • Multivariate Calculus • Algorithms and Complex Optimizations
  • 47.
    Programming Languages Python ishands down the best programming language for Machine Learning applications. Other programming languages that could to use for Machine Learning Applications are R, C++, JavaScript, Java, C#, Julia, Shell, TypeScript, and Scala.
  • 48.
  • 49.
    Test Interactive: • http://www.r2d3.us/visual-intro-to-machine-learning-part-1/ Test withAmazon SageMaker steps: 1. Create an AWS account 2. Launch a SageMaker Instance 3. Created a S3 bucket 4. Launch the Jupyter Notebook 5. Explore examples & tutorials
  • 50.
    Libraries • https://towardsdatascience.com/best-python-libraries-for-machine- learning-and-deep-learning-b0bd40c7e8c (Python) •https://blog.bitsrc.io/11-javascript-machine-learning-libraries-to-use- in-your-app-c49772cca46c (Javascript) • https://www.baeldung.com/java-ai (Java) • https://www.geeksforgeeks.org/machine-learning-in-c/ (C++)
  • 51.
  • 52.
    Nestor Co-founder & ChiefProduct Officer Raluca Apostol raluca@nestorup.com