ICT 3202 - INTRODUCTION
TO DATA SCIENCE
BY
ENGR. JOHNSON C. UBAH
B.ENG, M.ENG, HCNA, ASM
Machine Learning and Statistics
Machine learning is the practice of programming computers to learn from
data.
Machine learning is a subfield of artificial intelligence (AI). The goal of
machine learning generally is to understand the structure of data and fit
that data into models that can be understood and utilized by people.
In machine learning, the data an algorithm learns from is referred to as training
sets or examples.
Intro. To Machine Learning
Machine learning differs from traditional computational approaches:
Traditional computing algorithms are sets of explicitly programmed steps followed by
computers to solve problems.
Machine learning algorithms allow computers to train on data inputs and use
statistical analysis to generate output values that fall within a specific
range.
Why Machine Learning?
Let's assume you'd like to write a spam filter program without using machine learning
methods. The steps would be:
You'd take a look at what spam e-mails look like.
You'd write an algorithm to detect the patterns you've seen, and the
software would then flag those e-mails as spam.
Finally, you'd test the program, and redo the first two steps until the results
are good enough.
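The manual approach above can be sketched in a few lines of Python; the keyword list here is a made-up example, not a real rule set:

```python
# A hand-written spam filter: every rule must be coded and maintained by hand.
# The keyword list below is invented purely for illustration.
SPAM_KEYWORDS = {"free", "winner", "credit", "urgent"}

def is_spam(email_text):
    """Flag an e-mail as spam if it contains any known spam keyword."""
    words = {w.strip(".,!?") for w in email_text.lower().split()}
    return bool(words & SPAM_KEYWORDS)

print(is_spam("You are a WINNER, claim your FREE prize!"))  # True
print(is_spam("Meeting moved to 3pm tomorrow"))             # False
```

Every new spam pattern means editing `SPAM_KEYWORDS` or adding another rule, which is why such programs grow into long rule lists.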
Why Machine Learning?
Such a program contains a very long list of rules and is hence
difficult to maintain. But if it is built with machine learning, you
will be able to maintain it properly.
Programs that use ML techniques automatically detect
changes in user behaviour and update their definitions accordingly.
Why Machine Learning?
A machine learning algorithm updates automatically when users change their preferences.
When to use machine learning
When you have a problem that requires many rules to find the
solution.
Very complex problems for which there is no solution with a
traditional approach.
Non-stable environments: machine learning software can adapt to
new data.
Classification of ML
There are several types of machine learning systems. We can divide them into
categories, depending on:
1. Whether they have been trained with human supervision or not
◦ Supervised
◦ Unsupervised
◦ Semi-supervised
◦ Reinforcement learning
2. Whether they can learn incrementally
3. Whether they work simply by comparing new data points to known data points, or
instead detect patterns in the training data and build a model.
Supervised and unsupervised learning
We can classify machine learning systems according to the type
and amount of human supervision during training:
◦ Supervised learning
◦ Unsupervised learning
◦ Semi-supervised learning
◦ Reinforcement learning
Supervised learning
Supervised learning is when an algorithm learns from example data and associated
target responses, which can consist of numeric values or string labels such as
classes or tags, in order to later predict the correct response when
posed with new examples.
This approach is indeed similar to human learning under the
supervision of a teacher.
Tasks carried out by supervised learning
Supervised learning groups together the task of
classification. The spam filter is a good example of this,
because it is trained with many e-mails together with their
class.
Another example is predicting a numeric value like the
price of a flat, given a set of features (location, number
of rooms, facilities) called predictors; this task is called
regression.
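As a sketch of the regression task, the snippet below fits a least-squares model to made-up flat data; the sizes, room counts, and prices are invented so that price = 2·size + 10·rooms exactly:

```python
import numpy as np

# Made-up flats: predictors are [size in m^2, rooms]; the target price (in
# thousands) is constructed as 2*size + 10*rooms purely for illustration.
X = np.array([[50.0, 2], [70.0, 2], [90.0, 3], [110.0, 4]])
y = 2 * X[:, 0] + 10 * X[:, 1]

# Add an intercept column and fit ordinary least squares.
X1 = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predict the price of a new 80 m^2, 3-room flat.
price = float(np.array([80.0, 3.0, 1.0]) @ coef)
print(round(price, 1))  # 190.0 (= 2*80 + 10*3)
```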
Supervised learning algorithms
You should keep in mind that some regression algorithms can be
used for classification as well, and vice versa.
Some important supervised algorithms:
◦ K-nearest neighbors
◦ Linear regression
◦ Neural networks
◦ Support vector machines
◦ Logistic regression
◦ Decision trees and random forests
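To make the first item concrete, here is a minimal k-nearest-neighbors classifier written from scratch; the toy points and labels are invented:

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated clusters (made-up points).
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["ham", "ham", "ham", "spam", "spam", "spam"]
print(knn_predict(train_X, train_y, (2, 2)))  # ham
print(knn_predict(train_X, train_y, (8, 7)))  # spam
```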
Unsupervised learning
Unsupervised learning occurs when an algorithm learns from plain examples
without any associated response, leaving the algorithm to determine the data
patterns on its own.
This type of algorithm tends to restructure the data into something else, such as
new features that may represent a class or a new series of uncorrelated values.
Unsupervised algorithms are quite useful for providing humans with insights into
the meaning of data, and for producing new useful inputs to supervised machine
learning algorithms.
Unsupervised learning
As a kind of learning, it resembles the methods humans use to figure
out that certain objects or events are from the same class, such as by
observing the degree of similarity between objects. Some
recommendation systems that you find on the web in the form of
marketing automation are based on this type of learning.
In this type of learning the data is unlabeled.
Unsupervised learning algorithms
Some unsupervised learning algorithms include:
◦ Clustering: k-means, hierarchical cluster analysis
◦ Association rule learning: Eclat, Apriori
◦ Visualization and dimensionality reduction: PCA, kernel PCA,
t-distributed stochastic neighbor embedding (t-SNE)
Examples of unsupervised learning
Suppose you have a lot of data on your visitors. You can use a
clustering algorithm to detect groups of similar visitors: 65% of your
visitors might be males who love watching movies in the
evening, while 30% watch plays in the evening. The
clustering algorithm finds these smaller groups for you.
Secondly, visualization algorithms take a large amount of
unlabeled data as input and produce a 2D or 3D
visualization as output. Feature extraction takes place here.
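The clustering example above can be sketched with a minimal k-means implementation; the visitor data (age, minutes of evening viewing) is invented:

```python
def kmeans(points, k=2, iters=10):
    """Minimal k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    centroids = [points[0], points[-1]]  # simple deterministic init (k=2 sketch)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        centroids = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical visitors: (age, minutes of evening viewing).
visitors = [(25, 120), (30, 110), (28, 125), (55, 40), (60, 35), (58, 45)]
centroids, clusters = kmeans(visitors, k=2)
print([len(c) for c in clusters])  # [3, 3]: two groups of similar visitors
```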
Reinforcement learning
An agent (the AI system) observes the
environment, performs given actions, and
then receives rewards in return.
Here, the agent must learn by itself.
You can find this type of learning in many
robotics applications, for example robots that
learn how to walk.
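The observe-act-reward loop can be sketched with Q-learning on a tiny invented environment: a 1-D corridor with states 0..4 and a reward for reaching state 4. The learning rate, discount, exploration rate, and episode count are arbitrary illustrative choices.

```python
import random

ACTIONS = (-1, +1)                       # step left or step right
Q = {(s, a): 0.0 for s in range(5) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2
rng = random.Random(0)

for _ in range(200):                     # training episodes
    s = 0
    while s != 4:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        if rng.random() < eps:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), 4)       # environment transition (walls clamp)
        r = 1.0 if s2 == 4 else 0.0      # reward only at the goal
        # Q-learning update: move toward reward plus discounted best future value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# The learned greedy policy should walk right from every non-goal state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(4)}
print(policy)
```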
Semi-supervised learning
Semi-supervised learning is where an incomplete training signal is given: a
training set with some (often many) of the target outputs missing.
There is a special case of this principle known as
transduction, where the entire set of problem instances is
known at learning time, but part of the targets are
missing.
Bad and Insufficient Quantity of Training
Data
Machine learning systems are not like children,
who can distinguish apples from oranges in all
sorts of colors and shapes; they require a lot of
data to work effectively, whether you're working
with very simple programs and problems, or with
complex applications like image processing and
speech recognition.
Poor Quality Data
If you are working with training data that is full of errors and
outliers, this will make it very hard for the system to detect
patterns, so it won’t work properly.
So, if you want your program to work well, you must spend
more time cleaning up your training data.
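A first cleaning pass can be sketched like this; the sensor-style readings and the cutoff rule are invented for illustration:

```python
import statistics

# Hypothetical readings with an obvious error (-1.0) and an outlier (990.0).
readings = [21.0, 22.5, 20.8, -1.0, 23.1, 990.0, 21.7]

# Drop impossible values, then drop points far from the median
# (the distance cutoff of 10 is an arbitrary illustrative choice).
valid = [r for r in readings if r > 0]
med = statistics.median(valid)
cleaned = [r for r in valid if abs(r - med) < 10]
print(cleaned)  # [21.0, 22.5, 20.8, 23.1, 21.7]
```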
Irrelevant features
The system will only be able to learn if the training data contains enough features
and data that isn't too irrelevant. The most important part of any ML project is to
develop good features: "feature engineering".
Feature engineering follows this process:
◦ Feature selection: selecting the most useful features
◦ Feature extraction: combining existing features to produce more useful ones
◦ Creation of new features: creating new features based on the data
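The three steps can be sketched on invented flat-listing data; `size_per_room` and `is_new` are hypothetical engineered features, not from any real dataset:

```python
# Hypothetical flat listings: (size_m2, rooms, age_of_building, price).
listings = [(60.0, 3, 40, 150.0), (80.0, 4, 5, 200.0), (45.0, 1, 12, 130.0)]

def engineer(size_m2, rooms, age, price):
    return {
        "size_m2": size_m2,                # feature selection: keep a useful one
        "size_per_room": size_m2 / rooms,  # feature extraction: combine features
        "is_new": age < 10,                # feature creation: new feature from data
        "price": price,
    }

rows = [engineer(*listing) for listing in listings]
print(rows[0]["size_per_room"])  # 20.0
```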
Testing
To ensure your model works well and can generalize
to new cases, you can try it out on new cases by putting the
model into its environment and then monitoring how it performs.
This is good practice.
You should divide your data into two sets: one for training and the
second for testing.
Testing
The generalization error is the error rate obtained by evaluating your model on the
test set. The value you get tells you whether your model is good enough, and whether
it will work properly.
If the error rate is low, the model is good and will perform properly, and vice
versa.
It is advisable to use 80% of your data for training and 20% for testing.
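The 80/20 split and the generalization error can be sketched as follows; the data is synthetic and noise-free, and the stand-in "model" happens to match the labelling rule, so its test error comes out to zero:

```python
import random

# Synthetic labelled data: label is 1 when x > 0.5 (an invented rule).
rng = random.Random(42)
data = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(100))]

# Split 80% for training, 20% for testing.
rng.shuffle(data)
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]

# Stand-in "model": a fixed threshold matching the labelling rule above.
def predict(x):
    return int(x > 0.5)

errors = sum(predict(x) != y for x, y in test_set)
generalization_error = errors / len(test_set)
print(len(train_set), len(test_set), generalization_error)  # 80 20 0.0
```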
Overfitting the data
Overgeneralization in machine learning is called "overfitting".
Overfitting occurs when the model is too complex for the amount of
training data given.
Solutions:
◦ Gather more training data
◦ Reduce the noise level in the data
◦ Select a model with fewer parameters
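The effect can be demonstrated by fitting polynomials of different degrees to noisy samples from a straight line; all numbers below are synthetic. The degree-9 polynomial memorizes the ten training points (tiny training error) but strays from the true line between them (larger test error):

```python
import numpy as np

# Ten noisy samples from the line y = 2x + 1 (synthetic data).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = 2 * x + 1 + rng.normal(0, 0.1, size=10)

# Evaluate against the true line on a denser grid.
x_test = np.linspace(0, 1, 50)
y_true = 2 * x_test + 1

results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x, y, degree)   # degree 9 can interpolate all 10 points
    train_err = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    test_err = float(np.mean((np.polyval(coeffs, x_test) - y_true) ** 2))
    results[degree] = (train_err, test_err)
    print(degree, train_err, test_err)
```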
Under-fitting the data
This is the opposite of overfitting. You will encounter it when the model is too
simple to learn the structure of the data.
For example, with the quality-of-life example, real life is more complex than
your model, so the predictions won't be accurate, even on the training
examples.
Solution:
◦ Select a more powerful model, with more parameters
◦ Feed the best features into your algorithm (this is feature
engineering)
◦ Reduce the constraints on your model
Software for this course
Python's popularity may be due to the recent growth of deep learning
frameworks available for the language, including TensorFlow, PyTorch,
and Keras. With its readable syntax and its usability as a
scripting language, Python proves to be powerful and straightforward both for
preprocessing data and for working with data directly. The scikit-learn machine
learning library is built on top of several existing Python packages that Python
developers may already be familiar with, namely NumPy, SciPy, and Matplotlib.
Software for this course
MATLAB makes machine learning easy. With tools and functions for
handling big data, as well as apps to make machine learning accessible,
MATLAB is an ideal environment for applying machine learning to your
data analytics.
With MATLAB, engineers and data scientists have immediate access to
prebuilt functions, extensive toolboxes, and specialized apps
for classification, regression, and clustering.
