This document provides an introduction to machine learning concepts including supervised and unsupervised learning. It discusses linear regression with one variable and multiple variables. For linear regression with one variable, it describes the hypothesis function, cost function, gradient descent algorithm, and makes predictions using a housing dataset. For multiple variables, it introduces feature normalization and applies the concepts to predict housing prices based on size, bedrooms and price in a real estate dataset. The document provides code examples to implement the algorithms.
In this tutorial, we will learn the the following topics -
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
In this tutorial, we will learn the the following topics -
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
Data Science - Part III - EDA & Model SelectionDerek Kane
This lecture introduces the concept of EDA, understanding, and working with data for machine learning and predictive analysis. The lecture is designed for anyone who wants to understand how to work with data and does not get into the mathematics. We will discuss how to utilize summary statistics, diagnostic plots, data transformations, variable selection techniques including principal component analysis, and finally get into the concept of model selection.
Binary Class and Multi Class Strategies for Machine LearningPaxcel Technologies
This presentation discusses the following -
Possible strategies to follow when working on a new machine learning problem.
The common problems with classifiers (how to detect them and eliminate them).
Popular approaches on how to use binary classifiers to problems with multi class classification.
This Logistic Regression Presentation will help you understand how a Logistic Regression algorithm works in Machine Learning. In this tutorial video, you will learn what is Supervised Learning, what is Classification problem and some associated algorithms, what is Logistic Regression, how it works with simple examples, the maths behind Logistic Regression, how it is different from Linear Regression and Logistic Regression applications. At the end, you will also see an interesting demo in Python on how to predict the number present in an image using Logistic Regression.
Below topics are covered in this Machine Learning Algorithms Presentation:
1. What is supervised learning?
2. What is classification? what are some of its solutions?
3. What is logistic regression?
4. Comparing linear and logistic regression
5. Logistic regression applications
6. Use case - Predicting the number in an image
What is Machine Learning: Machine Learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars.This Machine Learning course prepares engineers, data scientists and other professionals with knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world- and with that, there is a growing need among companies for professionals to know the ins and outs of Machine Learning
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning concepts and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems
- - - - - - -
Machine Learning and Data Mining: 04 Association Rule MiningPier Luca Lanzi
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. This lecture introduces association rule mining and the Apriori algorithm
Introduction to Bayesian classifier. It describes the basic algorithm and applications of Bayesian classification. Explained with the help of numerical problems.
Logistic regression : Use Case | Background | Advantages | DisadvantagesRajat Sharma
This slide will help you to understand the working of logistic regression which is a type of machine learning model along with use cases, pros and cons.
Machine Learning Interview Questions and Answers | Machine Learning Interview...Edureka!
** Machine Learning Training with Python: https://www.edureka.co/python **
This Machine Learning Interview Questions and Answers PPT will help you to prepare yourself for Data Science / Machine Learning interviews. This PPT is ideal for both beginners as well as professionals who want to learn or brush up their concepts in Machine Learning core-concepts, Machine Learning using Python and Machine Learning Scenarios. Below are the topics covered in this tutorial:
1. Machine Learning Core Interview Question
2. Machine Learning using Python Interview Question
3. Machine Learning Scenario based Interview Question
Check out our playlist for more videos: http://bit.ly/2taym8X
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Data Science - Part III - EDA & Model SelectionDerek Kane
This lecture introduces the concept of EDA, understanding, and working with data for machine learning and predictive analysis. The lecture is designed for anyone who wants to understand how to work with data and does not get into the mathematics. We will discuss how to utilize summary statistics, diagnostic plots, data transformations, variable selection techniques including principal component analysis, and finally get into the concept of model selection.
Binary Class and Multi Class Strategies for Machine LearningPaxcel Technologies
This presentation discusses the following -
Possible strategies to follow when working on a new machine learning problem.
The common problems with classifiers (how to detect them and eliminate them).
Popular approaches on how to use binary classifiers to problems with multi class classification.
This Logistic Regression Presentation will help you understand how a Logistic Regression algorithm works in Machine Learning. In this tutorial video, you will learn what is Supervised Learning, what is Classification problem and some associated algorithms, what is Logistic Regression, how it works with simple examples, the maths behind Logistic Regression, how it is different from Linear Regression and Logistic Regression applications. At the end, you will also see an interesting demo in Python on how to predict the number present in an image using Logistic Regression.
Below topics are covered in this Machine Learning Algorithms Presentation:
1. What is supervised learning?
2. What is classification? what are some of its solutions?
3. What is logistic regression?
4. Comparing linear and logistic regression
5. Logistic regression applications
6. Use case - Predicting the number in an image
What is Machine Learning: Machine Learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars.This Machine Learning course prepares engineers, data scientists and other professionals with knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world- and with that, there is a growing need among companies for professionals to know the ins and outs of Machine Learning
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning concepts and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems
- - - - - - -
Machine Learning and Data Mining: 04 Association Rule MiningPier Luca Lanzi
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. This lecture introduces association rule mining and the Apriori algorithm
Introduction to Bayesian classifier. It describes the basic algorithm and applications of Bayesian classification. Explained with the help of numerical problems.
Logistic regression : Use Case | Background | Advantages | DisadvantagesRajat Sharma
This slide will help you to understand the working of logistic regression which is a type of machine learning model along with use cases, pros and cons.
Machine Learning Interview Questions and Answers | Machine Learning Interview...Edureka!
** Machine Learning Training with Python: https://www.edureka.co/python **
This Machine Learning Interview Questions and Answers PPT will help you to prepare yourself for Data Science / Machine Learning interviews. This PPT is ideal for both beginners as well as professionals who want to learn or brush up their concepts in Machine Learning core-concepts, Machine Learning using Python and Machine Learning Scenarios. Below are the topics covered in this tutorial:
1. Machine Learning Core Interview Question
2. Machine Learning using Python Interview Question
3. Machine Learning Scenario based Interview Question
Check out our playlist for more videos: http://bit.ly/2taym8X
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Fundamentals of Machine Learning Bootcamp - 24 Nov London 2014 Persontyle
Fundamentals of Machine Learning Bootcamp will take you through the conceptual and applied foundations of the subject. Topics covered will include Machine Learning theory, types of learning, techniques, models and methods. Labs are developed to practically learn how to use the R programming language and packages for applying the main concepts and techniques of Machine Learning.
For corporate bookings or to organize on-site training email hello@persontyle.comor call now +44 (0)20 3239 3141
www.persontyle.com
A legacy media organization with a nationwide audience recently released a mobile app in an attempt to capture audience share among listeners who access audio content through digital distribution channels. The team signed an NDA with this organization, and will refer to this partner as the “Broadcaster” within our published materials.
The Broadcaster’s app surfaces a stream of audio content to users. Users can hear one of two types of content.
(1) News-- including the top of the hour newscasts, local and national news, and stories from the Broadcaster’s flagship news programs.
(2) Podcast-- including podcasts created by the Broadcaster and also independently created content like “Another Round” from Buzzfeed.
In app, users can skip, thumbs-up, share, or search for content. The Broadcaster has provided user data gathered by this app to our team. In this paper, the team describes our work building a model that will allow the Broadcaster to determine, for any given user, at any given hour, whether the app should surface news or a podcast to the user.
Georgetown Data Science Certificate, Spring 2016
No More Half Fast: Improving US Broadband Download Speed. Georgetown Universi...Brittne Kakulla, Ph.D.
Big Data Science analysis of economic drivers impacting US broadband development using Census data and State Broadband Initiative Broadband Map data from 2011-2014.
Georgetown Data Analytics - Team 1 Capstone ProjectMark Phillips
Capstone project presentation from Team 1, Fall 2014. Project utilized data analytics processes and tools to perform text and network analytics on a sample of twitter accounts and their publicly available data.
Georgetown Data Analytics Project (Team DC)Noah Turner
This is my team's project for the Georgetown University Certificate in Data Analytics Program. We looked at Washington, DC crime data, and conducted analysis looking at correlations between types of crime and neighborhood information provided by the US Census.
A gentle introduction to 2 classification techniques, as presented by Kriti Puniyani to the NYC Predictive Analytics group (April 14, 2011). To download the file please go here: http://www.meetup.com/NYC-Predictive-Analytics/files/
A short presentation for beginners on Introduction of Machine Learning, What it is, how it works, what all are the popular Machine Learning techniques and learning models (supervised, unsupervised, semi-supervised, reinforcement learning) and how they works with various Industry use-cases and popular examples.
Covers supervised learning and discriminative algorithms. Includes: Linear Regression, The LMS Algorithm, Probabalistic interpretations, Classification, Logistic Regression, Underfitting and Overfitting.
Statistical Inference Part II: Types of Sampling DistributionDexlab Analytics
This is an in-depth analysis of the way different types of sampling distribution works focusing on their specific functions and interrelations as part of the discussion on the theory of sampling.
Introduction to linear regression and the maths behind it like line of best fit, regression matrics. Other concepts include cost function, gradient descent, overfitting and underfitting, r squared.
Calculus Application Problem #3 Name _________________________.docxhumphrieskalyn
Calculus Application Problem #3 Name __________________________________________
The Deriving Dead! Due at the beginning of class ______________________
Introduction: Imagine that you are one of many people at a “party” and that, unknown to everyone else, one
person was bitten by a zombie on the way to the party! How quickly will the “zombiepocalypse” spread, and
what are the chances that you will leave the party as a zombie? The objective of this activity is to create a
mathematical model that describes the spread of a disease (such as a zombie virus) in a closed environment, and
then apply calculus concepts to this mathematical model.
Collecting the Data:
Let’s collect some data from an activity that will simulate the spread of a communicable disease over a period of
time, divided into “stages”.
The number of people in our “closed environment” is ________________
1
1
131211109876543210
Number
of Total
Infected
Number
of Newly
Infected
Stage
Number
Applying Calculus to the Data:
1. Using the data from the chart, make a scatterplot of the "Stage Number" (in L1) vs. the "Number of Total
Infected" (in L2). Sketch the scatterplot below. Connect the data points to create a continuous function for Y(t).
2. Using the data that was collected in the activity, answer the following questions about the derivative
function Y’(t), which represents the instantaneous rate of change of the number of infected at any stage.
Consider the domain to be [ 0 , 13 ].
a. When, if ever, is Y’(t) positive? ____________________________________
b. When, if ever, is Y’(t) negative? ___________________________________
c. When, if ever, is Y’(t) increasing? ____________________________________
d. When, if ever, is Y’(t) decreasing? ____________________________________
e. From your answers above, sketch a graph of Y’(t) below.
f. The t-value where Y’(t) changes from increasing to decreasing is the inflection point on Y(t).
According to the data in the chart, this occurs when t = _________, and the corresonding “y-value”
is ________.
(Note: We will check this later in the problem!)
Finding a Logistic Function that Models the Data
3. Since the data (should) appear to be a model for a logistic function, we need to find a function in the
form:
Y(t ) =
c
1 + a ⋅ e− b⋅t
,
where t represents the stage number and Y(t) represents the total number of infected people in stage t.
Therefore, we need to find values for the three constants a, b, and c. The value of c should be easy. For our
activity,
c = _________
To find a, use the initial point ( 0 , 1 ). Substitute this ordered pair, with the value of c into our logistic model and
solve for a. Show your work below.
a = _______________
To find b, the last constant in the model, we need another ordered pair. Let’s use an ordered pair near the middle
of the data, say during Stage #7.
Record this ordered pair: ( 7 , _________)
Substitut ...
Data Analysis and Statistical inferenceShreyas G S
Analyzing relationship between two variables, whether they are dependent or independent variables in a General Social Survey data set using data analysis and inference techniques.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
2. Machine Learning
Page 1
INTRODUCTION
Machine Learning
"A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E."
Where
E= The experience of playing many games of checkers.
T= Task of playing checkers.
P= The probability that the program will win the next game.
Supervised Learning
In supervised learning, we are given a data set and already know what our correct output
should look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into regression and classification problems. In
a regression problem, we are trying to predict results within a continuous output, meaning
that we are trying to map input variables to some continuous function. In a classification
problem, we are trying to predict results in a discrete output. In other words, we are trying
to map input variables into discrete categories.
Example:
Regression-Given data about the size of houses on the real estate market, try to predict their
price. Price as a function of size is a continuous output.
Classification-Bank have to decide whether or not they give loan to someone on this basis of
his credit history.
3. Machine Learning
Page 2
Unsupervised Learning
Unsupervised learning, on the other hand, allows us to approach problems with little or no
idea what our results should look like. We can derive structure from data where we don't
necessarily know the effect of the variables.
We can derive this structure by clustering the data based on relationships among the
variables in the data.
Example:
Clustering: Take a collection of 1000 essays written on the US Economy, and find a way to
automatically group these essays into a small number that are somehow similar or related by
different variables, such as word frequency, sentence length, page count, and so on.
1. Linear Regression With One Variable
Linear regression with one variable is also known as "univariate linear regression."
Univariate linear regression is used when you want to predict a single output value from a
single input value. We're doing supervised learning here, so that means we already have an
idea what the input/output cause and effect should be.
The Hypothesis Function
Our hypothesis function has the general form: hθ(x)= θ0 + θ1x.
We are trying to create a function called hθ that is able to reliably map our input data(x’s) to
our output data(y’s).
Example:
X(input) y(output)
0 4
1 7
2 7
3 8
4. Machine Learning
Page 3
Now we can make a random guess about our hθ function: θ0=2 and θ1=2. The hypothesis
function becomes hθ(x)= 2 + 2x.
So for input of 1 to our hypothesis, y will be 4. This is off by 3.
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This takes
an average (actually a fancier version of an average) of all the results of the hypothesis with
inputs from x's compared to the actual output y's.
J(θ0,θ1) = (1/2m)∑ (𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)2
To break it apart, it is ½ 𝑥̅ where 𝑥̅ is the mean of the squares of hθ(xi) - yi, or the difference
between the predicted value and the actual value.
This function is otherwise called the "Squared error function", or "Mean squared error". The
mean is halved (1/2m) as a convenience for the computation of the gradient descent, as the
derivative term of the square function will cancel out the ½ term.
Now we are able to concretely measure the accuracy of our predictor function against the
correct results we have so that we can predict new results we don't have.
Gradient Descent
So we have our hypothesis function and we have a way of measuring how accurate it is. Now
what we need is a way to automatically improve our hypothesis function. That's where
gradient descent comes in.
Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we are
graphing the cost function for the combinations of parameters). We are not graphing x and y
itself, but the parameter range of our hypothesis function and the cost resulting from
selecting particular set of parameters.
We put θ0 on X axis and θ1 on Y axis, with the cost function on the vertical z axis. The points
on our graph will be the result of the cost function using our hypothesis with those specific
theta parameters.
We will know that we have succeeded when our cost function is at the very bottom of the
pits in our graph, i.e. when its value is the minimum.
5. Machine Learning
Page 4
The way we do this is by taking the derivative (the line tangent to a function) of our cost
function. The slope of the tangent is the derivative at that point and it will give us a direction
to move towards. We make steps down that derivative by the parameter α, called the learning
rate.
The gradient descent equation is:
Repeat until convergence:
θj ∶= θj -α
𝜹
𝜹𝜽𝐣
J(θ0, θ1)
for j=0 and j=1.
Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent
equation can be derived. We can substitute our actual cost function and our actual hypothesis
function and modify the equation.
Repeat until convergence: {
θ0 ∶= θ0-α
𝟏
𝒎
∑ (𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)
θ1 ∶= θ1-α
𝟏
𝒎
∑ ((𝒉𝜽(xi
𝒎
𝒊=𝟏 ) − 𝐲i)xi)
}
Where m is the size of the training set, θ0 a constant that will be changing simultaneously with
θ1 and xi and yi are values of the given training set.
We have separated out the two cases for θj and that for θ1 we are multiplying xi at the end
due to derivative.
The point of all this is that if we start with a guess for our hypothesis and then repeatedly
apply these gradient descent equations, our hypothesis will become more and more accurate.
6. Machine Learning
Page 5
Linear Regression With Multiple Variables
Linear regression with multiple variables is also known as "multivariate linear regression".
We now introduce notation for equations where we can have any number of input variables.
xj
(i) = value of j in the ith training example.
x(i) = the column vector of all the feature inputs of the ith training example.
m = number of training examples.
n = |x(i)| (the number of features)
Now we define the multivariable form of the hypothesis function as follows, accommodating
these multiple features:
hθ(x)= θ0 + θ1x1 + θ2x2 + θ3x3 +….+ θnxn
In order to have little intuition about this function, we can think about θ0 as the basic price of
a house, θ1 as the price per square metre, θ2 as the price per floor, etc. x1 will be the number
of square meters in the floor, x2 the number of floors in the house, etc.
Using the definition of matrix multiplication, our multivariable hypothesis function can be
concisely represented as:
hθ(x)= [ θ0 θ1 θ2 …. θn][
x0
⋮
xn
] = θTx
This is a vectorization of our hypothesis function for one training example.
Now we can collect all m training examples each with n features and record them in an n+1
by m matrix. In this matrix we let the value of the subscript (feature) also represent the row
number (except the initial row is the "zeroth" row), and the value of the superscript (the
training example) also represent the column number, as shown here:
X=[
x0(1) ⋯ x0(m)
⋮ ⋮ ⋮
xn(1) ⋯ xn(m)
]
Notice above that the first column is the first training example (like the vector above), the
second column is the second training example, and so forth.
Now we can define hθ(X) as a row vector that gives the value of hθ(x) at each of the m training
examples:
hθ(X) = θTX
7. Machine Learning
Page 6
If we store the training examples in X row wise, the hypothesis function can be calculated as:
hθ(X) = Xθ
Cost Function
For the parameter vector θ, the cost function is calculated as:
J(θ) = (1/2m)∑ (𝒉𝜽(x(i)𝒎
𝒊=𝟏 ) − 𝐲(i))2
Gradient Descent For Multiple Variables
The gradient descent equation itself is generally the same form; we just have to repeat it for
our 'n' features:
Repeat until convergence: {
θ0 ∶= θ0 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟎(i)
θ1 ∶= θ1 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟏(i)
θ2 ∶= θ0 - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝟐(i)
…..
}
In other words:
Repeat until convergence: {
θj ∶= θj - α
𝟏
𝒎
∑ (𝐡𝛉(𝐱(i)) − 𝐲(i)𝒎
𝒊=𝟏 ). 𝐱𝐣(i)
for j = 0….n
}
8. Machine Learning
Page 7
Feature Normalization
We can speed up gradient descent by having each of our input values in roughly the same
range. This is because θ will descend quickly on small ranges and slowly on large ranges, and
so will oscillate inefficiently down to the optimum when the variables are very uneven.
Two techniques to help with this are feature scaling and mean normalization. Feature scaling
involves dividing the input values by the range (i.e. the maximum value minus the minimum
value) of the input variable, resulting in a new range of just 1. Mean normalization involves
subtracting the average value for an input variable from the values for that input variable,
resulting in a new average value for the input variable of just zero. To implement both of
these techniques, adjust your input values as shown in this formula:
xi ∶=
𝒙𝒊−µ𝒊
𝒔𝒊
µi is the average of all the values and si is the standard deviation.
Predictive Analysis Using Machine Learning-1
In this part of Machine Learning, we will implement linear regression with one variable to
predict the outcome using Octave.
Problem
Suppose you are the CEO of a restaurant franchise and are considering different cities for
opening a new outlet. The chain already has trucks in various cities and you have data for
profits and populations from the cities. You would like to use this data to help you select
which city to expand to next.
Dataset
The file ex1data1.txt contains the dataset for our linear regression problem. The first column
is the population of a city and the second column is the profit of a food truck in that city. A
negative value for profit indicates a loss.
Plotting The Data
Before starting on any task, it is often useful to understand the data by visualizing it. For this
dataset, we can use a scatter plot to visualize the data, since it has only two properties to plot
(profit and population).
9. Machine Learning
Page 8
Code:
function plotData(X, y)
plot(X,y,'rx','MarkerSize',10);
ylabel('Profit in $10,000s');
xlabel('Population of City in 10,000s');
Gradient Descent
We will fit the linear regression parameters θ to our dataset using gradient descent.
Code:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
10. Machine Learning
Page 9
delta = (1/m)*sum(X.*repmat((X*theta - y), 1, size(X,2)));
theta = (theta' - (alpha * delta))';
J_history(iter) = computeCost(X, y, theta);
end
Computing Cost
We will implement a function to calculate J(θ) so you can check the convergence of gradient
descent implementation.
Code:
function J = computeCost(X, y, theta)
m = length(y);
J = (1/(2*m))*sum(power((X*theta - y),2));
end
Plotting Linear Fit
We plot X variable values on x axis and X*theta values on y axis which gives us a linear fit from
which we can predict y values for unseen examples.
Code:
plot(X(:,2), X*theta, '-')
11. Machine Learning
Page 10
Working
A good way to verify that gradient descent is working correctly is to look at the value of J(θ)
and check that it is decreasing with each step. The starter code of gradient descent calls
compute cost on every iteration and prints the cost.
Assuming we have implemented gradient descent and compute cost correctly, our value of
J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
Prediction
The final values of θ will be used to make predictions on profits in areas of 35,000 and 70,000.
Code:
predict1 = [1, 35000] *theta;
predict2 = [1, 70000] * theta;
Visualizing Cost Function
To understand the cost function J(θ) better, we will now plot the cost over a 2-dimensional
grid of θ0 and θ1 values. A code is set up to calculate J(θ) over a grid of values using the
compute cost function.
Code:
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);
J_vals = zeros(length(theta0_vals), length(theta1_vals));
for i = 1:length(theta0_vals)
for j = 1:length(theta1_vals)
t = [theta0_vals(i); theta1_vals(j)];
J_vals(i,j) = computeCost(X, y, t);
end
end
surf(theta0_vals, theta1_vals, J_vals)
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
13. Machine Learning
Page 12
The purpose of these graphs is to show you that how J(θ) varies with changes in θ0 and θ1.
The cost function J(θ) is bowl-shaped and has a global minimum.
This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves
closer to this point.
Program Execution
14. Machine Learning
Page 13
Predictive Analysis Using Machine Learning-2
In this part, we will implement linear regression with multiple variables to predict the prices
of houses.
Problem
Suppose you are selling your house and you want to know what a good market price would
be. One way to do this is to first collect information on recent houses sold and make a model
of housing prices.
Dataset
The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first
column is the size of the house (in square feet), the second column is the number of
bedrooms, and the third column is the price of the house.
Feature Normalization
By examining the dataset values, note that house sizes are about 1000 times the number of
bedrooms.
When features differ by orders of magnitude, first performing feature scaling can make
gradient descent converge much more quickly.
The standard deviation is a way of measuring how much variation there is in the range of
values of a particular feature.
Code:
function [X_norm, mu, sigma] = featureNormalize(X)
mu = mean(X);
sigma = std(X);
X_norm = (X - repmat(mu, [size(X,1), 1]))./repmat(sigma, [size(X,1), 1]);
end
15. Machine Learning
Page 14
Gradient Descent
Previously, we implemented gradient descent on a univariate regression problem. The only
difference now is that there is one more feature in the matrix X. The hypothesis function and
the batch gradient descent update rule remain unchanged.
Code:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
m = length(y);
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
hy = X*theta-y;
hyr = repmat(hy, [1, size(X,2)]);
prodhyrx = hyr.*X;
sumVec = sum(prodhyrx)';
theta = theta - alpha.*sumVec./m;
J_history(iter) = computeCostMulti(X, y, theta);
end
end
Compute Cost
Code:
function J = computeCostMulti(X, y, theta)
m = length(y);
J = 0;
J = 1/(2*m) * (X * theta - y)' * (X * theta - y);
end
16. Machine Learning
Page 15
Convergence Graph
We plot a graph of cost J(θ) on Y axis against the number of iterations on the X axis. We
observe that the cost is reducing with each iteration towards the minimum. This minimum is
the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.
Working
A good way to verify that gradient descent is working correctly is to look at the value of J(θ)
and check that it is decreasing with each step. The starter code of gradient descent calls
compute cost on every iteration and prints the cost.
Assuming we have implemented gradient descent and compute cost correctly, our value of
J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
17. Machine Learning
Page 16
Prediction
We now predict the price of a house of an unseen example using the optimal values of theta
obtained from the gradient descent.
The price of a house with 1650 square meters having 3 bedrooms is:
Code:
X = [1650, 3];
X_norm = (X-mu)./sigma;
price = [1, X_norm]*theta;
Price: $293237.1614
Normal Equation
The closed-form solution to linear regression is:
θ = (XTX)-1XT
y
Using this formula does not require any feature scaling, and you will get an exact solution in
one calculation. There is no "loop until convergence" like in gradient descent.
Code:
function [theta] = normalEqn(X, y)
theta = zeros(size(X, 2), 1);
theta = pinv(X'*X)*X'*y;
end
Prediction
Code:
price = [1 1650 3]*theta;
19. Machine Learning
Page 18
2. Logistic Regression
Now we are switching from regression problems to classification problems. The name
‘Logistic Regression’ is an approach to classification problem.
Classification
Instead of our output vector y being a continuous range of values, it will only be 0 or 1.
Y ∈ {0,1}
Where 0 is usually taken as the "negative class" and 1 as the "positive class", but we are free
to assign any representation to it. We're only doing two classes for now, called a "Binary
Classification Problem."
One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all
less than 0.5 as a 0. This method doesn't work well because classification is not actually a
linear function. Therefore, we use a sigmoid function to classify.
Hypothesis Representation
Our hypothesis should satisfy:
0≤hθ(x)≤1
Our new form uses the sigmoid function, also called the “logistic function”:
hθ(x) = g(θTx)
Z= θTx
g(z) =
1
1+𝑒−𝑧
The function g(z), shown below, maps any real number to the (0, 1) interval, making it useful
for transforming an arbitrary-valued function into a function better suited for classification.
We start with our old hypothesis (linear regression), except that we want to restrict the range
to 0 and 1. This is accomplished by plugging θTx into the Logistic Function.
20. Machine Learning
Page 19
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis
function as follows:
hθ(x) ≥ 0.5 → y=1
hθ(x) < 0.5 → y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero,
its output is greater than or equal to 0.5:
g(z) ≥ 0.5
when z ≥ 0
Remember:
z=0, e0=1, g(z)=1/2
z →∞, e−∞ →0, g(z)=1
21. Machine Learning
Page 20
z→ −∞, e∞ →∞, g(z)=0
So if our input to g is θTx, then that means:
hθ(x) = g(θTx) ≥ 0.5
when θTx ≥ 0
From these statements we can now say:
θTx ≥ 0 → y = 1
θTx < 0 → y = 0
The decision boundary is the line that separates the area where y=0 and where y=1. It is
created by our hypothesis function.
Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic
Function will cause the output to be wavy, causing many local optima. In other words, it will
not be a convex function.
Instead, our cost function for logistic regression looks like:
J(θ) = (1/m)∑ 𝐂𝐨𝐬𝐭(𝒉𝜽(x(i)𝒎
𝒊=𝟏 ) − 𝐲(i))
Cost(hθ(x),y) = −log(hθ(x)) if y = 1
Cost(hθ(x),y) = −log(1−hθ(x)) if y = 0
22. Machine Learning
Page 21
The more our hypothesis is off from y, the larger the cost function output. If our hypothesis
is equal to y, then our cost is 0:
Cost(hθ(x),y) = 0 if hθ(x) = y
Cost(hθ(x),y)→∞ if y = 0and hθ(x)→1
Cost(hθ(x),y)→∞ if y = 1and hθ(x)→0
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also
outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs
1. If our hypothesis approaches 0, then the cost function will approach infinity.
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic
regression.
Simplified Cost Function And Gradient Descent
We can compress our cost function's two conditional cases into one case:
Cost(hθ(x),y) = −ylog(hθ(x)) − (1−y)log(1−hθ(x))
Notice that when y is equal to 1, then the second term ((1−y)log(1−hθ(x))) will be zero and will
not affect the result. If y is equal to 0, then the first term (−ylog(hθ(x))) will be zero and will
not affect the result.
We can fully write out our entire cost function as follows:
J(θ) = -
𝟏
𝒎
∑ .𝒎
𝒊=𝟏 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i)))
Gradient Descent:
Repeat {
θj :=θj −
𝛼
𝑚
∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))x(i)j
}
23. Machine Learning
Page 22
Regularization
Regularization is designed to address the problem of overfitting.
High bias or underfitting is when the form of our hypothesis maps poorly to the trend of the
data. It is usually caused by a function that is too simple or uses too few features. Ex. if we
take hθ(x)=θ0+θ1x1+θ2x2 then we are making an initial assumption that a linear model will fit
the training data well and will be able to generalize but that may not be the case.
At the other extreme, overfitting or high variance is caused by a hypothesis function that fits
the available data but does not generalize well to predict new data. It is usually caused by a
complicated function that creates a lot of unnecessary curves and angles unrelated to the
data.
This terminology is applied to both linear and logistic regression.
There are two main options to address the issue of overfitting:
1. Reduce the number of features
2. Regularization – Keep all features, reduce parameters θj.
Regularization works well when we have a lot of slightly useful features.
Cost Function
If we have overfitting from our hypothesis function, we can reduce the weight that some of
the terms in our function carry by increasing their cost.
Say we wanted to make the following function more quadratic:
θ0+θ1x+θ2x2+θ3x3+θ4x4
We'll want to eliminate the influence of θ3x3 and θ4x4. Without actually getting rid of these
features or changing the form of our hypothesis, we can instead modify our cost function:
minθ
1
2𝑚
∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))2 + 1000⋅θ2
3+1000⋅θ2
4
We've added two extra terms at the end to inflate the cost of θ3 and θ4. Now, in order for
the cost function to get close to zero, we will have to reduce the values of θ3 and θ4 to near
zero. This will in turn greatly reduce the values of θ3x3 and θ4x4 in our hypothesis function.
We could also regularize all of our theta parameters in a single summation:
24. Machine Learning
Page 23
minθ
1
2𝑚
[ ∑ .𝑚
𝑖=1 (hθ(x(i))−y(i))2 + λ∑ (𝑛
𝑗=1 θ2
j)]
The λ, or lambda, is the regularization parameter. It determines how much the costs of our
theta parameters are inflated.
Using the above cost function with the extra summation, we can smooth the output of our
hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth
out the function too much and cause underfitting.
Regularized Linear Regression
We will modify our gradient descent function to separate out θ0 from the rest of the
parameters because we do not want to penalize θ0.
Repeat {
θ0 ∶= θ0 - α
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). x0(i)
θj ∶= θj – α[(
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). xj(i)) +
𝜆
𝑚
θj]
}
The term
𝜆
𝑚
θj performs our regularization.
Regularized Logistic Regression
We can regularize logistic regression in a similar way that we regularize linear regression. Let's
start with the cost function.
J(θ) = -
1
𝑚
∑ .𝑚
𝑖=1 [y(i)log(hθ(x(i))) + (1−y(i))log(1−hθ(x(i))) +
𝜆
2𝑚
∑ (𝑛
𝑗=1 θ2
j)
Gradient Descent:
Repeat {
θ0 ∶= θ0 - α
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). x0(i)
θj ∶= θj – α[(
1
𝑚
∑ (hθ(x(i)) − y(i)𝑚
𝑖=1 ). xj(i)) +
𝜆
𝑚
θj]
}
25. Machine Learning
Page 24
Predictive Analysis Using Machine Learning-3
In this part, we will build a logistic regression model to predict whether a student gets
admitted into a university.
Problem
Suppose that you are the administrator of a university department and you want to
determine each applicant's chance of admission based on their results on two exams.
Our task is to build a classification model that estimates an applicant's probability of
admission based the scores from those two exams.
Dataset
We have historical data from previous applicants that we use as a training set for logistic
regression. For each training example, we have the applicant's scores on two exams and the
admissions decision.
Visualizing The Data
We code the function plotData so that it displays a figure, where the axes are the two exam
scores, and the positive and negative examples are shown with different markers.
Code:
data = load('ex2data1.txt');
X = data(:, [1, 2]);
y = data(:, 3);
function plotData(X, y)
pos = find(y==1);
neg = find(y == 0);
plot(X(pos, 1), X(pos, 2), 'k+','LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);
end
plotData(X, y);
xlabel('Exam 1 score')
ylabel('Exam 2 score')
26. Machine Learning
Page 25
Cost Function and Gradient Descent
We will implement the cost function and gradient for logistic regression. Here we use these
functions only to calculate the initial cost and gradient using initial theta(zeros).
Code:
initial_theta = zeros(n + 1, 1);
function [J, grad] = costFunction(theta, X, y)
m = length(y);
J = 0;
grad = zeros(size(theta));
J = (1 / m) * sum( -y'*log(sigmoid(X*theta)) - (1-y)'*log( 1 - sigmoid(X*theta)) );
grad = (1 / m) * sum( X .* repmat((sigmoid(X*theta) - y), 1, size(X,2)) );
end
27. Machine Learning
Page 26
Sigmoid Function
We compute the sigmoid of each value of z (z can be a matrix, vector or scalar). This function
is called by other parts of the program during various stages.
Code:
function g = sigmoid(z)
g = zeros(size(z));
g = 1 ./ (1 + exp(-z));
end
Learning Parameters using fminunc
In the previous assignment, you found the optimal parameters of a linear regression model
by implementing gradient descent.
This time, instead of taking gradient descent steps, we will use an Octave built-in function
called fminunc. fminunc is an optimization solver that finds the minimum of an unconstrained
function.
Code:
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
We first define the options to be used with fminunc. Specifically, we set the GradObj option
to on, which tells fminunc that our function returns both the cost and the gradient. This allows
fminunc to use the gradient when minimizing the function. Furthermore, we set the MaxIter
option to 400, so that fminunc will run for at most 400 steps before it terminates.
To specify the actual function, we are minimizing, we use a "short-hand" for specifying
functions with the @(t) (costFunction(t, X, y)). This creates a function, with argument t, which
calls your costFunction. This allows us to wrap the costFunction for use with fminunc. fminunc
will converge on the right optimization parameters and return the final values of the cost and
θ.
Decision Boundary Plot
The final θ value obtained from fminunc will then be used to plot the decision boundary on
the training data.
Code:
function plotDecisionBoundary(theta, X, y)
plotData(X(:,2:3), y);
28. Machine Learning
Page 27
hold on
if size(X, 2) <= 3
plot_x = [min(X(:,2))-2, max(X(:,2))+2];
plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));
plot(plot_x, plot_y)
legend('Admitted', 'Not admitted', 'Decision Boundary')
axis([30, 100, 30, 100])
else
% Here is the grid range
u = linspace(-1, 1.5, 50);
v = linspace(-1, 1.5, 50);
z = zeros(length(u), length(v));
for i = 1:length(u)
for j = 1:length(v)
z(i,j) = mapFeature(u(i), v(j))*theta;
end
end
z = z';
contour(u, v, z, [0, 0], 'LineWidth', 2)
end
hold off
end
plotDecisionBoundary plots the data points X and y into a new figure with the decision
boundary defined by theta.
29. Machine Learning
Page 28
Prediction
After learning the parameters, we'll like to use it to predict the outcomes on unseen data. In
this part, we will use the logistic regression model to predict the probability that a student
with score 45 on exam 1 and score 85 on exam 2 will be admitted.
prob = sigmoid([1 45 85] * theta);
We compute the training accuracy using predict function. Predict computes the predictions
for X using a threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)
p = predict(theta, X);
m = size(X, 1);
p = zeros(m, 1);
p = sigmoid(X * theta)>=0.5 ;
end
fprintf('Train Accuracy: %fn', mean(double(p == y)) * 100);
31. Machine Learning
Page 30
3. Unsupervised Learning: Clustering
Unsupervised learning is contrasted from supervised learning because it uses an unlabelled
training set rather than a labelled one.
In other words, we don't have the vector y of expected results, we only have a dataset of
features where we can find structure.
Clustering is good for:
Market segmentation
Social network analysis
Organizing computer clusters
Astronomical data analysis
K-Means Algorithm
The K-Means Algorithm is the most popular and widely used algorithm for automatically
grouping data into coherent subsets.
The steps include:
1. Randomly initialize two points in the dataset called the cluster centroids.
2. Cluster assignment: assign all examples into one of two groups based on which cluster
centroid the example is closest to.
3. Move centroid: compute the averages for all the points inside each of the two cluster
centroid groups, then move the cluster centroid points to those averages.
4. Re-run (2) and (3) until we have found our clusters.
Our main variables are:
K (number of clusters)
Training set x(1), x(2),….,x(m)
Where x(i) ϵ IRn
Note that we will not use the x0=1 convention.
32. Machine Learning
Page 31
Algorithm:
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
Repeat:
for i = 1 to m:
c(i) := index (from 1 to K) of cluster centroid closest to x(i)
for k = 1 to K:
mu(k) := average (mean) of points assigned to cluster k
The first for-loop is the 'Cluster Assignment' step. We make a vector c where c(i) represents
the centroid assigned to example x(i).
We can write the operation of the Cluster Assignment step more mathematically as follows:
c(i) = argmink||x(i) - µk||
That is, each c(i) contains the index of the centroid that has minimal distance to x(i).
By convention, we square the right-hand-side, which makes the function we are trying to
minimize more sharply increasing.
The second for-loop is the 'Move Centroid' step where we move each centroid to the average
of its group.
The equation for this loop is as follows:
µk =
1
𝑛
[x(k1) + x(k2) + … + x(kn)] ϵ IRn
Where each of x(k1), x(k2),…, x(kn) are the training examples assigned to group μk.
If you have a cluster centroid with 0 points assigned to it, you can randomly re-initialize that
centroid to a new point. You can also simply eliminate that cluster group.
After a number of iterations, the algorithm will converge, where new iterations do not affect
the clusters.
Optimization Objective
Recall some of the parameters we used in our algorithm:
c(i) = index of cluster (1,2,...,K) to which example x(i) is currently assigned
33. Machine Learning
Page 32
μk = cluster centroid k (μk ∈ IRn)
μc
(i) = cluster centroid of cluster to which example x(i) has been assigned
Using these variables, we can define our cost function:
J(c(i), … , c(m), μ1, … , μK)==
1
𝑚
∑ .𝑚
𝑖=1 ||x(i) - μc
(i)||2
Our optimization objective is to minimize all our parameters using the above cost function.
That is, we are finding all the values in sets c, representing all our clusters, and μ, representing
all our centroids, that will minimize the average of the distances of every training example to
its corresponding cluster centroid.
The above cost function is often called the distortion of the training examples.
In the cluster assignment step, our goal is to:
Minimize J(…) with c(1),…,c(m) (holding μ1,…,μK fixed)
In the move centroid step, our goal is to:
Minimize J(…) with μ1,…,μK
With k-means, it is not possible for the cost function to sometimes increase. It should always
descend.
Random Initialization
There's one particular recommended method for randomly initializing the cluster centroids.
1. Have K<m. That is, make sure the number of your clusters is less than the number of
your training examples.
2. Randomly pick K training examples.
3. Set μ1,…,μk equal to these K examples.
K-means can get stuck in local optima. To decrease the chance of this happening, we can run
the algorithm on many different random initializations.
for i = 1 to 100:
randomly initialize k-means
run k-means to get 'c' and 'µ'
compute the cost function (distortion) J(c,µ)
pick the clustering that gave us the lowest cost
34. Machine Learning
Page 33
Choosing the Number of Clusters
Choosing K can be quite arbitrary and ambiguous.
The elbow method: plot the cost J and the number of clusters K. The cost function should
reduce as we increase the number of clusters, and then flatten out. Choose K at the point
where the cost function starts to flatten out.
J will always decrease as K is increased. The one exception is if k-means gets stuck at a bad
local optimum.
Another way to choose K is to observe how well k-means performs on a downstream purpose.
In other words, we choose K that proves to be most useful for some goal you're trying to
achieve from using these clusters.
Predictive Analysis Using Machine Learning-4
In this this part, we will implement the K-means algorithm and use it for image compression.
Problem
An image of resolution 128*128 pixels occupies 393,216 bits since each pixel is made up of 3
8-bit unsigned integer that represent RGB encoding, which specify the red, green, blue values
of each pixel. The image contains thousands of colours of varying RGB intensity. This causes
a lot of overhead on memory for storage.
The solution to this problem is to use less colours, in this case using K-means algorithm to
select the 16 colours that will be used to represent the compressed image.
Solution Intuition
In this problem, we only need to store the RGB values of the 16 selected colours, and for each
pixel in the image, we now need to only store the index of the colour at that location (where
only 4 bits are necessary to represent 16 possibilities).
The new representation requires some overhead storage in form of a dictionary of 16 colours,
each of which require 24 bits, but the image itself then only requires 4 bits per pixel location.
The final number of bits used is therefore 16*24 + 128*128*4 = 65,920 bits which corresponds
to compressing the original image by about a factor of 6.
35. Machine Learning
Page 34
K-means Implementation
The K-means algorithm is a method to automatically cluster similar data examples together.
The intuition behind K-means is an iterative procedure that starts by guessing the initial
centroids, and then refines this guess by repeatedly assigning examples to their closest
centroids and then re-computing the centroids based on the assignments.
Code:
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
idx = findClosestCentroids(X, centroids);
centroids = computeMeans(X, idx, K);
end
The first step in the loop is the cluster assignment step where each example is assigned to its
closest centroid. The second step is moving the centroid based on the means of all points
belonging to that particular cluster. The K-means algorithm will always converge to some final
set of means for the centroids. Note that the converged solution may not always be ideal and
depends on the initial setting of the centroids. Therefore, in practice the K-means algorithm
is usually run a few times with different random initializations.
Initializing centroids
A good strategy for initializing the centroids is to select random examples from the training
set.
Code:
function centroids = kMeansInitCentroids(X, K)
centroids = zeros(K, size(X, 2));
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :)
end
Finding Closest Centroids
In the "cluster assignment" phase of the K-means algorithm, the algorithm assigns every
training example x(i) to its closest centroid, given the current positions of centroids.
Code:
function idx = findClosestCentroids(X, centroids)
36. Machine Learning
Page 35
K = size(centroids, 1);
idx = zeros(size(X,1), 1);
m = size(X,1)
for i = 1:m
distance_array = zeros(1,K);
for j = 1:K
distance_array(1,j) = sqrt(sum(power((X(i,:)-centroids(j,:)),2)));
end
[d, d_idx] = min(distance_array);
idx(i,1) = d_idx;
end
Computing Centroid means
Given assignments of every point to a centroid, the second phase of the algorithm
recomputes, for each centroid, the mean of the points that were assigned to it.
Code:
function centroids = computeCentroids(X, idx, K)
[m n] = size(X);
centroids = zeros(K, n);
for i = 1:K
c_i = idx==i;
n_i = sum(c_i);
c_i_matrix = repmat(c_i,1,n);
X_c_i = X .* c_i_matrix;
centroids(i,:) = sum(X_c_i) ./ n_i;
end
K-means Intuition
Below are three different plots which visually show how clusters are assigned and centroids
are moved using K-means algorithm.
38. Machine Learning
Page 37
K-means on Pixels
We load the image to be compressed:
A = imread('bird small.png');
This creates a three-dimensional matrix A whose first two indices identify a pixel position and
whose last index represents red, green, or blue.
After finding the top K = 16 colors to represent the image, we can now assign each pixel
position to its closest centroid using the findClosestCentroids function. This allows you to
represent the original image using the centroid assignments of each pixel.
We can view the effects of the compression by reconstructing the image based only on the
centroid assignments. Specifically, we can replace each pixel location with the mean of the
centroid assigned to it.
idx = findClosestCentroids(X, centroids);
X_recovered = centroids(idx,:); //Recover Image
Reshape the recovered image into proper dimensions:
X_recovered = reshape(X_recovered, img_size(1), img_size(2), 3);
41. Machine Learning
Page 40
Conclusion
Machine Learning has a wide area of applications such as Computer Vision, Natural Language
Processing, Bioinformatics, Stock Market Analysis, DNA sequence etc. Deep Learning is an
advancement in the field of Machine Learning. It’ll help us in discerning underlying patterns
and make better business decisions.
Bibliography
Andrew NG, Machine Learning, Stanford University.
Pang Ning Tan, Michael Steinbach, Vipin Kumar, Data Mining.
Susheela Devi, Pattern Recognition, IISC.
Author
My name is Shreyas GS. I’m very passionate about Data Analytics and Machine Learning. My
career goal is to be a Data Scientist. I’m also an avid reader of books and my other interests
include playing guitar, chess and football. I’m an eclectic learner with a keen interest in
astronomy.