Working With Python
Algorithm Implementations In Python
The algorithms involved in machine learning and data science fall into two main types of
implementation:
• Classification
• Regression
We will study some algorithms of both types and see how they help process data and draw
important insights from it.
Linear Regression
Linear regression comes under predictive analysis and is used to find the relationship between
two variables. These two variables are the target variable and the predictor variable. The
dependent variable is the target variable and the independent variable is the predictor variable.
Both of these variables are features that already exist in a dataset.
The overall concept of regression is to check two things: does the given group of predictor
variables do a satisfactory job of predicting the dependent variable? And which variables, in
particular, are the real predictors of the dependent variable, and what is their impact on the
outcome variable?
Linear regression is represented by a simple equation:
Y = b*x + c
where Y is the dependent variable, x is the independent variable, b is the regression coefficient
(the slope of the line), and c is the intercept (a constant).
The Line of Best Fit
The line of best fit is a line which demonstrates the correlation between the observed or actual
values and the predicted ones. After applying the linear regression algorithm to our data, we
use this line to check how close the predicted values are to the actual ones. It helps in
minimizing the distance between those two sets of values, known as the error values. They are
also referred to as residuals. These residuals are represented by vertical lines showing the
comparison between the predicted and actual values.
For example, we can see that the weight of a person increases with an increase in their age.
Therefore, the blue line represents our line of best fit, which is also known as the regression line.
For calculating the distance between the line and the points, we need the following formula
SS(residual)= ∑[h(x)-y]^2
where h(x) is the predicted value and y is the actual value
The Cost Function
Let us consider an example to understand this case. A sales department of a company is
planning to invest some capital to increase its sales in the next 6 months. But, they couldn't hit
their targets and had to incur some loss. Hence, to minimize that loss, we use the cost function.
This cost function is applied to represent and calculate the error of the model.
Therefore, the cost function is J(Θ0, Θ1) = (1/2m) ∑[h(x)-y]^2, where m is the number of rows
in the training set.
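As a quick check of the formula above, here is a minimal NumPy sketch that computes J for a few made-up predicted and actual values:

```python
import numpy as np

def cost(y_pred, y_true):
    """Cost J = 1/(2m) * sum((h(x) - y)^2), where m is the number of rows."""
    m = len(y_true)
    return np.sum((y_pred - y_true) ** 2) / (2 * m)

y_true = np.array([3.0, 5.0, 7.0])   # actual values y
y_pred = np.array([2.5, 5.0, 8.0])   # predicted values h(x)
print(cost(y_pred, y_true))          # (0.25 + 0 + 1) / 6 ≈ 0.2083
```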
Gradient Descent
Gradient Descent is yet another important term, used to find the minimum of a cost
function or an equation. It is one of the most widely used optimization algorithms in machine
learning and deep learning. Given a convex function, gradient descent iteratively makes small
tweaks and changes to its parameters in order to bring the function to a local minimum, if
possible.
Gradient Descent can be imagined as climbing down to the bottom of a mountain, instead of
climbing up. This is because it is a minimization technique used to minimize a given local
function.
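To make the idea concrete, here is a minimal sketch of gradient descent for the simple linear model Y = b*x + c; the toy dataset (which follows y = 2x + 1 exactly), the learning rate and the iteration count are all arbitrary choices for illustration:

```python
import numpy as np

# Toy data that follows y = 2x + 1 exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

b, c = 0.0, 0.0   # slope and intercept, both start at zero
lr = 0.05         # learning rate: the size of each small tweak
m = len(x)

for _ in range(5000):
    h = b * x + c                     # current predictions h(x)
    grad_b = np.sum((h - y) * x) / m  # partial derivative dJ/db
    grad_c = np.sum(h - y) / m        # partial derivative dJ/dc
    b -= lr * grad_b                  # step downhill on the cost surface
    c -= lr * grad_c

print(round(b, 3), round(c, 3))  # approaches the true slope 2 and intercept 1
```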
Code in python
# Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 1/3, random_state = 0)
# Performing feature scaling on the features
# (the target y is left unscaled so predictions stay in the original units)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
# Fitting the Simple Linear Regression model to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
# Test set results prediction
y_pred = regressor.predict(x_test)
Logistic Regression
Logistic regression is a statistical technique that comes under classification rather than
regression. Like all regression techniques, logistic regression is a form of predictive analysis.
It is used to describe the structure of data and explain the correlation between a dependent
binary variable and one or more nominal independent variables.
It is favorable for predicting binary outcomes as 1/0 or yes/no or true/false considering the kind
of dataset given and the output required. Logistic regression can also be considered as a
special case of linear regression when the outcome variable is categorical, where we are using
log of odds as the dependent variable. In simple words, it predicts the probability of occurrence
of an event by fitting data to a logit function.
This type of regression can be characterized by probabilities of following events-
Odds = p/(1-p) = probability of event occurring/probability of event not occurring
Ln (odds) = ln (p/(1-p))
Logit (p) = ln (p/(1-p))
In this, p/(1-p) is the odds ratio. If the log of the odds ratio is positive, the probability of
success will always be higher than 50%. In a typical logistic model plot, it is observed that the
probability never goes below 0 or above 1.
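The relationship between probability, odds and the logit can be sketched in a few lines of Python; the sigmoid function below is the inverse of the logit, turning a real-valued score back into a probability:

```python
import math

def logit(p):
    """Log of the odds ratio: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real z back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

p = 0.8
z = logit(p)                 # positive, since p > 0.5
print(z > 0)                 # True: success is more likely than failure
print(round(sigmoid(z), 3))  # recovers the original probability 0.8
```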
We can check the performance of this regression by testing it through the following parameters.
Akaike Information Criterion- AIC is a measure of fit that penalizes a model for the
number of its coefficients. Therefore, we always prefer the model with the minimum
AIC value for better results.
Null Deviance- Null deviance represents how well the response is predicted by a model with
only the intercept. The lower the null deviance, the better the model.
Residual Deviance- Residual deviance describes how well the response is predicted by the
model after adding the independent variables. Again, the lower the value, the better the
results.
Confusion Matrix- The confusion matrix is a tabular representation of actual vs predicted
values. It helps in evaluating the performance of a classification model and in spotting which
kinds of errors it makes.
                     Predicted Positive    Predicted Negative
Actual Positive      True Positive (TP)    False Negative (FN)
Actual Negative      False Positive (FP)   True Negative (TN)
The accuracy of a model can be calculated as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
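As a small worked example of the accuracy formula (the counts below are made up):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts: 50 TP, 40 TN, 5 FP, 5 FN
print(accuracy(50, 40, 5, 5))  # 90 correct out of 100 -> 0.9
```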
ROC Curve
The receiver operating characteristic curve, or ROC curve, signifies how well the model can
distinguish between two classes by plotting the true positive rate against the false positive rate.
A good model will be able to accurately distinguish between the two, whereas a poor model will
have difficulty differentiating them.
Code in python
# Importing the necessary libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import scipy
from scipy import stats
from scipy.stats import spearmanr
# Retrieving the dataset
t1= 'C:/Users/ml/datasets/train.csv'
train=pd.read_csv(t1)
t2= 'C:/Users/ml/datasets/test.csv'
test=pd.read_csv(t2)
x= train.iloc[:, [2,4,5,6,7,9]].values
y= train.iloc[:, 1].values
# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
# Fitting the Logistic Regression model to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)
# Test set results prediction
y_pred = classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
Support Vector Machines
Support Vector Machines (SVMs) are used to find the hyperplane through an array of data
points that best separates them in a supervised learning environment. Suppose we have two
columns x and y consisting of some random data points. These points are plotted in a
two-dimensional plane. Our goal is to derive a line that separates these points.
A line that separates these points, whether horizontally, vertically or diagonally, is known as a
hyperplane. The distance between a candidate hyperplane and the nearest data points is used
to determine the most appropriate hyperplane for classifying these points. This distance is
known as the margin.
SVM supports both regression and classification tasks and can tackle multiple continuous and
categorical variables. For categorical variables, a dummy variable is created with case values
as either 0 or 1. Thus, a categorical dependent variable consisting of three levels, say A, B, C, is
represented by a set of three dummy variables
A: {1 0 0}
B: {0 1 0}
C: {0 0 1}
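The three-level dummy encoding above can be sketched in plain Python (the helper name is mine, purely for illustration):

```python
levels = ["A", "B", "C"]

def to_dummies(value):
    """One dummy variable per level: 1 for the matching level, 0 otherwise."""
    return [1 if value == lvl else 0 for lvl in levels]

print(to_dummies("A"))  # [1, 0, 0]
print(to_dummies("B"))  # [0, 1, 0]
print(to_dummies("C"))  # [0, 0, 1]
```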
As we all know how to identify a hyperplane, the question is how to identify the right one?
We can reach a conclusion by considering the following cases.
CASE 1
There are three hyperplanes in our n-dimensional space: x1, x2 and x3. We need to
identify the right hyperplane among the three. x1 and x3 are traversing between the points
while x2 separates these points in a perfect fashion. Hence, x2 is our ideal hyperplane.
CASE 2
The three hyperplanes x1, x2 and x3 all segregate the points quite well, as they are vertical
and parallel to each other. So, how can we identify the right hyperplane in this situation? x1 and
x3 are nearer to the points, which means their margins are quite small compared to x2. Hence,
x2 has the larger margin, and it is therefore the ideal hyperplane.
CASE 3
In the third case, all the points are residing very close to each other in the center of the plane
with little or no room for the hyperplane to pass between them. What can we do in such a case?
This problem can be dealt with by adding a third axis, the z-axis! If we define z as x^2 + y^2, all
the values for z will be positive, since z is the squared sum of both x and y. Sometimes,
manually adding such a feature won't be applicable, and the kernel trick comes into play for
such scenarios. It converts the not so separable problem (the scenario discussed above) into a
separable problem. These functions are called kernels. They are useful in non-linear separation
problems. Simply put, a kernel performs some extremely complex data transformations, then
finds out the process to separate the data based on the labels or outputs that have been
defined.
Code in python
# Importing the important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting the SVM model to the training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(x_train, y_train)
# Test set results prediction
y_pred = classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Decision Trees
Decision trees are among the most preferred and favored classification techniques in
machine learning. They not only help us with predictive analysis but are also very effective
for understanding the characteristics of various variables. They are supervised learning
algorithms with a predefined target variable which is to be determined, and are suited to
both categorical and continuous output variables.
The basic functioning of decision trees goes this way- there are a set of points that are plotted
on a plane. These points can’t be separated easily by a line due to their heterogeneous
properties. Hence, decision trees divide these points into different clusters or leaves based on
some predefined criteria and take care of them individually.
There are two different types of decision trees which are classified based on the type of target
variable we have taken.
Binary Variable Decision Tree- The decision tree which has a binary target variable is known
as Binary Variable Decision Tree. In this case, the output will be either “yes” or “no”.
Continuous Variable Decision Tree- The decision tree which has a continuous target variable
is known as Continuous Variable Decision Tree. In this case, the output will be any recurring
value such as the salary of a person.
Let us go through some of the key terms commonly used in decision trees.
Root Node- It represents the entire population or the given sample and further gets divided into
two or more homogeneous sets.
Splitting- It enables the division of a node into two or more sub-nodes.
Decision Node- This is like sub-node splitting into further sub-nodes.
Leaf/Terminal Node- These are nodes with zero sub-nodes, that is, these nodes can’t be split
further.
Pruning- When the size of the decision trees is reduced by removing nodes, the process is
called pruning.
Branch/Subtree- A subsection of a decision tree is called a branch or a sub-tree.
Parent and Child Node- A node which is divided further into sub-nodes is called the parent
node, whereas the sub-nodes are the children of that parent node.
There are some important terms that we first need to understand before we can implement
decision trees in python.
Impurity
Impurity is the measure of mixing in the data, which is evident when there are traces of
one class within another. There are reasons for its existence: for example, the decision tree
can run out of attributes with which to divide a class any further. We also usually allow some
percentage of impurity in the splits for better generalization, which introduces impurity into
our humble model!
Entropy
Entropy is the degree of disorder of the elements or, in other terms, a measure of impurity.
Mathematically, it can be calculated with the help of the probabilities of the items as:
H = -Σ p(x)*log[p(x)]
That is, the negative summation over items x of the probability of x times the log of that
probability.
Information Gain
Information gain is the main ingredient that is instrumental in the construction and setting up of a
decision tree. Constructing a decision tree from scratch is all about finding the attribute that will
return the highest information gain in order to produce maximum accuracy in the decision trees.
Therefore, IG is equal to entropy(parent) - [weighted average entropy of the children].
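Here is a minimal sketch of the entropy and information gain formulas in Python, using log base 2 so entropy is measured in bits; the ten-label dataset is made up:

```python
import math

def entropy(labels):
    """H = -sum over classes of p(x) * log2 p(x)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, children):
    """entropy(parent) - weighted average entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5   # maximally impure: H = 1 bit
split = [["yes"] * 5, ["no"] * 5]   # a perfect split: pure children
print(information_gain(parent, split))  # 1.0: all impurity removed
```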
Code in python
# Importing the important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting the Decision Tree Classification model to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
# Test set results prediction
y_pred = classifier.predict(X_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Random Forest
The Random Forest algorithm is another supervised classification approach, after decision
trees. It is a natural extension of the decision tree algorithm. There is a correlation between
the number of trees in the forest and the results it produces: the higher the number of trees,
the more accurate the result.
Random forests are an ensemble learning technique for classification and regression. A
random forest avoids the problem of overfitting provided there are enough trees in the model.
Another advantage is that a random forest classifier can easily manage missing values. It can
also be modeled for categorical values.
Working
Working of the random forest depends on 2 stages- one is creating a random forest and the
other is making predictions and extracting useful observations from the random forest classifier
created in the first stage.
These are some of the steps used in the creation of random forests.
• We need to select some random “k” features out of the total “m” features where k is less
than m.
• Among the selected “k” features, we need to calculate the node “d” using the best split
point.
• We need to split the node into further nodes using the derived best split.
• Steps 1, 2 and 3 must be repeated until some “l” number of nodes has been achieved.
• Construct the forest by re-applying steps 1 to 4 for “n” number of times to create “n”
number of trees.
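The sampling at the heart of these steps can be sketched in a few lines. This toy snippet (with made-up data and parameters) illustrates only the row bootstrapping and random feature selection that each tree in the forest relies on, not a full tree builder:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # 100 samples with m = 6 features (toy data)
k = 2                           # features considered at each split, k < m
n_trees = 5                     # "n" trees in the forest

for _ in range(n_trees):
    # Each tree is grown on a bootstrap sample of the rows...
    rows = rng.integers(0, len(X), size=len(X))
    sample = X[rows]
    # ...and steps 1-3 consider only a random subset of k features per split
    feats = rng.choice(X.shape[1], size=k, replace=False)
    candidate_columns = sample[:, feats]

print(candidate_columns.shape)  # (100, 2): resampled rows, 2 of 6 features
```

Because every tree sees a different bootstrap sample and a different feature subset at each split, the trees are decorrelated, which is what makes averaging their votes effective.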
Applications
Stock market- A random forest can be used to identify the right stock which can attract profits
for the user at most times.
E-commerce- It can be effective in this field by predicting the products which the customer can
buy in future, based on their past choices.
Banking- It can recognize the defaulters and the non-defaulters by analyzing the behavior of
the customer through their past records.
Code in Python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting the Random Forest Classification model to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 7, criterion = 'entropy', random_state = 0)
classifier.fit(x_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
K-means clustering
Clustering is the process of grouping the given data points into a number of groups or classes
such that the data points in the same group share similar features and characteristics. In
simple words, k-means segregates points with similar properties into groups and assigns them
to clusters.
Working
It starts with specifying the desired number of clusters ‘k’ required, let’s consider k as 2 for the
five random data points in 2-D space.
Then, we need to randomly assign each data point to a cluster. We will assign three points to
cluster 1, shown in red, and two points to cluster 2, shown in grey.
Next, we need to compute centroids for these clusters, the centroid of data points in the red
cluster is signified by a red cross while for the grey cluster, it is shown using a grey cross.
Then comes the step of re-assigning each individual data point to the closest cluster centroid.
The data point at the bottom had been assigned to the red cluster even though it is closer to
the centroid of the grey cluster; hence, we re-assign that data point to the grey cluster.
In the end, we need to recompute cluster centroids- We have to recompute the centroids for
both the clusters.
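The assign/recompute loop described above can be sketched directly in NumPy; the five 2-D points and the initial assignment below are made up to mirror the walkthrough:

```python
import numpy as np

# Five 2-D points and k = 2 clusters, mirroring the walkthrough above
points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                   [5.0, 7.0], [3.5, 5.0]])
labels = np.array([0, 0, 1, 1, 0])        # an arbitrary initial assignment

for _ in range(10):
    # Compute each cluster's centroid as the mean of its points
    centroids = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])
    # Re-assign every point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    if np.array_equal(new_labels, labels):  # converged: assignments stable
        break
    labels = new_labels

print(labels)   # the two tight groups end up in separate clusters
```

The loop alternates exactly the two steps from the walkthrough (recompute centroids, re-assign points) and stops once no point changes cluster.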
Feature engineering is the process of using domain knowledge and expertise to choose
which data variables to input as features before building a machine learning model. It plays a
key role in k-means clustering: using meaningful features that capture the variability and
essence of the data is essential before feeding the selected features to k-means.
Feature transformations are conducted, particularly to represent rates rather than raw
measurements, which helps normalize the data. At times, such engineering is observed to
remove a large share of the error in a dataset, and it is effective in maintaining the accuracy of
the machine learning model that is built to draw insights from the data.
Code in Python
# Importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('customers.csv')
x = dataset.iloc[:, [3, 4]].values
# k-means is unsupervised, so no target vector or train/test split is needed
# Performing Feature Scaling (k-means is distance-based, so scaling matters)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x = sc_x.fit_transform(x)
# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
# Visualising the results using plots
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
# Fitting the K-Means algorithm to our dataset
kmeans = KMeans(n_clusters = 10, init = 'k-means++', random_state = 55)
y_kmeans = kmeans.fit_predict(x)
K-nearest Neighbor(K-NN)
K-nearest neighbor can be used for both classification and regression problems. A KNN
model groups data points, in this case features of a dataset, in a homogeneous way: points
that are similar to each other sit close together. When a new point is introduced in the
plane, it is classified according to the homogeneous group or class whose characteristics it
matches most closely.
It is a non-parametric approach, meaning it makes no assumption that the data follow a
normal (or any other particular) distribution. It is also referred to as a lazy classification model,
since it defers all computation to prediction time and predicts classes from the features of the
closest matching observations.
Selecting the number of nearest neighbors, that is, selecting the value of k, plays a significant
role in determining the capacity of our model. The choice of k governs how well the data can
be used to characterize the results of the kNN algorithm. A large k generally reduces the
variance caused by noisy data, but at the cost of higher bias: it may smooth over small
patterns in the data that could have been informative. A small k captures fine structure but is
more sensitive to noise.
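A minimal, self-contained sketch of this trade-off (the data, query point, and helper `knn_predict` are made up for illustration): with k = 1 a single noisy neighbor decides the class, while a larger k votes it down:

```python
import numpy as np

def knn_predict(X_train, y_train, query, k):
    # Majority vote among the k nearest training points (Euclidean distance)
    d = np.linalg.norm(X_train - query, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    return int(np.bincount(nearest).argmax())

# Toy data: class 0 near the origin, class 1 near (5, 5), plus one
# mislabeled "noise" point at (1.2, 1.2)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [5, 6], [6, 5], [1.2, 1.2]])
y = np.array([0, 0, 0, 1, 1, 1, 1])       # the last label is noise

query = np.array([1.0, 1.0])
print(knn_predict(X, y, query, k=1))      # 1: the lone noisy neighbor wins
print(knn_predict(X, y, query, k=5))      # 0: a larger k votes the noise down
```

In practice, k is usually chosen by evaluating several values on held-out data rather than by inspection.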
There are many data points in the plane whose distance can be calculated by the following
techniques.
Euclidean Distance: Euclidean distance is the square root of the sum of the squared
differences between a new point (x) and an existing point (y).
ED = √Σ(x-y)²
Manhattan Distance: Manhattan distance is the distance between vectors using the sum of
their absolute difference.
MD= Σ|x-y|
Hamming Distance: It is suited to categorical variables. For each position, the distance D is
zero if the value (x) and the value (y) are the same, and one if they differ; the total distance is
the sum over all positions.
HD = ΣD, where D = 0 when x = y and D = 1 when x ≠ y
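The three distance measures can be computed directly; the numeric and categorical vectors below are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))    # sqrt(9 + 16 + 0) = 5.0
# Manhattan: sum of absolute differences
manhattan = np.sum(np.abs(x - y))            # 3 + 4 + 0 = 7.0

# Hamming: number of positions where two categorical vectors differ
a = ["red", "small", "round"]
b = ["red", "large", "round"]
hamming = sum(ai != bi for ai, bi in zip(a, b))   # 1

print(euclidean, manhattan, hamming)
```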
KNN is mostly used for searching purposes. It enables the search by finding the nearest item to
the customers' interests. It can also be implemented for building Recommender Systems. It will
find similar items based on the user's personal taste or preference. Normally, the KNN algorithm
is not preferred much when compared to SVM or neural networks as it runs slower compared to
other algorithms.
Code in Python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the data into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting our K-nearest neighbor model to the Training data
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(x_train, y_train)
# Test data result prediction
y_pred = classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Naive Bayes
Naive Bayes is a basic technique for building classifiers. These models assign class labels to
problem instances, represented as vectors of features. It belongs to the family of classification
techniques based on Bayes' theorem, with the assumption that the predictor variables are
independent of one another.
In plain terms, a naive Bayes classifier calculates the probability of the outcome assuming that
the presence of one defining feature in a class is unrelated to the presence of any other
feature. For instance, a knife may be described by features such as sharpness, being made of
stainless steel, and a length of 20 inches. These features do not depend on each other for
their existence; likewise, a naive Bayes approach treats each variable as contributing
independently to the overall probability.
Naive Bayes classifiers are trained in a supervised learning setting, with different probability
models for different sorts of data. In many practical applications, parameter estimation for
naive Bayes models relies on maximum likelihood estimation, which means one can use a
naive Bayes model without computing full Bayesian posteriors or using any other Bayesian
methods.
P(c|x) = P(x|c) * P(c) / P(x)
where P(c|x) is the posterior probability of the target class c given the predictor x (the
features), P(c) is the prior probability of the class, P(x|c) is the likelihood, i.e. the probability of
the predictor given the class, and P(x) is the prior probability of the predictor.
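Plugging made-up numbers into Bayes' theorem shows how the posterior is computed; here P(x) is expanded by the law of total probability over the two classes:

```python
# Made-up numbers: P(c) is the class prior, P(x|c) the likelihood of seeing
# feature x in that class, P(x|not c) the likelihood in the other class
p_c = 0.4
p_x_given_c = 0.9
p_x_given_not_c = 0.2

# P(x) expanded by the law of total probability over both classes
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
posterior = p_x_given_c * p_c / p_x
print(round(posterior, 3))    # 0.75
```

Even though the feature is more common in class c, the modest prior keeps the posterior at 0.75 rather than 0.9; this is the arithmetic the GaussianNB classifier below performs per feature.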
Code in Python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('SN_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the data into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting the Naive Bayes model to the Training data
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)