MACHINE LEARNING FOR DATA SCIENCE INTRODUCTION
By
NIKHIL GR
STUDENT
CSE
SJCIT
Contents
• Introduction to Data Science
• Applications of Data Science
• Foundations of Data Science
• Machine Learning
• Supervised Learning
• Classification
• Logistic Regression
• Decision Tree
• Random Forest
• K-Nearest Neighbor
• Support Vector Machine
• Regression
• Unsupervised Learning
• Cluster Analysis
• Principal Component Analysis
Introduction to Data Science
• Data science is a multidisciplinary field that uses scientific
methods, processes, algorithms and systems to extract knowledge
and insights from structured and unstructured data.
• It blends computer science, mathematics and
business/domain expertise.
Applications of Data Science
Foundations of Data Science
• Statistics: Descriptive, Inferential.
• Linear Algebra: Matrices, Planes, Vectors, etc.
• Computer Science: Algorithms, Graph Theory, Data Structures,
DBMS, etc.
• Machine Learning: Supervised, Unsupervised, Reinforcement.
• Business Analytics: Predictive, Prescriptive, Descriptive,
Decision.
• Programming: R/Python, SQL, NoSQL.
Machine Learning
• Machine learning is a subfield of computer science that focuses on
developing algorithms that learn from examples and improve
their performance on a task.
• Machine learning algorithms use training data, which is a set
of past observations.
• There are three broad categories of machine learning:
  • Supervised learning: learns from labeled examples.
  • Unsupervised learning: learns from unlabeled examples.
  • Reinforcement learning: learns from feedback provided by an environment.
• It is used to build predictive models that allow researchers and data
scientists to make predictions about the future from past and current data.
Supervised Learning
• Supervised learning is a category of machine learning algorithms. As the
name indicates, it is supervised by the presence of the output in the
training data.
• It learns from labeled data, that is, inputs for which the output is known.
• It builds a mathematical model from a set of data that contains both the
inputs and the desired outputs.
• A supervised learning algorithm analyzes the training data and
produces an inferred function, which can be used to map new
examples to outputs.
• Supervised learning problems are generally divided into
classification and regression problems.
Classification
• Classification in machine learning is a supervised learning
problem where the output variable is a category, such as “yes”
or “no”, or “disease” or “no disease”.
• The dependent variable is categorical, and its category is
predicted from several independent variables.
• A classification model attempts to draw a conclusion from
observed values.
• Given one or more inputs, a classification model tries to predict
the value of one or more outcomes.
• There are a number of classification models.
Classification through machine learning algorithms
The following popular machine learning algorithms are
used for classification problems:
• Logistic Regression
• Decision Tree
• Random Forest
• K-Nearest Neighbor
• Support Vector Machine
Logistic Regression
• This regression model is used when the dependent variable is
categorical.
• In its basic form the output is binary: the model estimates the
probability that an observation belongs to one of two categories.
Decision Tree
• A decision tree is a flowchart-like tree structure, where each
internal node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node holds a
class label.
Example:-
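As a rough illustration, a small decision tree can be written as nested dictionaries and applied by following branches until a leaf is reached. The attributes (outlook, humidity, windy) and the labels are hypothetical and are not taken from the original slide.

```python
# Internal nodes test an attribute, branches are test outcomes,
# and leaves (plain strings) hold the class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rainy": {
            "attribute": "windy",
            "branches": {True: "no", False: "yes"},
        },
    },
}

def classify(node, sample):
    """Walk from the root to a leaf, following the branch for each test outcome."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node

label = classify(tree, {"outlook": "sunny", "humidity": "high"})
```

This sketch only shows how a finished tree classifies a sample; learning the tree from data (choosing which attribute to test at each node) is a separate step not covered here.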
Random Forest
• A random forest, or random decision forest, is an ensemble
learning method that consists of a large number of decision trees.
• Each individual tree in the random forest outputs a class
prediction, and the class with the most votes becomes the
model’s prediction.
Example:
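The voting step can be sketched as follows. The three "trees" here are hypothetical one-rule functions standing in for trained decision trees, and the feature names are invented for the example; a real random forest would also train each tree on a random sample of the data and features.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label that received the most votes."""
    return Counter(predictions).most_common(1)[0][0]

def forest_predict(trees, sample):
    """Collect one prediction per tree and return the majority class."""
    return majority_vote([tree(sample) for tree in trees])

# Hypothetical stand-ins for trained trees: each one maps a sample to a class.
trees = [
    lambda s: "spam" if s["links"] > 3 else "ham",
    lambda s: "spam" if s["caps_ratio"] > 0.5 else "ham",
    lambda s: "spam" if s["length"] < 20 else "ham",
]

sample = {"links": 5, "caps_ratio": 0.2, "length": 10}
prediction = forest_predict(trees, sample)  # votes: spam, ham, spam
```

Using an odd number of trees avoids ties when there are only two classes.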
K-Nearest Neighbor
• In k-NN classification, the output is the class membership of a
new observation.
• An object is classified by a plurality vote of its neighbors, with
the object being assigned to the class most common among its
k nearest neighbors.
• Example:
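A minimal sketch of k-NN classification using Euclidean distance and a plurality vote. The training points, the labels "A" and "B", and the choice k = 3 are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by a plurality vote among its k nearest training points."""
    # Sort the labeled points by Euclidean distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled observations: (features, class).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.5), "B"), ((6.5, 7.0), "B"),
]

label_near_a = knn_classify(train, (1.2, 1.5))
label_near_b = knn_classify(train, (6.4, 6.8))
```

Note that k-NN has no separate training phase in this form: all the work happens when a new observation is classified.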
Support Vector Machine
• In a Support Vector Machine (SVM), we plot each data item as a
point in n-dimensional space (where n is the number of features)
with the value of each feature being the value of a
particular coordinate.
• We then perform classification by finding the
hyperplane that best separates the two classes.
• To identify the hyperplane, we maximize the margin, that is, the
distance between the boundary elements of the separated classes.
• A variety of kernel functions are used to separate observations,
depending on whether they are linearly separable or non-linearly
separable.
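The idea of preferring the hyperplane with the largest margin can be illustrated by comparing two candidate hyperplanes on a tiny data set. This is only a sketch: the points and both candidates are hypothetical, and an actual SVM solves an optimization problem over all hyperplanes instead of choosing between two.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest label-signed distance; positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Hypothetical linearly separable data with labels -1 and +1.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating hyperplanes, each given as (w, b).
candidates = {
    "vertical": ((1.0, 0.0), -3.0),   # the line x1 = 3
    "diagonal": ((1.0, 1.0), -5.5),   # the line x1 + x2 = 5.5
}

# Prefer the candidate whose closest point is farthest away.
best = max(candidates, key=lambda name: margin(*candidates[name], points, labels))
```

Both candidates separate the classes, but the diagonal line leaves more room between itself and the nearest points, so it is the one selected.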
Regression
• Regression in machine learning is a supervised learning problem
where the output variable is a real or continuous value, such as
“salary” or “weight”.
• Many different models can be used; the simplest is linear
regression.
• It fits the data with the line or hyperplane that passes as close
as possible to the points.
• Techniques used for regression analysis include
Linear Regression, Decision Tree Regression and Random Forest
Regression.
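For a single input variable, the best-fitting line in the least-squares sense has a closed-form solution, sketched below. The data (years of experience versus salary in thousands) is hypothetical and was constructed to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data following salary = 30 + 2 * years.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [32.0, 34.0, 36.0, 38.0, 40.0]

slope, intercept = fit_line(years, salary)
predicted = slope * 6.0 + intercept  # predicted salary for 6 years of experience
```

With real, noisy data the points would not lie exactly on the fitted line; the same formulas then give the line that minimizes the sum of squared vertical distances.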
Unsupervised Learning
• Unsupervised learning is performed on unlabeled data:
no output labels (categories) are given in the
data.
• Here the task of the machine is to group unsorted information
according to similarities, patterns and differences without any
prior training on labeled data.
• Two of the main methods used in unsupervised learning are:
  • Principal component analysis, and
  • Cluster analysis.
Principal Component Analysis
• Principal component analysis is a method of extracting
important variables from a large set of variables available in a
data set.
• It extracts a low-dimensional set of features from a
high-dimensional data set while capturing as much
information as possible.
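One way to sketch the idea is to compute the first principal component, the direction of greatest variance, by power iteration on the covariance matrix, and then project the data onto it. The data set is hypothetical, and power iteration is just one simple way to find the leading eigenvector.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (means, unit vector) for the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    # Assumes the starting vector is not orthogonal to the leading eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return means, v

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]

# Hypothetical 2-D data lying on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = first_principal_component(data)
reduced = project(data, means, component)
```

Here two features are reduced to one without losing information, because the hypothetical data varies along a single direction; real data usually loses some variance when dimensions are dropped.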
Cluster Analysis
Vaibhav Kumar@DIT
University
• Cluster analysis or clustering is the task of grouping a set of
objects in such a way that objects in the same group (called a
cluster) are more similar (in some sense) to each other than to
those in other groups (clusters).
• Cluster analysis can be achieved by various algorithms that
differ significantly in their understanding of what constitutes a
cluster and how to efficiently find them.
Example:
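As one concrete instance of the many clustering algorithms mentioned above, here is a minimal k-means sketch. The points and the starting centroids are hypothetical, and k-means is chosen only because it is simple to show; other algorithms define clusters differently.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the centroid closest to this point.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Hypothetical starting centroids; in practice these are often chosen at random.
centroids, clusters = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
```

No labels are supplied at any point: the two groups emerge purely from the distances between observations.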
Thank You
