- 1. Machine Learning Presented by Mr. Raviraj Solanki
- 2. Unit-1 Topics Introduction to Machine Learning, Model Preparation, Modelling and Evaluation Human learning versus machine learning, types of machine learning, applications of machine learning, tools for machine learning, Machine Learning Activities, Data structures for machine learning, Data Pre-processing, selecting a model, training a model, model representation and interpretability, evaluating performance of a model, improving performance of a model
- 3. Introduction to ML Machine learning is a growing technology that enables computers to learn automatically from past data. Machine learning uses various algorithms to build mathematical models and make predictions using historical data or information. Currently, it is used for tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
- 4. Machine Learning Definitions Algorithm: A Machine Learning algorithm is a set of rules and statistical techniques used to learn patterns from data and draw significant information from it. It is the logic behind a Machine Learning model. An example of a Machine Learning algorithm is the Linear Regression algorithm. Model: A model is the main component of Machine Learning. A model is trained by using a Machine Learning algorithm. An algorithm maps all the decisions that a model is supposed to take based on the given input, in order to get the correct output. Predictor Variable: One or more features of the data that can be used to predict the output.
- 5. Response Variable: The feature or output variable that needs to be predicted by using the predictor variable(s). Training Data: The Machine Learning model is built using the training data. The training data helps the model to identify key trends and patterns essential to predict the output. Testing Data: After the model is trained, it must be tested to evaluate how accurately it can predict an outcome. This is done using the testing data set.
- 6. What is Machine Learning? Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experiences on its own. The term machine learning was first introduced by Arthur Samuel in 1959. In simple words, ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method. The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed and without human intervention. Machine learning enables a machine to automatically learn from data, improve performance from experience, and predict things without being explicitly programmed.
- 9. With the help of sample historical data, known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together to create predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the better the performance. A machine has the ability to learn if it can improve its performance by gaining more data.
- 10. A machine is said to learn (in Tom Mitchell's classic definition) if it can: Improve its performance (P) At executing some task (T) Over time with experience (E)
- 11. Human learning versus machine learning In traditional programming, a programmer codes all the rules in consultation with an expert in the industry for which the software is being developed. Each rule is based on a logical foundation; the machine executes an output following the logical statements. As the system grows more complex, more rules need to be written, and maintenance can quickly become unsustainable.
- 12. Machine learning is supposed to overcome this issue. The machine learns how the input and output data are correlated and it writes a rule. The programmers do not need to write new rules each time there is new data. The algorithms adapt in response to new data and experiences to improve efficacy over time.
- 13. How does Machine Learning work? A Machine Learning system learns from historical data, builds prediction models, and whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a large amount of data helps build a better model, which predicts the output more accurately. Suppose we have a complex problem where we need to make predictions. Instead of writing code for it, we just need to feed the data to generic algorithms; with their help, the machine builds the logic as per the data and predicts the output. Machine learning has changed our way of thinking about such problems.
- 16. The life of Machine Learning programs is straightforward and can be summarized in the following points: Define a question Collect data Visualize data Train the algorithm Test the algorithm Collect feedback Refine the algorithm Loop steps 4-7 until the results are satisfactory Use the model to make predictions Once the algorithm gets good at drawing the right conclusions, it applies that knowledge to new sets of data.
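The steps above can be sketched in code. The following is a minimal, illustrative loop using scikit-learn; the dataset (Iris) and model (a decision tree) are arbitrary choices for demonstration, not prescribed by these slides:

```python
# A minimal sketch of the ML program life cycle with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1-2. Define the question (classify iris species) and collect data
X, y = load_iris(return_X_y=True)

# 3. (Visualization omitted here.)  4. Train the algorithm
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 5. Test the algorithm
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")

# 6-8. Collect feedback and refine (e.g. tune hyperparameters, loop), then
# 9. use the model to predict on new data
print(model.predict(X_test[:1]))
```

In a real project, steps 6-8 (feedback and refinement) repeat until the test accuracy is acceptable.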
- 17. Types of machine learning
- 20. 1) Supervised Learning Supervised learning is a machine learning method in which we provide sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output. The labeled data set is nothing but the training data set. The system creates a model using labeled data to understand the datasets and learn about each example. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output or not. The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, much like a student learning under the supervision of a teacher.
- 21. Example of Supervised Learning
- 22. The machine is given images of Tom and Jerry, and the goal is for the machine to identify and classify the images into two groups (Tom images and Jerry images). The training data set fed to the model is labeled, as in, we're telling the machine, 'this is how Tom looks and this is Jerry'. By doing so you're training the machine using labeled data. In Supervised Learning, there is a well-defined training phase done with the help of labeled data.
- 23. Supervised learning can be grouped further in two categories of algorithms:
- 25. Classification: A Supervised Learning task where the output has defined labels (discrete values). Classification algorithms are used when the output variable is categorical, meaning there are classes such as Yes-No, Male-Female, True-False, etc. For example, in the figure above, the output 'Purchased' has defined labels, i.e. 0 or 1; 1 means the customer will purchase and 0 means the customer won't purchase. The goal here is to predict discrete values belonging to a particular class and evaluate on the basis of accuracy. Classification can be either binary or multi-class. In binary classification, the model predicts either 0 or 1 (yes or no), but in multi-class classification, the model predicts one of more than two classes. Example: Gmail classifies mails into more than one class, such as social, promotions, updates, and forum.
- 26. Below are some popular classification algorithms which come under supervised learning: Random Forest Decision Trees Logistic Regression Support Vector Machines
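As a small sketch of classification, logistic regression can be fit to tiny hand-made "purchase" data; the features and labels below are invented for illustration and are not the figure from the slides:

```python
# Hypothetical data: features are [age, salary in thousands];
# label 1 = customer purchased, 0 = did not purchase.
from sklearn.linear_model import LogisticRegression

X = [[22, 15], [25, 20], [47, 70], [52, 90], [46, 60], [56, 95]]
y = [0, 0, 1, 1, 1, 1]

clf = LogisticRegression().fit(X, y)

# Predict the discrete class for two unseen customers
preds = clf.predict([[24, 18], [50, 80]])
print(preds)
```

The model outputs a discrete label (0 or 1) for each input, which is exactly what distinguishes classification from regression.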
- 28. Regression: A Supervised Learning task where the output has a continuous value. Regression algorithms are used when there is a relationship between the input variable and the output variable. It is used for the prediction of continuous variables, such as weather forecasting, market trends, etc. For example, in Figure B above, the output 'Wind Speed' does not have discrete values but is continuous within a particular range. The goal here is to predict a value as close to the actual output value as our model can, and evaluation is done by calculating the error value. The smaller the error, the greater the accuracy of our regression model.
- 29. Below are some popular Regression algorithms which come under supervised learning: Linear Regression Regression Trees Non-Linear Regression Bayesian Linear Regression Polynomial Regression
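A regression sketch in the same spirit: fitting a line to synthetic data that follows y = 2x + 1 exactly, then predicting a continuous value for an unseen input:

```python
# Synthetic data generated from y = 2x + 1 (illustrative only).
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]
y = [3, 5, 7, 9, 11]

reg = LinearRegression().fit(X, y)

# The prediction is a continuous value, not a class label
pred = reg.predict([[6]])[0]
print(round(pred, 2))
```

Because the data is noiseless, the fitted line recovers the slope and intercept exactly, so the prediction for x = 6 is 13; with real data the error value would be computed as the slide describes.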
- 30. Regression Algorithm Classification Algorithm In Regression, the output variable must be of continuous nature or real value. In Classification, the output variable must be a discrete value. The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). The task of the classification algorithm is to map the input value (x) to the discrete output variable (y). Regression algorithms are used with continuous data. Classification algorithms are used with discrete data. In Regression, we try to find the best-fit line, which can predict the output more accurately. In Classification, we try to find the decision boundary, which can divide the dataset into different classes. Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. Classification algorithms can be used to solve classification problems such as identification of spam emails, speech recognition, identification of cancer cells, etc.
- 31. Regression Algorithm Classification Algorithm The regression algorithm can be further divided into Linear and Non-linear Regression. The classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers.
- 32. Advantages of Supervised learning: With the help of supervised learning, the model can predict the output on the basis of prior experiences. In supervised learning, we can have an exact idea about the classes of objects. Supervised learning model helps us to solve various real-world problems such as fraud detection, spam filtering, etc.
- 33. Disadvantages of supervised learning: Supervised learning models are not suitable for handling very complex tasks. Supervised learning cannot predict the correct output if the test data is different from the training dataset. Training requires a lot of computation time. In supervised learning, we need enough knowledge about the classes of objects.
- 34. 2 ) Unsupervised Learning Unsupervised learning is a learning method in which a machine learns without any supervision. The training is provided to the machine with the set of data that has not been labeled, classified, or categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or a group of objects with similar patterns.
- 35. In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms: Clustering Association
- 36. EDA : Exploratory data analysis
- 37. For example, it identifies prominent features of Tom, such as pointy ears and bigger size, to understand that this image is of type 1. Similarly, it finds such features in Jerry and knows that this image is of type 2. Therefore, it classifies the images into two different classes without knowing who Tom or Jerry is.
- 38. Why Unsupervised Learning? Unsupervised machine learning finds all kinds of unknown patterns in data. Unsupervised methods help you to find features which can be useful for categorization. It can take place in real time, so input data can be analyzed and grouped as it arrives. It is easier to get unlabeled data from a computer than labeled data, which needs manual intervention.
- 40. Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
- 41. Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategy more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of association rules is Market Basket Analysis.
- 42. Unsupervised Learning algorithms: K-means clustering KNN (k-nearest neighbors) Hierarchical clustering Anomaly detection Neural Networks Principal Component Analysis Independent Component Analysis Apriori algorithm Singular Value Decomposition
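A brief clustering sketch with k-means: the algorithm receives only 2-D points (synthetic, made up for this example) and discovers the two groups itself, with no labels given:

```python
# Two visually obvious groups of 2-D points; no labels are provided.
from sklearn.cluster import KMeans

points = [[1, 1], [1, 2], [2, 1],      # one tight group
          [8, 8], [8, 9], [9, 8]]      # another tight group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
print(labels)   # points in the same group share a cluster label
```

Which group gets label 0 and which gets label 1 is arbitrary; only the grouping itself is meaningful, which is characteristic of unsupervised learning.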
- 43. Advantages of Unsupervised Learning Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labeled input data. Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to labeled data.
- 44. Disadvantages of Unsupervised Learning Unsupervised learning is intrinsically more difficult than supervised learning as it does not have corresponding output. The result of the unsupervised learning algorithm might be less accurate as input data is not labeled, and algorithms do not know the exact output in advance.
- 45. Supervised Learning Unsupervised Learning Supervised learning algorithms are trained using labeled data. Unsupervised learning algorithms are trained using unlabeled data. Supervised learning model takes direct feedback to check if it is predicting correct output or not. Unsupervised learning model does not take any feedback. Supervised learning model predicts the output. Unsupervised learning model finds the hidden patterns in data. In supervised learning, input data is provided to the model along with the output. In unsupervised learning, only input data is provided to the model. The goal of supervised learning is to train the model so that it can predict the output when it is given new data. The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
- 46. Supervised Learning Unsupervised Learning Supervised learning needs supervision to train the model. Unsupervised learning does not need any supervision to train the model. Supervised learning can be categorized into Classification and Regression problems. Unsupervised learning can be classified into Clustering and Association problems. Supervised learning can be used for cases where we know the input as well as the corresponding outputs. Unsupervised learning can be used for cases where we have only input data and no corresponding output data. A supervised learning model produces an accurate result. An unsupervised learning model may give a less accurate result compared to supervised learning. Supervised learning is not close to true Artificial Intelligence, as we first train the model on each data point, and only then can it predict the correct output. Unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things through experience.
- 47. Supervised Learning Unsupervised Learning It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. It includes various algorithms such as Clustering, KNN, and the Apriori algorithm.
- 48. 3) Reinforcement Learning Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to get the most reward points, and hence it improves its performance. Reinforcement Learning is a part of Machine Learning where an agent is put in an environment and learns to behave in this environment by performing certain actions and observing the rewards it gets from those actions.
- 50. For example, suppose you are stranded on an unknown island. You will learn how to live on the island: you will explore the environment, understand the climate conditions, the type of food that grows there, the dangers of the island, etc. This is exactly how Reinforcement Learning works. It involves an Agent (you, stuck on the island) that is put in an unknown environment (the island), where it must learn by observing and performing actions that result in rewards. Reinforcement Learning is mainly used in advanced Machine Learning areas such as self-driving cars, AlphaGo, etc. A robotic dog which automatically learns the movement of its limbs is an example of Reinforcement Learning.
- 51. Some important terms used in Reinforcement Learning: Agent: An assumed entity which performs actions in an environment to gain some reward. Environment (e): A scenario that an agent has to face. Reward (R): An immediate return given to an agent when it performs a specific action or task. State (s): The current situation returned by the environment. Policy (π): A strategy applied by the agent to decide the next action based on the current state. Value (V): The expected long-term return with discount, as compared to the short-term reward. Value Function: It specifies the value of a state, i.e. the total amount of reward an agent can expect to accumulate starting from that state. Model of the environment: This mimics the behavior of the environment. It helps you to make inferences and determine how the environment will behave. Model-based methods: Methods for solving reinforcement learning problems that use a model of the environment. Q value or action value (Q): The Q value is quite similar to the value V. The only difference between the two is that it takes an additional parameter, the current action.
- 52. Reinforcement Learning Algorithms Value-Based: In a value-based Reinforcement Learning method, you try to maximize a value function V(s). In this method, the agent expects a long-term return from the current states under policy π. Policy-based: In a policy-based RL method, you try to come up with a policy such that the action performed in every state helps you to gain maximum reward in the future. Two types of policy-based methods are: Deterministic: For any state, the same action is produced by the policy π. Stochastic: Every action has a certain probability, given by the stochastic policy π(a|s) = P[A_t = a | S_t = s].
- 53. Model-Based: In this Reinforcement Learning method, you need to create a virtual model for each environment. The agent learns to perform in that specific environment.
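The value-based idea can be illustrated with a toy Q-learning sketch on a hypothetical 1-D corridor (states 0..4, reward only on reaching state 4). The environment and hyperparameters are invented for demonstration; the update rule is the standard Q-learning one:

```python
# Toy value-based RL: tabular Q-learning on a 5-state corridor.
import random

n_states = 5
actions = [-1, +1]                      # move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, eps = 0.5, 0.9, 0.2       # learning rate, discount, exploration

random.seed(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        if random.random() < eps:
            a = random.choice(actions)                       # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])    # exploit
        s_next = min(max(s + a, 0), n_states - 1)            # walls clamp moves
        r = 1.0 if s_next == n_states - 1 else 0.0           # reward at the goal
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# Greedy policy after training: every non-terminal state should prefer "right"
policy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)}
print(policy)
```

The agent receives only rewards, never labeled examples, and the learned values back up from the goal state toward the start, which is exactly the "expected long-term return" of the slides.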
- 54. Reinforcement Learning Supervised Learning RL works by interacting with the environment. Supervised learning works on an existing dataset. The RL algorithm works like the human brain when making decisions. Supervised learning works like a human learning things under the supervision of a guide. No labeled dataset is present. A labeled dataset is present. No previous training is provided to the learning agent. Training is provided to the algorithm so that it can predict the output. RL helps to take decisions sequentially. In supervised learning, a decision is made when an input is given.
- 57. Applications of machine learning
- 58. 1. Image Recognition Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc.The popular use case of image recognition and face detection is, Automatic friend tagging suggestion: Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo with our Facebook friends, then we automatically get a tagging suggestion with name, and the technology behind this is machine learning's face detection and recognition algorithm.
- 59. 2. Speech Recognition While using Google, we get an option of "Search by voice," it comes under speech recognition, and it's a popular application of machine learning. Speech recognition is a process of converting voice instructions into text, and it is also known as "Speech to text", or "Computer speech recognition." At present, machine learning algorithms are widely used by various applications of speech recognition. Google assistant, Siri, Cortana, and Alexa are using speech recognition technology to follow the voice instructions.
- 60. 3. Traffic prediction If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts traffic conditions such as whether traffic is clear, slow-moving, or heavily congested in two ways: Real-time location of the vehicle from the Google Maps app and sensors Average time taken on past days at the same time of day Everyone who uses Google Maps is helping to make this app better.
- 61. 4. Product recommendations: Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet on the same browser, and this is because of machine learning.
- 62. 5. Self-driving cars One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in self-driving cars. Tesla, a popular car manufacturing company, is working on self-driving cars, using machine learning methods to train car models to detect people and objects while driving.
- 63. 6. Email Spam and Malware Filtering Whenever we receive a new email, it is filtered automatically as important, normal, and spam. We always receive an important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is Machine learning. Below are some spam filters used by Gmail: Content Filter Header filter General blacklists filter Rules-based filters Permission filters Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve Bayes classifier are used for email spam filtering and malware detection.
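A minimal sketch of the Naïve Bayes approach to spam filtering mentioned above, using word counts over toy messages (real filters use far richer features and vastly more data):

```python
# Toy spam filter: bag-of-words features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

mails = ["win money now", "free prize claim now", "meeting at noon",
         "project report attached", "claim your free money", "lunch tomorrow"]
labels = [1, 1, 0, 0, 1, 0]             # 1 = spam, 0 = normal

vec = CountVectorizer()
X = vec.fit_transform(mails)            # word-count feature matrix
clf = MultinomialNB().fit(X, labels)

# Classify two new, unseen messages
preds = clf.predict(vec.transform(["free money prize",
                                   "report for the meeting"]))
print(preds)
```

Words like "free" and "money" appear only in the spam examples here, so the classifier routes the first message to spam and the second to the inbox.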
- 64. 7. Virtual Personal Assistant We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As the name suggests, they help us in finding the information using our voice instruction. These assistants can help us in various ways just by our voice instructions such as Play music, call someone, Open an email, Scheduling an appointment, etc.
- 65. 8. Online Fraud Detection Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, and money being stolen in the middle of a transaction. To detect this, a feed-forward neural network can check whether a transaction is genuine or fraudulent.
- 66. 9. Stock Market trading: Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in shares, so Long Short-Term Memory (LSTM) neural networks are used for the prediction of stock market trends. 10. Medical Diagnosis: In medical science, machine learning is used for disease diagnosis. With this, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. This helps in finding brain tumors and other brain-related diseases easily.
- 67. 11. Automatic Language Translation Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps us by converting text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.
- 68. Tools for machine learning Python Python is one of the most popular programming languages of recent times. Python, created by Guido van Rossum in 1991, is an open-source, high-level, general-purpose programming language. Python is a dynamic programming language that supports object-oriented, imperative, functional, and procedural development paradigms. Python is very popular in machine learning programming. Python is one of the first programming languages that got support for machine learning via a variety of libraries and tools. Scikit-learn and TensorFlow are two popular machine learning libraries available to Python developers.
- 69. R The R language is a dynamic, array-based, object-oriented, imperative, functional, procedural, and reflective programming language. The language first appeared in 1993 but has become popular in the past few years among data scientists and machine learning developers for its functional and statistical features. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is open-source and available on r-project.org and GitHub. Currently R is managed and developed under the R Foundation and the R Development Core Team. The current version of R is 3.5.2, released on Dec 20, 2018. R is one of the most popular programming languages among data scientists and statistical engineers. R supports the Linux, OS X, and Windows operating systems.
- 70. Matlab Matlab (Matrix Laboratory) is licensed commercial software with robust support for a wide range of numerical computing. MATLAB has a huge user base across industry and academia. MATLAB is developed by MathWorks. MATLAB also provides extensive support for statistical functions and has a huge number of machine learning algorithms built in. It also has the ability to scale up for large datasets through parallel processing on clusters and in the cloud.
- 71. SAS SAS (earlier known as 'Statistical Analysis System') is another licensed commercial software package which provides strong support for machine learning functionality. Developed in C, SAS had its first release in the year 1976. SAS is a software suite comprising different components. The basic data management functionality is embedded in the Base SAS component, whereas other components like SAS/INSIGHT, Enterprise Miner, SAS/STAT, etc. provide specialized functions related to data mining and statistical analysis.
- 72. Other languages/tools Owned by IBM, SPSS (originally named Statistical Package for the Social Sciences) is a popular package supporting specialized data mining and statistical analysis. Julia is an open-source, liberally licensed programming language for numerical analysis and computational science, with the ability to implement high-performance machine learning algorithms.
- 74. Activities Gathering Data Data Preparation Data Wrangling Analyse Data Train the Model Test the Model Deployment
- 75. 1. Gathering Data The goal of this step is to identify and obtain the data related to the problem. In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data will determine the efficiency of the output. The more data there is, the more accurate the prediction will be. This step includes the tasks below: Identify various data sources Collect data Integrate the data obtained from different sources By performing the above tasks, we get a coherent set of data, also called a dataset. It will be used in further steps.
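The "integrate the data from different sources" task above can be sketched with pandas; the two tables and their columns are hypothetical, standing in for data collected from separate files or databases:

```python
# Integrating two hypothetical sources into one coherent dataset.
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "age": [25, 40, 31]})
purchases = pd.DataFrame({"id": [1, 3], "amount": [120.0, 80.0]})

# Left join keeps every customer, even those with no recorded purchase
dataset = customers.merge(purchases, on="id", how="left")
print(dataset)
```

Customer 2 has no purchase record, so its `amount` is missing after the join; handling such gaps is exactly what the later data-wrangling step addresses.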
- 76. 2. Data preparation After collecting the data, we need to prepare it for further steps. Data preparation is a step where we put our data into a suitable place and prepare it to use in our machine learning training. In this step, first, we put all data together, and then randomize the ordering of data. This step can be further divided into two processes: Data exploration: It is used to understand the nature of data that we have to work with. We need to understand the characteristics, format, and quality of data. A better understanding of data leads to an effective outcome. In this, we find Correlations, general trends, and outliers. Data pre-processing: Now the next step is preprocessing of data for its analysis.
- 77. 3. Data Wrangling Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address quality issues. The data we have collected is not always entirely of use, as some of it may not be useful. In real-world applications, collected data may have various issues, including: Missing values Duplicate data Invalid data Noise So, we use various filtering techniques to clean the data. It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
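Two of the issues listed above, duplicates and missing values, can be handled in a few lines of pandas; the column names and values are made up for illustration, and mean-imputation is just one of several possible strategies:

```python
# A small data-wrangling sketch: drop duplicate rows, then impute
# missing values with the column mean (one illustrative strategy).
import pandas as pd

raw = pd.DataFrame({
    "age":    [25, 25, None, 40],       # row 1 duplicates row 0; one value missing
    "salary": [30000, 30000, 45000, None],
})

clean = raw.drop_duplicates()           # remove the duplicate row
clean = clean.fillna(clean.mean())      # fill each gap with its column mean
print(clean)
```

After cleaning, the dataset has no duplicates and no gaps, so it is ready for the analysis step.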
- 78. 4. Data Analysis Now the cleaned and prepared data is passed on to the analysis step. This step involves: Selection of analytical techniques Building models Reviewing the result The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as Classification, Regression, Cluster Analysis, Association, etc., then build the model using the prepared data, and evaluate the model. Hence, in this step, we take the data and use machine learning algorithms to build the model.
- 79. 5. Train Model Now the next step is to train the model. In this step, we train our model to improve its performance for a better outcome of the problem. We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.
- 80. 6. Test Model Once our machine learning model has been trained on a given dataset, then we test the model. In this step, we check for the accuracy of our model by providing a test dataset to it. Testing the model determines the percentage accuracy of the model as per the requirement of project or problem.
- 81. 7. Deployment The last step of machine learning life cycle is deployment, where we deploy the model in the real-world system. If the above-prepared model is producing an accurate result as per our requirement with acceptable speed, then we deploy the model in the real system. But before deploying the project, we will check whether it is improving its performance using available data or not. The deployment phase is similar to making the final report for a project.
- 82. Basic Data types in ML Data can broadly be divided into the following two types: Qualitative data Quantitative data Qualitative data provides information about the quality of an object, or information which cannot be measured. For example, if we consider the quality of performance of students in terms of 'Good', 'Average', and 'Poor', it falls under the category of qualitative data. Qualitative data is also called categorical data. Qualitative data is divided into two parts: Nominal data Ordinal data
- 83. Nominal data Nominal data has no numeric value, but a named value. It is used for assigning named values to attributes. Nominal values cannot be quantified. For example: Blood group: A, B, O, AB, etc. Nationality: Indian, American, British, etc. Gender: Male, Female, Other We cannot do mathematical operations on nominal data, such as mean, variance, etc.
- 85. Ordinal data Ordinal data possesses the properties of nominal data but can also be naturally ordered. This means that ordinal data also assigns named values to attributes, but unlike nominal data, the values can be arranged in a sequence of increasing or decreasing value, so that we can say whether a value is better than or greater than another value. For example, Customer satisfaction: 'Very Happy', 'Happy', 'Unhappy', etc. Grades: A, B, C, etc. The median and quartiles can be identified, but the mean still cannot be.
- 87. Quantitative Data Quantitative data relates to information about the quantity of an object. Hence, it can be measured. Continuous quantitative data represents measurements, so its values cannot be counted but can be measured. For example, the attribute 'marks' can be measured. Quantitative data is also termed numeric data. There are two types of quantitative data: interval data and ratio data.
- 88. Interval Data Interval data is numeric data for which not only the order is known but also the exact difference between values. For example, Celsius temperature: the difference between 12°C and 18°C is 6°C, the same as between 15.5°C and 21.5°C. Other examples include date, time, etc. On interval data we can perform mathematical operations such as mean, median, mode, variance, SD, etc. Interval data does not have a 'true zero' value; for example, 0°C does not mean 'no temperature'.
- 90. Ratio Data Ratio data represents numeric data for which exact values can be measured, and an absolute zero exists. We can perform mathematical operations on these variables. For example, height, weight, age, salary, etc.
- 92. Quantitative Data vs. Qualitative Data: 1. Definition: quantitative data deal with quantities, values, or numbers; qualitative data, on the other hand, deal with quality. 2. Measurability: quantitative data are measurable; qualitative data are generally not. 3. Nature of data: quantitative data are expressed in numerical form; qualitative data are descriptive rather than numerical. 4. Research methodology: conclusive vs. investigative. 5. Quantities measured: quantitative data measure quantities such as length, size, amount, price, and even duration; qualitative narratives often use adjectives and other descriptive words to refer to appearance, color, texture, and other qualities. 6. Data structure: structured vs. unstructured.
- 93. Data structures for machine learning Auto MPG data set
- 95. DATA REMEDIATION Data remediation is a part of data quality. Data remediation is an activity that’s focused on cleansing, organizing and migrating data so it’s fit for purpose or use. The process typically involves detecting and correcting (or removing) corrupt or inaccurate records by replacing, modifying or deleting the “dirty” data. It can be performed manually, with cleansing tools, as a batch process (script), through data migration or a combination of these methods.
- 96. 1. Handling outliers An outlier is a piece of data that is an abnormal distance from other points. In other words, it’s data that lies outside the other values in the set. Outliers can have many causes, such as: Measurement or input error. Data corruption.
- 98. Remove outliers: if the number of outlier records is small, a simple approach may be to remove them. Imputation: another way is to impute (assign) the value with the mean, median, or mode. The value of the most similar data element may also be used for imputation. Capping: for the values that lie outside the 1.5 × IQR (interquartile range) limits, we can cap them by replacing observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile.
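The capping approach above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up sample values; the 1.5 × IQR limits flag the outliers, and the 5th/95th percentiles supply the replacement values.

```python
import numpy as np

# Hypothetical sample with two extreme values (2 and 45)
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 45, 2], dtype=float)

# 1.5 x IQR limits for detecting outliers
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap: replace values below the lower limit with the 5th percentile
# and values above the upper limit with the 95th percentile
p5, p95 = np.percentile(data, [5, 95])
capped = np.where(data < lower, p5, np.where(data > upper, p95, data))
print(capped)
```

The same idea works column-by-column on a DataFrame; only the percentile thresholds (5th/95th) are a convention and can be tuned to the problem.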
- 100. Dimensionality Reduction In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant (unnecessary). This is where dimensionality reduction algorithms come into play. Dimensionality reduction refers to reducing the number of input variables for a dataset, by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
- 101. If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input to a model to predict the target variable. Input variables are also called features. We can consider the columns of data as representing dimensions in an n-dimensional feature space and the rows of data as points in that space. This is a useful statistical view of a dataset. It is often desirable to reduce the number of input features. This reduces the number of dimensions of the feature space, hence the name "dimensionality reduction." An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification problem, where we need to classify whether the e-mail is spam or not. This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc.
- 102. The most common approach to dimensionality reduction is called principal component analysis, or PCA. It makes a large data set simpler, easier to explore, and easier to visualize. Principal Component Analysis (PCA) is one of the most popular linear dimensionality reduction techniques. Sometimes it is used alone, and sometimes as a starting solution for other dimensionality reduction methods. PCA is a projection-based method which transforms the data by projecting it onto a set of orthogonal (right-angled) axes.
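As a small sketch of PCA in scikit-learn, the synthetic data below is made up for illustration: three of the five features are linear combinations of the first two, so two principal components recover essentially all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 100 samples, 2 base features
X = rng.rand(100, 2)
# Add 3 redundant features built from the first 2 (5 columns total)
X = np.hstack([X, X @ rng.rand(2, 3)])

# Project the 5-dimensional data onto 2 orthogonal axes
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # close to all the variance
```

In practice the number of components is chosen by inspecting `explained_variance_ratio_` and keeping enough components to cover, say, 95% of the variance.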
- 104. Feature subset selection Feature subset selection, or simply feature selection, both for supervised as well as unsupervised learning, tries to find the optimal subset of the entire feature set which significantly reduces computational cost without any major impact on the learning accuracy. Feature subset selection methods can be classified into three broad categories: Filter Methods, Wrapper Methods, Embedded Methods.
- 105. Filter Methods In this method, we select subsets of variables as a pre-processing step, independently of the classifier used. It is worth noting that Variable Ranking-Feature Selection is a filter method. Filter methods are usually fast. Filter methods provide a generic selection of features, not tuned to a given learner (universal).
- 106. Wrapper Methods In wrapper methods, the learner is considered a black box. The interface of the black box is used to score subsets of variables according to the predictive power of the learner when using those subsets. Results vary for different learners. One needs to define: how to search the space of all possible variable subsets, and how to assess the prediction performance of a learner.
- 107. Embedded Methods Embedded methods are specific to a given learning machine and perform variable selection (implicitly) in the process of training. E.g. the WINNOW algorithm (a linear unit with multiplicative updates).
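A common concrete example of an embedded method (not from the slides, offered here as an illustration) is L1-regularized regression: Lasso shrinks the coefficients of uninformative features toward zero during training, and scikit-learn's `SelectFromModel` turns the surviving coefficients into a feature mask. The synthetic data and the `alpha` value are assumptions for the sketch.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Hypothetical data: 10 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# Lasso performs selection implicitly while fitting (embedded method)
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
mask = selector.get_support()          # boolean mask of kept features
print(mask)
print(selector.transform(X).shape)     # data reduced to the kept features
```

By contrast, a filter method would rank features before any model is fit, and a wrapper method would retrain the learner on many candidate subsets.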
- 108. SELECTING A MODEL
- 109. SOME IMPORTANT POINTS The input variables can be denoted by X, individual input variables by X1, X2, X3, ..., Xn, and the output variable by Y. The relationship between X and Y is represented in the general form Y = f(X) + e, where f is the target function and 'e' is a random error term. A cost function (error function) tells how badly the model is performing; a loss function is defined on a single data point, while a cost function is for the entire training data set. An objective function takes in data and a model (along with parameters) as input and returns a value; the target is to find values of the model parameters that maximize or minimize the returned value. There is no single model that works best for every machine learning problem, which is what the 'No Free Lunch' theorem states. Supervised learning solves predictive problems, and unsupervised learning solves descriptive problems.
- 110. Predictive Models Models for supervised learning, or predictive models, as is understandable from the name itself, try to predict a certain value using an input data set. The learning model attempts to establish a relation between the target feature, i.e. the feature being predicted, and the predictor features. Predictive models have a clear focus on what they want to learn and how they want to learn it. Predictive models may need to predict the value of a category or class to which a data instance belongs.
- 111. Below are examples of prediction problems: Predicting win/loss in a cricket match. Predicting whether a transaction is fraudulent. Predicting whether a customer may move to another product. The models used for prediction of target features with categorical values are known as classification models. The target feature is known as the class, and the categories into which the class is divided are called levels. Some of the popular classification models include k-Nearest Neighbour (kNN), Naïve Bayes, and Decision Tree.
- 112. Predictive models may also be used to predict numerical values of the target feature based on the predictor features. Some examples: Prediction of income growth in the succeeding year. Prediction of rainfall amount in the coming monsoon. The models used for prediction of the numerical value of the target feature of a data instance are known as regression models, e.g. Linear Regression and Logistic Regression.
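A minimal regression sketch in scikit-learn: the numbers below (years of experience vs. income) are invented purely for illustration, and the fitted line is then used to predict a numeric target for an unseen input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical example: predict income (in lakhs) from years of experience
X = np.array([[1], [2], [3], [4], [5]])      # predictor feature
y = np.array([3.0, 5.1, 6.9, 9.2, 11.0])     # numeric target feature

model = LinearRegression().fit(X, y)
print(model.predict([[6]]))   # predicted income for 6 years of experience
```

A classification model would instead return a class label (win/loss, spam/not-spam) rather than a continuous number.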
- 113. Descriptive Models Models for unsupervised learning, or descriptive models, are used to describe a data set or gain insight from a data set. There is no target feature or single feature of interest in the case of unsupervised learning. Based on the values of all features, interesting patterns or insights are derived about the data set. Descriptive models which group together similar data instances, i.e. data instances having similar values of the different features, are called clustering models.
- 114. Examples of clustering include: Customer grouping or segmentation based on social, demographic, national, etc. factors. Grouping of music based on different aspects like type, language, time period, etc. Grouping of commodities in an inventory. The most popular model for clustering is k-Means. Descriptive models related to pattern discovery are used for market basket analysis of transactional data.
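The customer segmentation example above can be sketched with scikit-learn's k-Means. The six customer rows are fabricated for illustration: no target labels are given, and the model discovers the low-spend and high-spend groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual spend, visits per month]
X = np.array([[100, 2], [120, 3], [110, 2],      # low-spend customers
              [900, 20], [950, 22], [880, 19]])  # high-spend customers

# Ask k-Means for 2 clusters; no class labels are supplied
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each customer
print(kmeans.cluster_centers_)  # centroid of each discovered group
```

The cluster numbers (0/1) are arbitrary; only the grouping itself is meaningful, which is what makes this a descriptive rather than predictive model.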
- 115. Training A Model (For Supervised Learning) Holdout Method The hold-out method splits the data into training data and test data. Typical ratios used for splitting the data set include 60:40, 80:20, etc. Then we build a classifier using the training data and test it using the test data. The hold-out method is usually used when we have thousands of instances, including several hundred instances from each class. This method is only used when we have just one model to evaluate. (Figure: the data is split into a training set, which is used to build the classifier, and a test set.)
- 116. Once evaluation is complete, all the data can be used to build the final classifier. Generally, the larger the training data, the better the classifier (though with diminishing returns). The larger the test data, the more accurate the error estimate. The accuracy we obtain from the validation set is not considered final; another hold-out dataset, the test dataset, is used to evaluate the final selected model, and the error found there is considered the generalization error.
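The 80:20 holdout split described above can be done in one call with scikit-learn. The feature rows and labels here are toy placeholders; `stratify=y` keeps the class proportions the same in both parts, which matters when classes are imbalanced.

```python
from sklearn.model_selection import train_test_split

X = list(range(10))    # hypothetical feature rows
y = [0, 1] * 5         # hypothetical class labels, two balanced classes

# 80:20 holdout split; stratify preserves class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))   # 8 2
```

A 60:40 split would simply use `test_size=0.4`.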
- 118. Classification: Train, Validation, Test Split (Figure: data with known results is split into a training set and a validation set; the classifier builder fits a model on the training set, the validation set is used to evaluate and tune it, and a final held-out test set provides the final evaluation.) The test data can't be used for parameter tuning!
- 120. What is Cross Validation? Cross-validation is a very useful technique for assessing the performance of machine learning models. It helps in knowing how a machine learning model would generalize to an independent data set. You want to use this technique to estimate how accurate your model's predictions will be in practice. When you are given a machine learning problem, you will be given two types of data sets: known data (the training data set) and unknown data (the test data set). By using cross-validation, you are "testing" your machine learning model during the "training" phase to check for overfitting and to get an idea of how your model will generalize to independent data (the data set which was not used for training), which is the test data set given in the problem.
- 121. K-fold Cross-validation method Usually, we split the data set into training and testing sets, use the training set to train the model, and use the testing set to test it. We then evaluate the model's performance based on an error metric to determine its accuracy. This method, however, is not very reliable, as the accuracy obtained for one test set can be very different from the accuracy obtained for a different test set. K-fold Cross-Validation (CV) provides a solution to this problem by dividing the data into folds and ensuring that each fold is used as a testing set at some point.
- 122. In K-Fold CV, a given data set is split into K sections/folds, where each fold is used as a testing set at some point. This is one of the best approaches when we have limited input data. Let's take the scenario of 5-fold cross-validation (K=5). Here, the data set is split into 5 folds. In the first iteration, the first fold is used to test the model and the rest are used to train it. In the second iteration, the 2nd fold is used as the testing set while the rest serve as the training set. This process is repeated until each of the 5 folds has been used as the testing set. Then take the average of your recorded scores; that will be the performance metric for the model.
- 125. Steps to perform K-fold validation Split the entire data randomly into K folds (the value of K shouldn't be too small or too high; ideally we choose 5 to 10 depending on the data size). A higher value of K leads to a less biased model, whereas a lower value of K is similar to the train-test split approach we saw before. Then fit the model using K-1 folds and validate it using the remaining Kth fold. Note down the scores/errors. Repeat this process until every fold has served as the test set. Then take the average of your recorded scores; that will be the performance metric for the model.
- 126. Example (scikit-learn):
>>> from sklearn.model_selection import KFold
>>> X = ["a", "b", "c", "d"]
>>> kf = KFold(n_splits=2)
>>> for train, test in kf.split(X):
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]
- 127. Approaches in k-fold cross validation Two approaches are commonly used: 10-fold cross-validation (10-fold CV) and Leave-one-out cross-validation (LOOCV).
- 128. 10-fold cross-validation (10-fold CV) With this method we have one data set which we divide randomly into 10 parts. We use 9 of those parts for training and reserve one tenth for testing. We repeat this procedure 10 times, each time reserving a different tenth for testing. Finally, we calculate the average of all 10 test errors and report the result.
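The whole 10-fold procedure is wrapped up by scikit-learn's `cross_val_score`. As an illustration (the Iris data set and the decision tree classifier are arbitrary choices here, not prescribed by the slides), `cv=10` runs the 10 train/test rounds and returns the 10 test accuracies.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold CV: each of the 10 folds serves once as the test set
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores)          # 10 per-fold accuracies
print(scores.mean())   # the averaged performance metric
```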
- 130. Leave One Out Cross Validation (LOOCV) We can use LOOCV when data is limited and we want the absolute best error estimate for new data. Leave-One-Out (LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set. This approach leaves 1 data point out of the training data, i.e. if there are n data points in the original sample, then n-1 samples are used to train the model and the remaining 1 point is used as the validation set.
- 131. This is repeated for all combinations in which the original sample can be separated this way, and then the error is averaged for all trials, to give overall effectiveness. The number of possible combinations is equal to the number of data points in the original sample or n.
- 133. >>> from sklearn.model_selection import LeaveOneOut
>>> X = [1, 2, 3, 4]
>>> loo = LeaveOneOut()
>>> for train, test in loo.split(X):
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
- 134. Bootstrap Sampling Bootstrap sampling, or simply bootstrapping, is a popular way to identify training and test data from an input data set. It uses Simple Random Sampling with Replacement (SRSWR), a well-known technique in sampling theory for drawing random samples. We have seen earlier that k-fold cross-validation divides the data into separate partitions, say 10 partitions in the case of 10-fold cross-validation. It then uses data instances from one partition as test data and the remaining partitions as training data.
- 135. Bootstrapping randomly picks data instances from the input data set, with the possibility of the same data instance being picked multiple times. This means that from an input data set having 'n' data instances, bootstrapping can create one or more training data sets having 'n' data instances, with some of the data instances repeated multiple times. This technique is particularly useful for input data sets of small size, i.e. having very few data instances.
- 136. Bootstrap Sampling
- 137. Example of Bootstrap sampling Let’s say we want to find the mean height of all the students in a school (which has a total population of 1,000). So, how can we perform this task? One approach is to measure the height of all the students and then compute the mean height.
- 138. Instead of measuring the heights of all the students, we can draw a random sample of 5 students and measure their heights. We would repeat this process 20 times and then average the collected height data of 100 students (5 x 20). This average height would be an estimate of the mean height of all the students of the school. This is the basic idea of Bootstrap Sampling.
- 139. Code for bootstrap sampling
# scikit-learn bootstrap
from sklearn.utils import resample
# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# prepare bootstrap sample (4 draws with replacement)
boot = resample(data, replace=True, n_samples=4, random_state=1)
print('Bootstrap Sample: %s' % boot)
# out of bag observations (those never drawn)
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)
Output:
Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
OOB Sample: [0.2, 0.3]
- 140. Cross-Validation vs. Bootstrapping: Cross-validation is a special variant of the holdout method, called repeated holdout, and uses a stratified random sampling approach (without replacement); bootstrapping uses simple random sampling with replacement (SRSWR), so the same data instance may be picked multiple times in a sample. With cross-validation, the number of possible training/test data samples that can be drawn is finite; with bootstrapping, elements can be repeated in the sample, so the number of possible training/test data samples is unlimited.
- 141. Lazy vs. Eager Learning Lazy learning: simply stores the training data (or does only minor processing) and waits until it is given a test tuple. It just stores the data set without learning from it and starts classifying only when it receives test data. So it takes less time learning and more time classifying. E.g. k-Nearest Neighbour, Case-Based Reasoning.
- 142. Eager learning: given a training set, constructs a classification model before receiving new (e.g., test) data to classify. When it receives the data set, it starts learning immediately; it does not wait for test data to learn. Classification is fast because the model has already been computed. So it takes a long time learning and less time classifying. E.g. Decision Tree, Naive Bayes, Artificial Neural Networks.
- 144. A model is said to be a good machine learning model if it generalizes to any new input data from the problem domain in a proper way. This helps us make predictions on future data that the model has never seen. Now, suppose we want to check how well our machine learning model learns and generalizes to new data. For that we have overfitting and underfitting, which are majorly responsible for the poor performance of machine learning algorithms. Bias: assumptions made by a model to make a function easier to learn. Variance: if you train your model on training data and obtain a very low error, but upon changing the data and training the same model you experience a high error, that is variance.
- 146. Underfitting A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. (It's just like trying to fit into undersized clothes!) The input features are not explanatory enough to describe the target well. Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to build a linear model with non-linear data. In such cases the rules of the machine learning model are too simple to be applied to such data, and therefore the model will probably make a lot of wrong predictions. Underfitting can be avoided by using more data and by improving the feature set through feature engineering and selection.
- 147. Underfitting – high bias and low variance. Techniques to reduce underfitting: 1. Increase model complexity. 2. Increase the number of features by performing feature engineering. 3. Remove noise from the data. 4. Increase the number of epochs or the duration of training to get better results.
- 148. Overfitting A statistical model is said to be overfitted when we train it with a lot of data (just like fitting ourselves into oversized clothes!). When a model gets trained with so much data, it starts learning from the noise and inaccurate data entries in our data set. The model then does not categorize the data correctly, because of too many details and too much noise. The causes of overfitting are non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model based on the dataset and can therefore build unrealistic models. A solution to avoid overfitting is to use a linear algorithm if we have linear data, or to use parameters like the maximal depth if we are using decision trees.
- 149. Overfitting – High variance and low bias. Techniques to reduce overfitting : 1. Increase training data. 2. Reduce model complexity. 3. Early stopping during the training phase. 4. Use dropout for neural networks to tackle overfitting.
- 151. Bias-Variance Tradeoff Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance). There is a tradeoff between a model’s ability to minimize bias and variance. If our model is too simple and has very few parameters then it may have high bias and low variance (Underfitting). On the other hand if our model has large number of parameters then it’s going to have high variance and low bias (overfitting). So we need to find the right/good balance without overfitting and underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.
- 152. Error due to Bias: the error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Underfitting results in high bias. Error due to Variance: the error due to variance is taken as the variability of a model's prediction for a given data point. Overfitting results in high variance.
- 153. Bias-Variance Trade-off
- 154. High bias, low variance: models are consistent but inaccurate on average. High bias, high variance: models are inaccurate and also inconsistent on average. Low bias, low variance: models are accurate and consistent on average; we strive for this in our model. In the figure, the best solution is to have a model with low bias as well as low variance. However, that may not be possible in reality. Hence, the goal of supervised ML is to achieve a balance between bias and variance. For example, in the popular supervised learning algorithm k-Nearest Neighbours (kNN), the user-configurable parameter 'k' can be used to trade off between bias and variance.
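The kNN trade-off mentioned above can be observed directly by varying 'k'. This sketch uses the Iris data set and cross-validated accuracy purely as an illustration; the specific k values are arbitrary. A very small k gives a flexible, low-bias but high-variance model, while a very large k averages over most of the training set and pushes toward high bias.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Small k -> low bias, high variance; large k -> high bias, low variance
for k in (1, 5, 50):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X, y, cv=5).mean()
    print(k, round(score, 3))
```

In practice the best k is the one that maximizes the cross-validated score, i.e. the balance point between the two error sources.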
- 155. Evaluating Performance of a model
- 157. Supervised learning - classification The responsibility of the classification model is to assign a class label to the target feature based on the values of the predictor features. For example, in the problem of predicting the win/loss in a cricket match, the classifier will assign a class value win/loss to the target feature based on the values of other features like whether the team won the toss, the number of spinners in the team, the number of wins in the tournament, etc. To evaluate the performance of the model, the number of correct classifications or predictions made by the model has to be recorded. A classification is said to be correct if, for example in the given problem, the model predicted that the team would win and it actually won.
- 158. Based on the number of correct and incorrect classifications or predictions made by a model, the accuracy of the model is calculated. There are 4 possibilities with regards to the cricket match win/loss prediction: The model predicted win and the team won (TP =True Positive) The model predicted win and the team lost (FP = False Positive) The model predicted loss and the team won (FN = False Negative) The model predicted loss and the team lost (TN =True Negative) True positives (TP): Predicted positive and are actually positive. False positives (FP): Predicted positive and are actually negative. False negatives (FN): Predicted negative and are actually positive. True negatives (TN): Predicted negative and are actually negative.
- 159. Accuracy For any classification model, model accuracy is given by the total number of correct classifications (either as the class of interest, i.e. True Positive, or as not the class of interest, i.e. True Negative) divided by the total number of classifications done: Model accuracy = (TP + TN) / (TP + FP + FN + TN)
- 160. Confusion Matrix A matrix containing correct and incorrect predictions in the form of TPs, FPs, FNs, and TNs is known as a confusion matrix. The win/loss prediction of a cricket match has two classes of interest: win and loss.
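The TP/FP/FN/TN counts and the accuracy formula can be computed with scikit-learn. The ten win/loss outcomes below are invented for illustration (1 = win, 0 = loss); the matrix has actual classes as rows and predicted classes as columns.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical match outcomes: 1 = win, 0 = loss
actual    = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

# Rows = actual class (loss, win); columns = predicted class (loss, win)
print(confusion_matrix(actual, predicted))        # [[TN FP] [FN TP]]
print(accuracy_score(actual, predicted))          # (TP + TN) / total = 0.8
```

Here TP = 4, TN = 4, FP = 1, FN = 1, so accuracy = (4 + 4) / 10 = 0.8, matching the formula on the previous slide.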