Data Preprocessing
Before a machine learning algorithm can be deployed, five steps are typically followed:
1. Data collection: The data collection step involves gathering the learning
material an algorithm will use to generate actionable knowledge. In most
cases, the data will need to be combined into a single source like a text file,
spreadsheet, or database.
2. Data exploration and preparation: The quality of any machine learning
project is based largely on the quality of its input data. Thus, it is important
to learn more about the data and its nuances during a practice called data
exploration. Additional work is required to prepare the data for the learning
process. This involves fixing or cleaning so-called "messy" data, eliminating
unnecessary data, and recoding the data to conform to the learner's
expected inputs.
3. Model training: By the time the data has been prepared for analysis,
you are likely to have a sense of what you are capable of learning from the
data. The specific machine learning task chosen will inform the selection of
an appropriate algorithm, and the algorithm will represent the data in the
form of a model.
4. Model evaluation: Because each machine learning model results in a
biased solution to the learning problem, it is important to evaluate how well
the algorithm learns from its experience. Depending on the type of model
used, you might be able to evaluate the accuracy of the model using a test
dataset or you may need to develop measures of performance specific to
the intended application.
5. Model improvement: If better performance is needed, it becomes
necessary to utilize more advanced strategies to augment the performance
of the model. Sometimes it may be necessary to switch to a different type
of model altogether. You may need to supplement your data with additional
data, or perform additional preparatory work as in step two of this process.
Managing and Understanding Data
• R data structure:
1) Vector
2) Factor
3) Matrix
4) Array
5) List
6) Data Frame
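A minimal sketch creating one instance of each of the six structures (the object names and values are illustrative):

```r
v  <- c(1.5, 2.5, 3.5)                      # vector: ordered values of one type
f  <- factor(c("low", "high", "low"))       # factor: categorical data with levels
m  <- matrix(1:6, nrow = 2)                 # matrix: two-dimensional, single type
a  <- array(1:24, dim = c(2, 3, 4))         # array: n-dimensional generalization
l  <- list(name = "Ann", scores = v)        # list: elements of mixed types
df <- data.frame(id = 1:3, value = v)       # data frame: equal-length columns
```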
Managing data with R
Saving, loading, and removing R data structures
• 1) save(x, y, z, file = "mydata.RData")
• 2) load("mydata.RData")
• 3) ls() : lists all of the objects currently in the workspace
• 4) rm(m, subject1) : the rm() function removes the listed objects from
the workspace
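The four commands above fit together as a save-remove-restore workflow; a short sketch (the objects `x` and `y` and the filename follow the slide's examples):

```r
x <- 1:5
y <- "hello"
save(x, y, file = "mydata.RData")   # write both objects to a single file
rm(x, y)                            # remove them from the workspace
load("mydata.RData")                # restore x and y from disk
ls()                                # x and y are back in the workspace listing
```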
Importing and saving data from CSV files
• 1) pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)
• 2) mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE, header
= FALSE)
• 3) write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)
Exploring and understanding data
• 1) usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)
Exploring the structure of data
• 1) str(usedcars)
Exploring numeric variables
• summary(usedcars$year)
• summary(usedcars[c("price", "mileage")])
Measuring the central tendency – mean and median
• mean(c(36000, 44000, 56000))
• median(c(36000, 44000, 56000))
Measuring spread – quartiles and the five-number summary
• The five-number summary is a set of five statistics that roughly depict
the spread of a feature's values. All five of the statistics are included in
the output of the summary() function.
• Written in order, they are:
1. Minimum (Min.)
2. First quartile, or Q1 (1st Qu.)
3. Median, or Q2 (Median)
4. Third quartile, or Q3 (3rd Qu.)
5. Maximum (Max.)
• range(usedcars$price)
• diff(range(usedcars$price))
• IQR(usedcars$price) : The difference between Q1 and Q3 is known as
the Interquartile Range (IQR)
• quantile(usedcars$price)
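The spread measures above can be checked on a small illustrative vector (these prices are made up, not from the usedcars data):

```r
prices <- c(36000, 38000, 44000, 49000, 56000)   # illustrative values

summary(prices)        # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
range(prices)          # vector of (minimum, maximum)
diff(range(prices))    # total spread: maximum minus minimum
IQR(prices)            # interquartile range: Q3 minus Q1
quantile(prices)       # the five-number summary as a named vector
```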
Exploring categorical variables
• table(usedcars$year)
• color_table <- table(usedcars$color)
• color_pct <- prop.table(color_table) * 100
• round(color_pct, digits = 1)
Data Preprocessing
• Data preprocessing is the process of preparing raw data and making
it suitable for a model.
• When starting a project, the data we encounter is rarely already
clean and well formatted.
• Before performing any operation on the data, it must be cleaned and
put into a consistent format. This is the role of the data
preprocessing task.
Why do we need Data Preprocessing?
• Real-world data generally contains noise and missing values, and may
be in an unusable format that cannot be fed directly to
machine learning models.
• Data preprocessing comprises the tasks required to clean the data and
make it suitable for a machine learning model, which also
increases the model's accuracy and efficiency.
Steps
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
Getting the dataset
• To create a machine learning model, the first thing we require is a
dataset, since a machine learning model works entirely on data. The
data collected for a particular problem and stored in a proper format
is known as the dataset.
Importing Libraries
• Libraries are used to perform specific jobs. For data preprocessing
in R, commonly used packages include corpus, tm, etc.
Importing the Datasets
• Next, we need to import the datasets collected for our machine
learning project. But before importing a dataset, we need to set the
directory containing it as the working directory.
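A sketch of setting the working directory and importing a dataset from it. Here a stand-in CSV is written first so the example is self-contained; in practice you would point `setwd()` at the folder that already holds your file (the filename `dataset.csv` and its columns are illustrative):

```r
setwd(tempdir())    # replace with the folder that actually holds your CSV

# Create a tiny stand-in file so the import below has something to read
write.csv(data.frame(id = 1:3, score = c(10, 20, 30)),
          file = "dataset.csv", row.names = FALSE)

dataset <- read.csv("dataset.csv", stringsAsFactors = FALSE)
str(dataset)        # inspect the imported structure
```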
Handling Missing data:
• The next step of data preprocessing is to handle missing data in the datasets. If our
dataset contains some missing data, then it may create a huge problem for our machine
learning model. Hence it is necessary to handle missing values present in the dataset.
• Ways to handle missing data:
– There are mainly two ways to handle missing data, which are:
• By deleting the particular row: in this approach, we simply delete the
specific rows or columns that contain null values. This is not very
efficient, and removing data may lose information, which reduces accuracy.
• By imputing the mean: in this approach, we calculate the mean of the column
or row that contains missing values and substitute it for each missing
value. This strategy is useful for features with numeric data such as
age, salary, year, etc. Here, we will use this approach.
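Both strategies can be sketched in a few lines of base R (the data frame and its values are illustrative):

```r
df <- data.frame(age = c(25, NA, 31), salary = c(50000, 60000, NA))

# Option 1: delete every row that contains any missing value
complete <- na.omit(df)

# Option 2: mean imputation - replace each NA with the column mean
df$age[is.na(df$age)]       <- mean(df$age, na.rm = TRUE)
df$salary[is.na(df$salary)] <- mean(df$salary, na.rm = TRUE)
```

Option 1 keeps only the fully observed rows, so information is lost; option 2 preserves every row but flattens the variance of the imputed columns.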
Encoding Categorical data:
• Categorical data is data that takes values from a limited set of
categories, such as Country or Purchased.
• Since machine learning models work entirely on mathematics and
numbers, a categorical variable in the dataset may create trouble
while building the model. So it is necessary to encode these
categorical variables into numbers.
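Two common encodings can be sketched in base R: label encoding via `factor()` integer codes, and one-hot (dummy) encoding via `model.matrix()`. The `country` values are illustrative:

```r
country <- c("France", "Spain", "Germany", "Spain")

# Label encoding: each level is mapped to an integer code
# (levels are sorted alphabetically: France=1, Germany=2, Spain=3)
country_f <- factor(country)
codes <- as.integer(country_f)

# One-hot encoding: one 0/1 indicator column per level
# ("- 1" drops the intercept so every level gets its own column)
one_hot <- model.matrix(~ country_f - 1)
```

Label encoding imposes an artificial order on the categories, so one-hot encoding is usually safer for nominal variables.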
Splitting the Dataset into the Training set and Test set
• In machine learning data preprocessing, we divide our dataset into
a training set and test set. This is one of the crucial steps of data
preprocessing as by doing this, we can enhance the performance of
our machine learning model.
• Suppose we train our machine learning model on one dataset and then
test it on a completely different dataset. The model will then struggle
to recognize the correlations it has learned, because the two datasets
differ.
• If we train our model very well, so that its training accuracy is very
high, but its performance drops when we provide it a new dataset, the
model has overfit the training data.
• So we always try to build a machine learning model that performs well
on the training set and also on the test set. Here, we can
define these datasets as:
• Training set: a subset of the dataset used to train the machine learning
model; its outputs are already known.
• Test set: a subset of the dataset used to test the machine learning model;
the model predicts the outputs for this set.
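The split described above can be sketched with base R's `sample()`; the built-in `iris` dataset stands in for a real project dataset, and the 80/20 ratio is a common but arbitrary choice:

```r
set.seed(42)                                   # reproducible split
n <- nrow(iris)                                # iris ships with base R (150 rows)
train_idx <- sample(n, size = round(0.8 * n))  # random 80% of row indices

train_set <- iris[train_idx, ]                 # 120 rows for training
test_set  <- iris[-train_idx, ]                # remaining 30 rows for testing
```

Shuffling via `sample()` matters: taking the first 80% of rows instead would bias the split if the file is ordered (iris, for example, is sorted by species).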
Feature Scaling
• Feature scaling is the final step of data preprocessing in machine
learning.
• It is a technique to standardize the independent variables of the
dataset within a specific range.
• In feature scaling, we put our variables on the same range and the
same scale so that no one variable dominates the others.
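Two standard scaling techniques, sketched on an illustrative vector: min-max normalization (rescales to [0, 1]) and z-score standardization via base R's `scale()` (mean 0, standard deviation 1):

```r
x <- c(10, 20, 30, 40, 50)

# Min-max normalization: maps the smallest value to 0 and the largest to 1
x_minmax <- (x - min(x)) / (max(x) - min(x))

# Z-score standardization: subtracts the mean, divides by the standard deviation
x_z <- as.numeric(scale(x))
```

Min-max keeps values in a fixed range but is sensitive to outliers; z-scores have no fixed range but are more robust to a few extreme values.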
