Data Science
Data Preprocessing
(Feature Selection and Merging)
Data Preprocessing
• Data Cleaning: handling missing values, outliers, and duplicates
• Data Integration: combining multiple datasets (merging)
• Data Transformation: scaling, normalization, categorical encoding
• Data Reduction (dimensionality reduction): selecting relevant
features/columns
Feature Selection
A feature is an attribute of the data that influences the problem or is
useful for solving it. Choosing the important features for a model is known
as feature selection, and it is often performed to remove irrelevant or
redundant features from the dataset.
Feature selection can be defined as follows: "It is a process of
automatically or manually selecting the subset of most appropriate and
relevant features to be used in model building." It is performed by either
including the important features or excluding the irrelevant ones, without
changing the features themselves.
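The two routes described above, including relevant features or excluding irrelevant ones, can be sketched with pandas. The dataset and its column names are hypothetical, and which columns count as "relevant" is simply assumed here for illustration.

```python
import pandas as pd

# Hypothetical toy dataset; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "age": [25, 41, 33, 58],
    "income": [32000, 58000, 45000, 71000],
    "favourite_colour": ["red", "blue", "green", "blue"],
    "purchased": [0, 1, 0, 1],
})

# Route 1: include the features assumed to be relevant.
selected_by_inclusion = df[["age", "income"]]

# Route 2: exclude the columns assumed to be irrelevant (and the target).
selected_by_exclusion = df.drop(
    columns=["customer_id", "favourite_colour", "purchased"]
)

# Both routes keep the same columns, and the values are left unchanged.
print(selected_by_inclusion.equals(selected_by_exclusion))
```

Either way the surviving columns are original features, not transformed ones, which is what separates selection from extraction.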
Feature Selection Techniques
• Supervised feature selection: these techniques take the target variable
into account and can be used on labeled datasets.
• Unsupervised feature selection: these techniques ignore the target
variable and can be used on unlabeled datasets.
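A minimal sketch of the difference, assuming synthetic data and two simple filters chosen for illustration: a variance filter that never looks at the target (unsupervised) and a correlation filter that does (supervised). The thresholds 1e-8 and 0.5 are arbitrary assumptions, not recommended values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)

# Synthetic features: one informative, one pure noise, one constant.
X = pd.DataFrame({
    "informative": signal,
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
# Synthetic target that depends only on the informative feature.
y = pd.Series(3.0 * signal + rng.normal(scale=0.1, size=n), name="target")

# Unsupervised: uses X only. Drop features with (almost) zero variance.
variances = X.var()
unsupervised_keep = variances[variances > 1e-8].index.tolist()

# Supervised: uses the target. Keep features strongly correlated with y.
correlations = X[unsupervised_keep].corrwith(y).abs()
supervised_keep = correlations[correlations > 0.5].index.tolist()

print("unsupervised keeps:", unsupervised_keep)
print("supervised keeps:  ", supervised_keep)
```

The variance filter cannot tell the noise column from the informative one, because that distinction only shows up once the target is consulted.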
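Common techniques for selecting relevant features
• Feature Importance
• Recursive Feature Elimination (RFE)
• Forward Selection / Backward Elimination
• Principal Component Analysis (PCA), which strictly speaking is a feature
extraction method but is often listed alongside selection techniques
• Filter Method
• Domain Knowledge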
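As one illustration of the techniques listed above, forward selection can be sketched with NumPy alone. This is a simplified version under stated assumptions: the data is synthetic, the helper names `r_squared` and `forward_selection` are made up for this example, and candidates are scored by in-sample R² of a least-squares fit.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - residual.var() / y.var()

def forward_selection(X, y, n_features):
    """Greedily add the feature that raises R^2 the most."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        # Score every candidate together with the features chosen so far.
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
# Only columns 1 and 3 influence the synthetic target.
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

chosen = forward_selection(X, y, n_features=2)
print("selected column indices:", chosen)
```

In-sample R² never decreases when a feature is added, so a real workflow would more likely score candidates on held-out data or with cross-validation. Backward elimination follows the same pattern in reverse: start with all features and repeatedly drop the one whose removal hurts the score least.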
Feature Extraction
Feature extraction transforms the original features into a new set of
features through mathematical transformations or projections.
Feature selection, by contrast, keeps a subset of the original features
based on their relevance. Both techniques are used for dimensionality
reduction, which can improve model performance and reduce overfitting.
Feature selection also preserves interpretability, because the retained
features are the original ones.
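The contrast can be sketched with NumPy, assuming synthetic data in which the third feature is a linear mix of the first two. PCA is computed here directly from a singular value decomposition as one example of a projection-based extraction method.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 3 observed features driven by 2 underlying factors.
latent = rng.normal(size=(100, 2))
X = np.column_stack([
    latent[:, 0],
    latent[:, 1],
    latent[:, 0] + latent[:, 1],  # redundant: a linear mix of the first two
])

# Feature selection: keep a subset of the ORIGINAL columns, unchanged.
X_selected = X[:, [0, 1]]

# Feature extraction (PCA via SVD): build NEW columns as projections.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_extracted = X_centered @ Vt[:2].T      # first two principal components
explained = (S ** 2) / np.sum(S ** 2)    # share of variance per component

print("selected shape: ", X_selected.shape)
print("extracted shape:", X_extracted.shape)
print("variance share of first two components:", explained[:2].sum())
```

Both results have two columns, but the selected columns are still original measurements, while each extracted column is a weighted combination of all three.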
Merging: Combining Multiple Datasets
Merging, also known as joining, is a fundamental operation in data science
in which we combine data from multiple datasets based on a common attribute
or key.
Merging is essential when the data is spread across several datasets or
when integrating data from multiple sources. When merging, it is important
to ensure that the keys are consistent across datasets (same name, data
type, and formatting) and that missing values produced by unmatched rows
are handled appropriately.
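A minimal pandas sketch of those two precautions, using hypothetical `customers` and `orders` tables. The cleaning rule (strip whitespace, upper-case) and the choice to fill missing amounts with 0.0 are assumptions made for the example; the right treatment depends on the data.

```python
import pandas as pd

# Hypothetical tables whose key column is formatted inconsistently.
customers = pd.DataFrame({
    "customer_id": [" C001", "c002", "C003 "],
    "name": ["Asha", "Ben", "Chen"],
})
orders = pd.DataFrame({
    "customer_id": ["C001", "C002", "C004"],
    "amount": [250.0, 99.5, 40.0],
})

# Make the key consistent on both sides before merging.
for frame in (customers, orders):
    frame["customer_id"] = frame["customer_id"].str.strip().str.upper()

# indicator=True adds a "_merge" column showing where each row came from.
merged = customers.merge(orders, on="customer_id", how="left", indicator=True)

# Customers without an order get NaN in "amount"; handle that explicitly.
merged["amount"] = merged["amount"].fillna(0.0)

print(merged)
```

Without the cleaning step, " C001" and "c002" would not match their counterparts in `orders`, and the merge would silently report those customers as having no orders.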
Types of Merges
The most common way to merge data is a "join". There are several types of
joins.
• Inner join: returns only the rows whose values in the common (key)
columns match in both tables.
• Left join (left outer join): returns all the rows from the left table,
not just the rows in which the key columns match. Where there is no match,
the columns from the right table are filled with missing values.
• Right join (right outer join): returns all the rows from the right table,
not just the rows in which the key columns match. Where there is no match,
the columns from the left table are filled with missing values.
• Full outer join: returns all the rows from both the left and right
tables, matched where possible and filled with missing values elsewhere.
• Cross join (Cartesian join): returns all possible combinations of rows
from the two tables, without using a key.
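The join types above map onto the `how` argument of pandas `DataFrame.merge`. The sketch below uses two small hypothetical tables that share a `key` column; `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical tables: keys 2 and 3 appear in both, 1 and 4 in only one.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = left.merge(right, on="key", how="inner")       # matching keys only
left_join = left.merge(right, on="key", how="left")    # all left rows
right_join = left.merge(right, on="key", how="right")  # all right rows
outer = left.merge(right, on="key", how="outer")       # all rows from both
cross = left.merge(right, how="cross")                 # every combination

results = {
    "inner": inner,
    "left": left_join,
    "right": right_join,
    "outer": outer,
    "cross": cross,
}
for name, result in results.items():
    print(f"{name:>5}: {len(result)} rows")
```

The row counts alone show the behaviour: the inner join keeps only the two shared keys, the outer join covers all four, and the cross join returns 3 × 3 = 9 combinations.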
