Roadmap to Membership of RICS - Pathways and Routes
CS8080_IRT_UNIT - III T8 FEATURE SELECTION OR DIMENSIONALITY REDUCTION.pdf
1. P1WU
UNIT – III: CLASSIFICATION
Topic 8 FEATURE SELECTION OR
DIMENSIONALITY REDUCTION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
2. UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality
Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
3. FEATURE SELECTION OR DIMENSIONALITY REDUCTION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
4. FEATURE SELECTION OR DIMENSIONALITY REDUCTION
• Feature selection and dimensionality reduction allow us to
• minimize the number of features in a dataset by only keeping features that are
important.
• In other words, we want to retain features that contain the
most useful information that is needed by our model to
• make accurate predictions while discarding redundant features that contain
little to no information.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
5. FEATURE SELECTION OR DIMENSIONALITY REDUCTION
• There are several benefits in performing feature selection and
dimensionality reduction which include:
• model interpretability,
• minimizing overfitting
as well as
• reducing the size of the training set and consequently training time.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
6. Dimensionality Reduction
• The number of input variables or features for a dataset is referred to
as its dimensionality.
• Dimensionality reduction refers to techniques that reduce the number
of input variables in a dataset.
• More input features often make a predictive modeling task more
challenging to model, more generally referred to as the curse of
dimensionality.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
7. Dimensionality Reduction
• High-dimensionality statistics and dimensionality reduction
techniques are often used for data visualization.
• Nevertheless these techniques can be used in applied machine
learning to
• simplify a classification or regression dataset in order to better fit a predictive
model.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
8. Problem With Many Input Variables
• If your data is represented using rows and columns, such as in a
spreadsheet, then
• the input variables are the columns that are fed as input to a model to predict
the target variable. Input variables are also called features.
• We can consider the columns of data representing dimensions on an
n-dimensional feature space and the rows of data as points in that
space.
• This is a useful geometric interpretation of a dataset.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
9. Problem With Many Input Variables
• Having a large number of dimensions in the feature space can mean that
the volume of that space is very large, and in turn,
• the points that we have in that space (rows of data) often represent a small and non-
representative sample.
• This can dramatically impact the performance of machine learning
algorithms fit on data with many input features, generally referred to as the
“curse of dimensionality.”
• Therefore, it is often desirable to reduce the number of input features. This
reduces the number of dimensions of the feature space, hence the name
“dimensionality reduction.”
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
10. Dimensionality Reduction
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
11. Dimensionality Reduction
• Dimensionality reduction refers to techniques for reducing the
number of input variables in training data.
• When dealing with high dimensional data, it is often useful to reduce
the dimensionality by projecting the data to a lower dimensional
subspace which captures the “essence” of the data. This is called
dimensionality reduction.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
12. Dimensionality Reduction
• Fewer input dimensions often mean correspondingly fewer
parameters or a simpler structure in the machine learning model,
referred to as degrees of freedom.
• A model with too many degrees of freedom is likely to overfit the
training dataset and therefore may not perform well on new data.
• It is desirable to have simple models that generalize well, and in turn,
input data with few input variables.
• This is particularly true for linear models where the number of inputs
and the degrees of freedom of the model are often closely related.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
13. Techniques for Dimensionality Reduction
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
14. Techniques for Dimensionality Reduction
• There are many techniques that can be used for dimensionality
reduction.
• Feature Selection Methods
• Matrix Factorization
• Manifold Learning
• Auto encoder Methods
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
15. Feature Selection Methods
• Feature selection is also called variable selection or attribute
selection.
• It is the automatic selection of attributes in your data (such as
columns in tabular data) that are most relevant to the predictive
modeling problem you are working on.
• feature selection…
• is the process of selecting a subset of relevant features for use in model
construction
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
16. Feature Selection Methods
• Feature selection is different from dimensionality reduction.
• Both methods seek to reduce the number of attributes in the dataset,
• but a dimensionality reduction method do so by creating new combinations of
attributes,
• where as feature selection methods include and exclude attributes present in
the data without changing them.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
17. Feature Selection Methods
• Examples of dimensionality reduction methods include
• Principal Component Analysis,
• Singular Value Decomposition and
• Sammon’s Mapping.
• Feature selection is itself useful, but it mostly acts as a filter, muting
out features that aren’t useful in addition to your existing features.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
18. Feature Selection Algorithms
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
19. Filter Methods
• Filter feature selection methods apply a statistical measure to assign a
scoring to each feature.
• The features are ranked by the score and either selected to be kept or
removed from the dataset.
• The methods are often univariate and consider the feature
independently, or with regard to the dependent variable.
• Some examples of some filter methods include the Chi squared test,
information gain and correlation coefficient scores.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
20. Wrapper Methods
• Wrapper methods consider the selection of a set of features as a search problem,
where different combinations are prepared, evaluated and compared to other
combinations.
• A predictive model us used to evaluate a combination of features and assign a
score based on model accuracy.
• The search process may be
• methodical such as a best-first search,
• it may stochastic such as a random hill-climbing algorithm, or
• it may use heuristics, like forward and backward passes to add and remove features.
• An example if a wrapper method is the recursive feature elimination algorithm.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
21. Embedded Methods
• Embedded methods learn which features best contribute to the accuracy of the
model while the model is being created.
• The most common type of embedded feature selection methods are
• regularization methods.
• Regularization methods are also called penalization methods that introduce
• additional constraints into the optimization of a predictive algorithm (such as a regression
algorithm) that bias the model toward lower complexity (fewer coefficients).
• Examples of regularization algorithms are the
• LASSO, Elastic Net and Ridge Regression.
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
22. Any Questions?
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES