阮山松 – NGUYEN SON TUNG - F112169103
Chapter 6: Preparing Data for Machine Learning
Introduction to Data Preparation
• Importance of data collection, transformation, and normalization for machine learning.
• Converting data into vectors of numbers.
Types of Data Variables
• Two types of data variables, based on the values they can take:
• Continuous variables: take any positive or negative real number.
• Discrete variables: take only a particular value from the allowed set of possible values.
• Variables can be thought of as being on one of four scales:
• Nominal Data: categories that cannot be ordered or measured. Used for classification but not for comparison.
• Ordinal Data: values with a meaningful order but no consistent intervals. Supports ranking but lacks true numerical differences.
• Interval Data: values with equal intervals between them, but no true zero. Suitable for comparison of differences.
• Ratio Data: has all properties of interval data plus a natural zero. Allows comparison of both differences and ratios.
Converting Data into Feature Vectors
• Translate raw data into numeric vectors.
• Example: converting categorical data (like “Name”, “Gender”, ...) into a form the model can process.
• Feature extraction for capturing meaningful information from raw data.
Transforming Nominal Attributes
• Using sklearn.preprocessing and OneHotEncoder
• Example 1: one-hot encoding a single nominal column (a sketch follows below).
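The slide's code and output are screenshots; a minimal sketch of the idea with scikit-learn's OneHotEncoder, using assumed sample data:

    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    # Assumed sample data: one nominal column of colors
    data = np.array([["red"], ["green"], ["blue"], ["green"]])

    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(data).toarray()  # dense array

    print(encoder.categories_)  # [array(['blue', 'green', 'red'], ...)]
    print(encoded)
    # [[0. 0. 1.]
    #  [0. 1. 0.]
    #  [1. 0. 0.]
    #  [0. 1. 0.]]

Each category becomes its own binary column, so no spurious order is imposed on the nominal values.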
• Example 2: one-hot encoding several nominal columns at once (sketch below).
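Again a hedged sketch, this time encoding a small table in the spirit of the “Gender”-style columns mentioned earlier; the column names and values here are assumptions, not the slide's actual data:

    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    # Assumed sample table with two nominal columns
    df = pd.DataFrame({"Gender": ["F", "M", "F"],
                       "City": ["Hanoi", "Taipei", "Hanoi"]})

    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(df).toarray()

    print(encoder.get_feature_names_out())
    # ['Gender_F' 'Gender_M' 'City_Hanoi' 'City_Taipei']
    print(encoded)
    # [[1. 0. 1. 0.]
    #  [0. 1. 0. 1.]
    #  [1. 0. 1. 0.]]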
Transforming Ordinal Attributes
• Using OrdinalEncoder to assign numerical values in order.
• Example 1: encoding ordered categories (sketch below).
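A minimal sketch with OrdinalEncoder; the education-level data is assumed. Passing categories explicitly makes the numeric codes follow the real-world order rather than alphabetical order:

    from sklearn.preprocessing import OrdinalEncoder
    import numpy as np

    # Assumed ordinal data: education levels with a natural order
    data = np.array([["high school"], ["bachelor"], ["master"], ["bachelor"]])

    encoder = OrdinalEncoder(categories=[["high school", "bachelor", "master"]])
    print(encoder.fit_transform(data))
    # [[0.]
    #  [1.]
    #  [2.]
    #  [1.]]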
Normalization
• Normalize data to ensure a consistent feature range, especially for algorithms sensitive to distributions or distance calculations.
• Example: rescaling samples to a common scale (sketch below).
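The slide's example and result are screenshots; as one plausible illustration, scikit-learn's normalize rescales each sample (row) to unit norm. Whether the original slide used this exact function is an assumption, as is the sample data:

    from sklearn.preprocessing import normalize
    import numpy as np

    # Assumed sample feature matrix (two samples, three features)
    X = np.array([[1.0, 2.0, 2.0],
                  [0.0, 3.0, 4.0]])

    # L2-normalize each row so every sample has unit length
    print(normalize(X, norm="l2"))
    # [[0.3333 0.6667 0.6667]
    #  [0.     0.6    0.8   ]]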
Min-Max Scaling
• Min-max scaling normalizes features to a range of 0 to 1, with the
minimum value as 0 and the maximum as 1. The transformation can
be customized.
• Example: scaling a feature to [0, 1] (sketch below).
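A minimal sketch with MinMaxScaler on assumed data; the feature_range parameter is what allows the customization the slide mentions:

    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    # Assumed sample data: one feature column
    X = np.array([[1.0], [3.0], [5.0], [9.0]])

    scaler = MinMaxScaler()  # default feature_range=(0, 1); can be customized
    print(scaler.fit_transform(X))
    # [[0.  ]
    #  [0.25]
    #  [0.5 ]
    #  [1.  ]]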
Standard Scaling
• Standard scaling transforms features by calculating z-scores, ensuring they have a mean of 0 and a standard deviation of 1.
• Example: standardizing a feature (sketch below).
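A minimal sketch with StandardScaler on assumed data:

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    # Assumed sample data: one feature column
    X = np.array([[1.0], [2.0], [3.0], [4.0]])

    scaler = StandardScaler()  # z = (x - mean) / std
    Z = scaler.fit_transform(X)

    print(Z.mean(), Z.std())  # 0.0 1.0
    print(Z.ravel())
    # [-1.3416 -0.4472  0.4472  1.3416]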
Preprocessing Text
• Text data in machine learning is often unstructured.
• Steps in converting text to a format suitable for ML.
• Natural Language Processing (NLP) basics and the NLTK library (Natural Language ToolKit).
• Check the installation of NLTK:
  from nltk.tokenize import word_tokenize
• If you see errors (such as missing resources), simply run these lines:
  import nltk
  nltk.download()
Five-Step NLP Pipeline
• Five steps: Segmentation, Tokenization, Stemming, Stopword Removal, and Word Vector Creation.
• A pipeline to prepare text for machine learning.
• Each step contributes to extracting meaning from text.
1, 2. Segmentation and Tokenization
• Segmentation breaks text into sentences; tokenization breaks sentences into words.
• Helps discover underlying patterns or groupings within the data.
• Example: using word_tokenize (sketch below).
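A minimal sketch of both steps with NLTK; the slide's code is a screenshot, and the sample text here is assumed:

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt")  # tokenizer models, needed once

    text = "Data preparation matters. Clean your text first!"
    sentences = sent_tokenize(text)      # segmentation: text -> sentences
    words = word_tokenize(sentences[0])  # tokenization: sentence -> words

    print(sentences)  # ['Data preparation matters.', 'Clean your text first!']
    print(words)      # ['Data', 'preparation', 'matters', '.']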
3. Stemming and Lemmatization
• Reducing words to their root forms.
• Stemming: truncating words to a base form (e.g., “worked” → “work”).
• Example: using NLTK’s Porter Stemmer implementation (sketch below).
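A short sketch of the Porter stemmer on a few assumed words; note that stemming is a crude truncation, so the output is not always a dictionary word:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["worked", "working", "studies", "running"]:
        print(word, "->", stemmer.stem(word))
    # worked -> work
    # working -> work
    # studies -> studi
    # running -> run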
4. Removing Stopwords
• High-frequency, low-meaning words (e.g., “and”, “the”) are removed.
• Reduces dimensionality and noise.
• Example: filtering out stopwords with NLTK’s English stopword list (sketch below).
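A sketch using NLTK's English stopword list on an assumed token list:

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # needed once

    tokens = ["this", "is", "an", "example", "of", "removing", "the", "stopwords"]
    stop_words = set(stopwords.words("english"))
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['example', 'removing', 'stopwords']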
5. Preparing Word Vectors
• Text converted into vectors by counting word frequency.
• Example: CountVectorizer to represent words as features (sketch below).
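A minimal sketch with scikit-learn's CountVectorizer on an assumed two-document corpus; each column of the output counts one vocabulary word:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["the cat sat", "the cat sat on the mat"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
    print(X.toarray())
    # [[1 0 0 1 1]
    #  [1 1 1 1 2]]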
Preprocessing Images
• Representing images as arrays of pixel values.
• Example 1: Using Matplotlib’s imread function (sketch below).
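The slide's code and result are screenshots; a minimal sketch of reading an image into a NumPy array with Matplotlib, with an assumed file path:

    import matplotlib.pyplot as plt

    img = plt.imread("sample.png")  # file path assumed

    print(type(img))   # <class 'numpy.ndarray'>
    print(img.shape)   # e.g. (height, width, 4) for an RGBA PNG
    print(img[0, 0])   # pixel values of the top-left pixel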
• Example 2: Using OpenCV and Scikit-Image (sketch below).
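A hedged sketch of the same idea with OpenCV and Scikit-Image (file path assumed); note that OpenCV loads channels in BGR order while Scikit-Image uses RGB:

    import cv2
    from skimage import io

    img_cv = cv2.imread("sample.jpg")   # BGR uint8 array
    print(img_cv.shape, img_cv.dtype)   # e.g. (h, w, 3) uint8

    img_sk = io.imread("sample.jpg")    # RGB array
    print(img_sk.shape)

    # Convert BGR to RGB when mixing the two libraries
    img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)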
• Example 3: Using matplotlib.pyplot with Google Colab (sketch below).
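A sketch of displaying an image inline with matplotlib.pyplot, as one would in a Google Colab notebook (file path assumed):

    import matplotlib.pyplot as plt

    img = plt.imread("sample.png")  # file path assumed
    plt.imshow(img)                 # renders inline in Colab/Jupyter
    plt.axis("off")
    plt.show()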
Personal Reflections on this chapter
Data preparation stands as the cornerstone of successful machine learning, demanding significant time and expertise. This process encompasses understanding data types, handling missing values, implementing proper encoding techniques, and performing normalization to prevent bias. The effective use of essential libraries and skillful feature engineering ultimately leads to more robust and accurate models.
