阮山松 – NGUYEN SON TUNG - F112169103
Chapter 6: Preparing Data for Machine Learning
Introduction to Data Preparation
• Importance of data collection, transformation, and normalization for machine learning.
• Converting data into vectors of numbers.
Types of Data Variables
• Two types of data variables, based on the values they can take:
• Continuous variables: take any positive or negative real number.
• Discrete variables: take only a particular value from the allowed set of possible values.
• Variables can be thought of as being on one of four scales:
• Nominal Data: categories that cannot be ordered or measured. Used for classification but not for comparison.
• Ordinal Data: values with a meaningful order but no consistent intervals. Supports ranking but lacks true numerical differences.
• Interval Data: values with equal intervals between them, but no true zero. Suitable for comparison of differences.
• Ratio Data: has all properties of interval data plus a natural zero. Allows comparison of both differences and ratios.
Converting Data into Feature Vectors
• Translate raw data into numeric vectors.
• Example: converting categorical data (like “Name”, “Gender”, ...) into a form the model can process.
• Feature extraction for capturing meaningful information from raw data.
Transforming Nominal Attributes
• Using sklearn.preprocessing and OneHotEncoder
• Example 1: one-hot encoding a single nominal column (a sketch follows below).
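The slide's code and output are screenshots; a minimal sketch of the idea with scikit-learn's OneHotEncoder, using assumed sample data:

    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    # Assumed sample data: one nominal column of colors
    data = np.array([["red"], ["green"], ["blue"], ["green"]])

    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(data).toarray()  # dense array

    print(encoder.categories_)  # [array(['blue', 'green', 'red'], ...)]
    print(encoded)
    # [[0. 0. 1.]
    #  [0. 1. 0.]
    #  [1. 0. 0.]
    #  [0. 1. 0.]]

Each category becomes its own binary column, so no spurious order is imposed on the nominal values.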
• Example 2: one-hot encoding several nominal columns at once (sketch below).
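Again a hedged sketch, this time encoding a small table in the spirit of the “Gender”-style columns mentioned earlier; the column names and values here are assumptions, not the slide's actual data:

    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    # Assumed sample table with two nominal columns
    df = pd.DataFrame({"Gender": ["F", "M", "F"],
                       "City": ["Hanoi", "Taipei", "Hanoi"]})

    encoder = OneHotEncoder()
    encoded = encoder.fit_transform(df).toarray()

    print(encoder.get_feature_names_out())
    # ['Gender_F' 'Gender_M' 'City_Hanoi' 'City_Taipei']
    print(encoded)
    # [[1. 0. 1. 0.]
    #  [0. 1. 0. 1.]
    #  [1. 0. 1. 0.]]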
Transforming Ordinal Attributes
• Using OrdinalEncoder to assign numerical values in order.
• Example 1: encoding ordered categories (sketch below).
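A minimal sketch with OrdinalEncoder; the education-level data is assumed. Passing categories explicitly makes the numeric codes follow the real-world order rather than alphabetical order:

    from sklearn.preprocessing import OrdinalEncoder
    import numpy as np

    # Assumed ordinal data: education levels with a natural order
    data = np.array([["high school"], ["bachelor"], ["master"], ["bachelor"]])

    encoder = OrdinalEncoder(categories=[["high school", "bachelor", "master"]])
    print(encoder.fit_transform(data))
    # [[0.]
    #  [1.]
    #  [2.]
    #  [1.]]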
Normalization
• Normalize data to ensure a consistent feature range, especially for algorithms sensitive to distributions or distance calculations.
• Example: rescaling samples to a common scale (sketch below).
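The slide's example and result are screenshots; as one plausible illustration, scikit-learn's normalize rescales each sample (row) to unit norm. Whether the original slide used this exact function is an assumption, as is the sample data:

    from sklearn.preprocessing import normalize
    import numpy as np

    # Assumed sample feature matrix (two samples, three features)
    X = np.array([[1.0, 2.0, 2.0],
                  [0.0, 3.0, 4.0]])

    # L2-normalize each row so every sample has unit length
    print(normalize(X, norm="l2"))
    # [[0.3333 0.6667 0.6667]
    #  [0.     0.6    0.8   ]]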
Min-Max Scaling
• Min-max scaling normalizes features to a range of 0 to 1, with the
minimum value as 0 and the maximum as 1. The transformation can
be customized.
• Example: scaling a feature to [0, 1] (sketch below).
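A minimal sketch with MinMaxScaler on assumed data; the feature_range parameter is what allows the customization the slide mentions:

    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    # Assumed sample data: one feature column
    X = np.array([[1.0], [3.0], [5.0], [9.0]])

    scaler = MinMaxScaler()  # default feature_range=(0, 1); can be customized
    print(scaler.fit_transform(X))
    # [[0.  ]
    #  [0.25]
    #  [0.5 ]
    #  [1.  ]]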
Standard Scaling
• Standard scaling transforms features by calculating z-scores, ensuring they have a mean of 0 and a standard deviation of 1.
• Example: standardizing a feature (sketch below).
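A minimal sketch with StandardScaler on assumed data:

    from sklearn.preprocessing import StandardScaler
    import numpy as np

    # Assumed sample data: one feature column
    X = np.array([[1.0], [2.0], [3.0], [4.0]])

    scaler = StandardScaler()  # z = (x - mean) / std
    Z = scaler.fit_transform(X)

    print(Z.mean(), Z.std())  # 0.0 1.0
    print(Z.ravel())
    # [-1.3416 -0.4472  0.4472  1.3416]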
Preprocessing Text
• Text data in machine learning is often unstructured.
• Steps in converting text to a format suitable for ML.
• Natural Language Processing (NLP) basics and the NLTK library (Natural Language ToolKit).
• Check the installation of NLTK:
  from nltk.tokenize import word_tokenize
• If you see errors (such as missing resources), simply run these lines:
  import nltk
  nltk.download()
Five-Step NLP Pipeline
• Five steps: Segmentation, Tokenization, Stemming, Stopword Removal, and Word Vector Creation.
• A pipeline to prepare text for machine learning.
• Each step contributes to extracting meaning from text.
1, 2. Segmentation and Tokenization
• Segmentation breaks text into sentences; tokenization breaks sentences into words.
• Helps discover underlying patterns or groupings within the data.
• Example: using word_tokenize (sketch below).
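A minimal sketch of both steps with NLTK; the slide's code is a screenshot, and the sample text here is assumed:

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt")  # tokenizer models, needed once

    text = "Data preparation matters. Clean your text first!"
    sentences = sent_tokenize(text)      # segmentation: text -> sentences
    words = word_tokenize(sentences[0])  # tokenization: sentence -> words

    print(sentences)  # ['Data preparation matters.', 'Clean your text first!']
    print(words)      # ['Data', 'preparation', 'matters', '.']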
3. Stemming and Lemmatization
• Reducing words to their root forms.
• Stemming: truncating words to a base form (e.g., “worked” → “work”).
• Example: using NLTK’s Porter Stemmer implementation (sketch below).
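A short sketch of the Porter stemmer on a few assumed words; note that stemming is a crude truncation, so the output is not always a dictionary word:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["worked", "working", "studies", "running"]:
        print(word, "->", stemmer.stem(word))
    # worked -> work
    # working -> work
    # studies -> studi
    # running -> run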
4. Removing Stopwords
• High-frequency, low-meaning words (e.g., “and”, “the”) are removed.
• Reduces dimensionality and noise.
• Example: filtering out stopwords with NLTK’s English stopword list (sketch below).
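A sketch using NLTK's English stopword list on an assumed token list:

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # needed once

    tokens = ["this", "is", "an", "example", "of", "removing", "the", "stopwords"]
    stop_words = set(stopwords.words("english"))
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['example', 'removing', 'stopwords']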
5. Preparing Word Vectors
• Text converted into vectors by counting word frequency.
• Example: CountVectorizer to represent words as features (sketch below).
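A minimal sketch with scikit-learn's CountVectorizer on an assumed two-document corpus; each column of the output counts one vocabulary word:

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["the cat sat", "the cat sat on the mat"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
    print(X.toarray())
    # [[1 0 0 1 1]
    #  [1 1 1 1 2]]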
Preprocessing Images
• Representing images as arrays of pixel values.
• Example 1: Using Matplotlib’s imread function (sketch below).
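The slide's code and result are screenshots; a minimal sketch of reading an image into a NumPy array with Matplotlib, with an assumed file path:

    import matplotlib.pyplot as plt

    img = plt.imread("sample.png")  # file path assumed

    print(type(img))   # <class 'numpy.ndarray'>
    print(img.shape)   # e.g. (height, width, 4) for an RGBA PNG
    print(img[0, 0])   # pixel values of the top-left pixel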
• Example 2: Using OpenCV and Scikit-Image (sketch below).
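A hedged sketch of the same idea with OpenCV and Scikit-Image (file path assumed); note that OpenCV loads channels in BGR order while Scikit-Image uses RGB:

    import cv2
    from skimage import io

    img_cv = cv2.imread("sample.jpg")   # BGR uint8 array
    print(img_cv.shape, img_cv.dtype)   # e.g. (h, w, 3) uint8

    img_sk = io.imread("sample.jpg")    # RGB array
    print(img_sk.shape)

    # Convert BGR to RGB when mixing the two libraries
    img_rgb = cv2.cvtColor(img_cv, cv2.COLOR_BGR2RGB)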
• Example 3: Using matplotlib.pyplot with Google Colab (sketch below).
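A sketch of displaying an image inline with matplotlib.pyplot, as one would in a Google Colab notebook (file path assumed):

    import matplotlib.pyplot as plt

    img = plt.imread("sample.png")  # file path assumed
    plt.imshow(img)                 # renders inline in Colab/Jupyter
    plt.axis("off")
    plt.show()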
Personal Reflections on this chapter
Data preparation stands as the cornerstone of successful machine learning, demanding significant time and expertise. This process encompasses understanding data types, handling missing values, implementing proper encoding techniques, and performing normalization to prevent bias. The effective use of essential libraries and skillful feature engineering ultimately leads to more robust and accurate models.
