What is Data? Data Preprocessing Data Preprocessing Steps Data Visualization
Data Preprocessing
Manash Kumar Mondal
Department of Computer Science and Engineering
University of Kalyani
November, 2024
Manash Kumar Mondal Department of Computer Science and Engineering University of Kalyani
Data Preprocessing 1 / 39
1 What is Data?
2 Data Preprocessing
3 Data Preprocessing Steps
4 Data Visualization
What is Data?
Data refers to raw facts, figures, or observations that can be
processed and analyzed to extract meaningful information.
• It can be numbers, text, images, or sound.
• Data can be structured (in databases) or unstructured (text,
images, etc.).
Example:
• A list of temperatures over a week: 25, 30, 28, 31, 29.
What is Data Preprocessing?
Data preprocessing is the process of preparing raw data for analysis
and model training by cleaning, organizing, and transforming it
into a more suitable format. It typically includes:
• Identifying and correcting errors: detecting and removing
inaccurate, incomplete, or irrelevant data
• Addressing data quality issues: handling missing values,
noise, inconsistencies, and outliers
• Extracting features: deriving informative features from raw
inputs, e.g., from images
• Establishing standards: defining standards and best
practices for preparing data
1 What is Data?
2 Data Preprocessing
3 Data Preprocessing Steps
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Splitting
4 Data Visualization
Data Preprocessing Steps
Figure 1: Data Preprocessing Steps
Why Preprocess the Data?
Data preprocessing involves transforming raw data into a format
suitable for analysis.
• Why?
  • Improve the accuracy of models.
  • Handle missing or inconsistent data.
  • Make the data easier to work with.
• What does it involve?
  • Data cleaning
  • Data integration
  • Data transformation
Example:
• A survey where some respondents skipped questions.
  Preprocessing handles the resulting missing values.
Data Cleaning
Data cleaning involves removing errors, inconsistencies, and
irrelevant data.
• Handle missing values
• Correct inconsistencies
• Remove duplicates
Example:
• Replace missing values in a dataset with the mean or median.
data = data.fillna(data.mean(numeric_only=True))
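The choice between mean and median matters when a column contains extreme values. The following is a minimal sketch with pandas; the survey table, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical survey data; None marks a skipped answer.
data = pd.DataFrame({
    "age": [25, 30, None, 35, 110],
    "income": [40000, None, 52000, 48000, 45000],
    "city": ["Kolkata", "Kalyani", None, "Kolkata", "Kalyani"],
})

# Only numeric columns have a meaningful mean or median.
numeric_cols = data.select_dtypes(include="number").columns

# Mean imputation: simple, but pulled upward by the extreme age of 110.
mean_filled = data.copy()
mean_filled[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# Median imputation: more robust when a column contains extreme values.
median_filled = data.copy()
median_filled[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].median())

print(mean_filled["age"].tolist())
print(median_filled["age"].tolist())
```

The text column is left untouched here; a non-numeric attribute needs a different strategy, such as a constant label or the most frequent value.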
Data Cleaning
Data cleaning removes the errors and mistakes commonly found in a
data set. Three kinds of error are especially common:
• Outliers: data points that lie far outside the range of the
other observations.
• Missing data: values that are absent for some records.
• Erroneous data: incorrect or invalid data points.
Outliers
• An outlier is a data point that lies far from the other
observations in a dataset.
• Put differently, an outlier behaves differently from the
rest of the data.
Figure 2: Outlier
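One common way to make "distant from the other observations" concrete is the interquartile-range (IQR) rule. The sketch below is one possible rule, not the only one; the temperature values and the conventional 1.5 multiplier are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily temperatures; 95 is a likely recording error.
temps = pd.Series([25, 30, 28, 31, 29, 95, 27])

# Quartiles and interquartile range.
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = temps[(temps < lower) | (temps > upper)]
cleaned = temps[(temps >= lower) & (temps <= upper)]

print(outliers.tolist())
```

Whether a flagged value should be removed, corrected, or kept depends on the domain; the rule only points at candidates.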
Missing data
What do the N/A values in Figure 3 indicate?
They are the missing values in the data set. We can handle them
in several ways:
• Eliminate the rows that contain missing values (ignore the
tuple)
• Fill in the missing value manually
• Use a global constant
• Use the attribute mean
• Use the most probable value
Figure 3: Missing data
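Several of the strategies listed above map directly onto pandas calls. The sketch below uses a made-up table, and it approximates "the most probable value" by the most frequent value (the mode), which is only one possible reading.

```python
import pandas as pd

# Hypothetical table with gaps.
df = pd.DataFrame({
    "name": ["Asha", "Bimal", "Chitra", "Dev"],
    "age": [22.0, None, 25.0, 22.0],
    "dept": ["CSE", "CSE", None, "ECE"],
})

# Ignore the tuple: drop every row that has a missing value.
dropped = df.dropna()

# Global constant: mark gaps in one column with a sentinel label.
constant = df.fillna({"dept": "Unknown"})

# Most probable value, approximated here by the most frequent one.
mode_dept = df["dept"].fillna(df["dept"].mode().iloc[0])

print(len(dropped), constant["dept"].tolist(), mode_dept.tolist())
```

Dropping rows is the simplest option but discards the valid values in those rows, so it is usually reserved for cases where few rows are affected.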
Erroneous data
Erroneous data is data that is inconsistent, illogical, contradictory,
or out of range. It can also be data that a program cannot process
or should not accept.
Typical cases:
• Incorrect values
• Values outside the boundary tolerance
• Values of an incorrect data type
• Values containing invalid characters
Figure 4: Erroneous data generally rejected
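The cases above can be turned into simple validation checks. This is a sketch under stated assumptions: the `age` column, its string values, and the accepted range of 0 to 120 are all hypothetical.

```python
import pandas as pd

# Hypothetical raw input: ages arrive as strings from a form.
raw = pd.DataFrame({"age": ["34", "-5", "abc", "210", "41"]})

# Incorrect data type or invalid characters: non-numeric text becomes NaN.
age = pd.to_numeric(raw["age"], errors="coerce")

# Outside boundary tolerance: the 0-120 range is an assumed rule.
# NaN never falls inside the range, so unparseable values are rejected too.
in_range = age.between(0, 120)

valid = raw[in_range]
rejected = raw[~in_range]

print(valid["age"].tolist(), rejected["age"].tolist())
```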
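A Few Important Terms
• Discrepancy detection: finding values that conflict with what
is expected; common causes are human error, data decay, and
deliberate errors
• Metadata: data about the data (e.g., attribute types and
permitted ranges), used as a reference when checking values
• Unique rule: each value of a given attribute must differ from
every other value of that attribute
• Null rule: specifies which markers (blanks, question marks,
etc.) denote a missing value and how such values are handled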
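The unique rule and the null rule listed above can be checked mechanically. The sketch assumes the usual definitions: a unique rule requires every value of an attribute to be distinct, and a null rule fixes which markers denote a missing value. The table, the column names, and the choice of "?" and the empty string as null markers are hypothetical.

```python
import pandas as pd

# Hypothetical records; "?" and "" are assumed to mean "unknown" in this source.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "phone": ["555-0101", "?", "555-0103", ""],
})

# Unique rule: every customer_id must differ from all the others.
violates_unique = df[df["customer_id"].duplicated(keep=False)]

# Null rule: replace the agreed null markers with a real missing value.
null_markers = ["?", ""]
df["phone"] = df["phone"].mask(df["phone"].isin(null_markers))
missing_phone = df["phone"].isna()

print(violates_unique.index.tolist(), missing_phone.tolist())
```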
Quiz
What is the purpose of data cleaning?
1 To improve the speed of data processing
2 To get rid of commonly found errors and mistakes in a dataset
3 To collect new data points
4 To generate more data
Quiz
What is "erroneous data"?
1 Data points that are missing from the dataset
2 Data points that are incorrect or invalid
3 Data points that are repetitive
4 Data points that are too large
Quiz
Which of the following is considered an outlier?
1 A data point that is repeated multiple times
2 A data point lying far outside the range of the other observations
3 A data point that is entirely missing
4 A data point that contains invalid characters
Quiz
Which of these errors are commonly addressed during data
cleaning?
1 Outliers
2 Missing data
3 Erroneous data
4 All of the above
Data Integration
Data integration combines data from different sources into a
unified dataset.
• Merge tables from multiple
databases.
• Resolve conflicts between
data sources.
Example:
• Merging customer
information from two
different systems.
import pandas as pd
merged_data = pd.merge(data1, data2, on='customer_id')
Figure 5: Data Integration
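A plain `pd.merge` performs an inner join, so customers that exist in only one system silently disappear. The sketch below contrasts that with an outer join; the two small tables and their contents are invented for illustration.

```python
import pandas as pd

# Hypothetical extracts from two systems that share customer_id.
data1 = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Bimal", "Chitra"],
})
data2 = pd.DataFrame({
    "customer_id": [2, 3, 4],
    "city": ["Kalyani", "Kolkata", "Howrah"],
})

# Inner join (the pd.merge default): only customers present in both systems.
inner = pd.merge(data1, data2, on="customer_id")

# Outer join keeps everyone; indicator=True adds a _merge column
# recording whether each row came from the left, the right, or both.
outer = pd.merge(data1, data2, on="customer_id", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(inner["customer_id"].tolist())
print(sorted(unmatched["customer_id"].tolist()))
```

Inspecting the unmatched rows is a quick way to surface conflicts between sources before they are resolved.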
Data Integration issues
• Entity identification problem
• Redundancy
• Tuple Duplication
Handling Redundant Data in Data Integration
Redundant data often occur when multiple databases are integrated:
• The same attribute may have different names in different
databases
• An attribute in one table may be derived from attributes in
another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation
analysis
• Careful integration of data from multiple sources helps
reduce or avoid redundancies and inconsistencies and improves
mining speed and quality
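Correlation analysis for redundancy can be sketched as follows. The table is hypothetical: `annual_revenue` is constructed as twelve times `monthly_revenue`, and the 0.95 threshold is an assumed cut-off rather than a standard value.

```python
import pandas as pd

# Hypothetical integrated table: annual_revenue is derived from monthly_revenue.
df = pd.DataFrame({
    "monthly_revenue": [10.0, 12.5, 9.0, 15.0, 11.0],
    "annual_revenue": [120.0, 150.0, 108.0, 180.0, 132.0],
    "employees": [5, 9, 4, 6, 12],
})

# Pairwise Pearson correlation (the DataFrame.corr default).
corr = df.corr()

# Flag attribute pairs whose |r| exceeds the assumed threshold.
threshold = 0.95
cols = list(corr.columns)
redundant_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(redundant_pairs)
```

A high correlation only suggests redundancy; whether one attribute can be dropped still needs a check against how the attributes are defined.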
Data Transformation
Data transformation involves converting data into a format that is
suitable for analysis.
• Normalize or standardize
data
• Apply aggregation
Example:
• Convert a salary column
from USD to EUR.
data['salary_eur'] = data['salary_usd'] * exchange_rate
Figure 6: Data Transformation
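Normalization and standardization, mentioned in the list above, can be written out directly. This is a minimal sketch on a made-up salary column: min-max normalization rescales values to [0, 1], and z-score standardization shifts them to zero mean and unit standard deviation.

```python
import pandas as pd

# Hypothetical salary column in USD.
data = pd.DataFrame({"salary_usd": [30000.0, 45000.0, 60000.0, 90000.0]})
s = data["salary_usd"]

# Min-max normalization: (x - min) / (max - min), giving values in [0, 1].
data["salary_minmax"] = (s - s.min()) / (s.max() - s.min())

# Z-score standardization: (x - mean) / std, using the population std (ddof=0).
data["salary_z"] = (s - s.mean()) / s.std(ddof=0)

print(data["salary_minmax"].tolist())
```

When a model is trained, the minimum, maximum, mean, and standard deviation are normally computed on the training set only and then reused for the other sets.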
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
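The three operations above can each be illustrated in a line or two of pandas. The sales table, the moving-average window of 3, and the city-to-state mapping used as a concept hierarchy are assumptions chosen for this sketch.

```python
import pandas as pd

# Hypothetical daily sales records.
sales = pd.DataFrame({
    "city": ["Kalyani", "Kolkata", "Pune", "Mumbai", "Kolkata", "Pune"],
    "amount": [100.0, 140.0, 90.0, 400.0, 120.0, 110.0],
})

# Smoothing: a 3-point centered moving average damps the spike at 400.
sales["amount_smooth"] = (
    sales["amount"].rolling(window=3, center=True, min_periods=1).mean()
)

# Generalization: climb the concept hierarchy from city to state.
city_to_state = {
    "Kalyani": "West Bengal", "Kolkata": "West Bengal",
    "Pune": "Maharashtra", "Mumbai": "Maharashtra",
}
sales["state"] = sales["city"].map(city_to_state)

# Aggregation: summarize the amounts at the state level.
by_state = sales.groupby("state")["amount"].sum()

print(by_state)
```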
Data Reduction
Data reduction reduces the volume of data while preserving its
important characteristics.
• Dimensionality reduction (e.g., PCA)
• Sampling
Example:
• Reducing the number of features in a dataset using PCA.
from sklearn.decomposition import PCA
reduced = PCA(n_components=2).fit_transform(data)  # n_components=2 is an illustrative choice
Data Reduction - Strategies
• Data cube aggregation
• Dimension Reduction
• Data Compression
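To show what dimensionality reduction with PCA does, the sketch below spells the computation out with NumPy only (centering followed by a singular value decomposition) instead of calling scikit-learn's `PCA` class. The data matrix and the choice of keeping two components are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: 6 samples, 3 features; the third is nearly 2 x the first.
X = np.array([
    [1.0, 5.0, 2.1],
    [2.0, 3.0, 3.9],
    [3.0, 6.0, 6.2],
    [4.0, 2.0, 7.8],
    [5.0, 4.0, 10.1],
    [6.0, 1.0, 12.0],
])

# Center each feature, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of the total variance captured by each principal component.
explained_ratio = S**2 / np.sum(S**2)

# Keep the first k components (k = 2 is an illustrative choice).
k = 2
X_reduced = X_centered @ Vt[:k].T

print(X_reduced.shape)
print(explained_ratio.round(4))
```

Because the third feature is almost a multiple of the first, the last component carries very little variance, which is why dropping it loses little information.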
Data Splitting
Data splitting divides the dataset into training, validation, and test
sets.
• Training set: for training the model
• Test set: for evaluating model
performance
• Validation set: for tuning
hyper-parameters
Example:
• Splitting a dataset into 80%
training and 20% testing.
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, test_size=0.2)
Figure 7: Data Splitting
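The example on the slide produces two sets, while the definition mentions three. A three-way split can be sketched with plain pandas as below; the dataset, the 60/20/20 ratios, and the seed are illustrative assumptions, and calling scikit-learn's `train_test_split` twice is another common route.

```python
import pandas as pd

# Hypothetical dataset of 100 rows.
data = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Shuffle once with a fixed seed so the split is reproducible.
shuffled = data.sample(frac=1.0, random_state=42).reset_index(drop=True)

# 60% train, 20% validation, 20% test, using integer arithmetic for the cut points.
n = len(shuffled)
train_end, val_end = n * 60 // 100, n * 80 // 100
train_data = shuffled.iloc[:train_end]
val_data = shuffled.iloc[train_end:val_end]
test_data = shuffled.iloc[val_end:]

print(len(train_data), len(val_data), len(test_data))
```

The three subsets are disjoint by construction, so no row used for training can leak into validation or testing.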
Quiz
Which of the following is NOT a data reduction strategy?
1 Data cube aggregation
2 Dimension reduction
3 Data visualization
4 Data compression
Quiz
What is data splitting?
1 Combining multiple datasets into one
2 Dividing a dataset into subsets for different purposes
3 Compressing a dataset to reduce size
4 Organizing a dataset into alphabetical order
Quiz
What is the purpose of data reduction techniques?
1 To increase the volume of data
2 To represent data with a reduced size while maintaining
integrity
3 To eliminate unnecessary datasets
4 To enhance the speed of data collection
Data Visualization
Data visualization involves creating graphs and charts to represent
data.
• Helps to understand patterns, trends, and insights.
• Types of visualization: bar charts, line charts, histograms, etc.
Example:
• Visualizing the distribution of ages in a dataset.
import matplotlib.pyplot as plt
plt.hist(data['age'])
plt.show()
Applications: presenting statistics, mapping, showing change over
time, comparing values, and showing connections.
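A histogram is a set of bin counts drawn as bars. The binning step can be inspected on its own with `numpy.histogram`, as in the sketch below; the ages, the four bins, and the 10 to 90 range are made-up values, and the text rendering stands in for the plotted bars.

```python
import numpy as np

# Hypothetical ages.
ages = np.array([18, 22, 25, 25, 31, 34, 38, 42, 47, 55, 61, 70])

# Four equal-width bins over an assumed range of 10 to 90 years.
counts, edges = np.histogram(ages, bins=4, range=(10, 90))

# Text rendering of the histogram, one row per bin.
for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    print(f"{lo:>4.0f}-{hi:<4.0f} {'#' * int(count)}")
```

Passing the same `bins` and `range` arguments to `plt.hist` should draw bars of these heights.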
Data Visualization
By using visual elements like charts, graphs, and maps, data
visualization tools provide an accessible way to see and understand
trends and patterns in data.
Examples:
• Histograms
• Bar graphs
• Pie charts
• Donut charts
• Gantt charts
• Line graphs
• Maps, etc.
Figure 8: Data Visualization
Conclusion
Data preprocessing is an essential step in data science and machine
learning.
• Proper preprocessing generally leads to better model accuracy.
• The steps include data cleaning, integration, transformation,
reduction, and splitting.
Remember: Good data is the foundation of good insights.
Thank You

Introduction to Data Preprocessing for Machine Learning
