DATA SCIENCE PROCESS: RETRIEVING
DATA; CLEANSING, INTEGRATING AND
TRANSFORMING DATA
DR. M. ARIVUKARASI, PROFESSOR
SIMATS UNIVERSITY
INTRODUCTION TO DATA SCIENCE PROCESS
• Steps in Data Science:
• Problem definition
• Data retrieval
• Data preparation (cleansing, integration, transformation)
• Modeling
• Evaluation
• Deployment
WHY DATA PREPARATION IS IMPORTANT
• An estimated 80% of a data scientist's time is spent on data preparation
• Poor-quality data leads to misleading results
• Clean, integrated, and transformed data ensures accurate models
OVERVIEW OF DATA PREPARATION
• Retrieving Data
• Data Cleansing
• Data Integration
• Data Transformation
DATA RETRIEVAL
• Collecting data from various sources:
• Databases
• APIs
• Files (CSV, JSON, XML)
• Web scraping
• Important to ensure correct data formats and sources
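A minimal retrieval sketch in Python with Pandas and requests; the file name and API endpoint below are placeholders, not real resources:

import pandas as pd
import requests

# Read a local CSV file into a DataFrame (file name is a placeholder).
customers = pd.read_csv("customers.csv")

# Fetch JSON records from a hypothetical REST endpoint.
resp = requests.get("https://api.example.com/orders", timeout=10)
resp.raise_for_status()             # fail fast on HTTP errors
orders = pd.DataFrame(resp.json())  # list of JSON objects -> one row each

print(customers.shape, orders.shape)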
COMMON DATA RETRIEVAL TOOLS
• SQL
• Python (Pandas, requests)
• APIs (REST, GraphQL)
• Web scraping tools: BeautifulSoup, Scrapy
• Data warehousing tools: AWS Redshift, Google BigQuery
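For SQL sources, query results can be loaded straight into a DataFrame; a sketch using Python's built-in sqlite3, where the database file and table are assumptions for illustration:

import sqlite3
import pandas as pd

# Open a local SQLite database (file and table names are placeholders).
conn = sqlite3.connect("sales.db")
sales = pd.read_sql("SELECT customer_id, amount, sale_date FROM sales", conn)
conn.close()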
CHALLENGES IN DATA RETRIEVAL
• Inconsistent formats
• Missing data
• API limits
• Access control and permissions
• Data duplication
WHAT IS DATA CLEANSING?
• The process of detecting and correcting (or removing) corrupt or
inaccurate records
• Goals:
• Remove noise
• Handle missing values
• Correct inconsistencies
TECHNIQUES IN DATA CLEANSING
• Removing duplicates
• Handling missing values (imputation, deletion)
• Correcting typos and formatting
• Standardizing data (dates, case, units)
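A minimal Pandas sketch of these techniques on a toy table (all column names and values are made up):

import pandas as pd

# Toy data with duplicates, missing values, and messy formatting.
df = pd.DataFrame({
    "name": ["  alice ", "BOB", "BOB", None],
    "city": ["Chennai", None, None, "Delhi"],
    "signup_date": ["2024-05-01", "2024-05-02", "2024-05-02", "not a date"],
})

df = df.drop_duplicates()                        # remove exact duplicate rows
df["city"] = df["city"].fillna("Unknown")        # impute missing values
df = df.dropna(subset=["name"])                  # or delete rows missing a key field
df["name"] = df["name"].str.strip().str.title()  # standardize case and spacing
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # bad dates -> NaT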
TOOLS FOR DATA CLEANSING
• OpenRefine
• Trifacta Wrangler
• Python (Pandas, NumPy)
• R (dplyr, tidyr)
• Excel (Power Query)
DATA INTEGRATION – WHAT AND WHY?
• Combining data from multiple sources
• Objective:
• Create a unified view
• Remove redundancy
• Maintain consistency
METHODS OF DATA INTEGRATION
• Manual merging (Excel)
• ETL (Extract, Transform, Load)
• Data Warehousing
• API-based data consolidation
• Data virtualization
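A lightweight, ETL-style sketch with Pandas, assuming two hypothetical sources that share a customer_id key:

import pandas as pd

# Two small sources sharing a customer_id key (values are made up).
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
sales = pd.DataFrame({"customer_id": [2, 3, 3], "amount": [120.0, 80.0, 45.5]})

# Join on the shared key to build a unified view.
merged = crm.merge(sales, on="customer_id", how="inner")

# Remove redundancy: collapse to one row per customer.
summary = merged.groupby(["customer_id", "name"], as_index=False)["amount"].sum()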
TOOLS FOR DATA INTEGRATION
• Apache NiFi
• Talend
• Informatica
• Microsoft SSIS
• Apache Camel
CHALLENGES IN INTEGRATION
• Different formats and schema
• Duplicate records
• Data conflicts (inconsistent values)
• Semantic mismatch
WHAT IS DATA TRANSFORMATION?
• Changing the format, structure, or values of data
• Makes data compatible with analysis or modeling
• Includes:
• Normalization
• Aggregation
• Encoding
• Feature scaling
TYPES OF DATA TRANSFORMATION
• Structural: changing columns, merging/splitting
• Syntactic: formatting, date/time conversion
• Semantic: converting categories into meaningful groups
• Encoding: label encoding, one-hot encoding
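A short Pandas sketch of feature scaling plus both encoding styles, on made-up income and size columns:

import pandas as pd

df = pd.DataFrame({
    "income": [30000, 52000, 91000],
    "size": ["Low", "High", "Medium"],
})

# Feature scaling: min-max normalization of income to [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

# Label encoding preserves the order Low < Medium < High.
df["size_label"] = df["size"].map({"Low": 0, "Medium": 1, "High": 2})

# One-hot encoding is the alternative for unordered categories.
df = pd.get_dummies(df, columns=["size"])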
TOOLS FOR DATA TRANSFORMATION
• Pandas (Python)
• R (tidyverse)
• Talend
• Azure Data Factory
• Databricks
EXAMPLE – DATA PREP WORKFLOW
• Source: CSV + API
• Clean: Handle nulls, remove outliers
• Integrate: Combine by customer ID
• Transform: Normalize income, encode gender
• Output: Ready for ML pipeline
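Putting the steps together, a condensed sketch of this workflow; the file name, endpoint, and column names are all assumptions:

import pandas as pd
import requests

# Retrieve: a local CSV plus a hypothetical API (names are placeholders).
customers = pd.read_csv("customers.csv")
resp = requests.get("https://api.example.com/incomes", timeout=10)
incomes = pd.DataFrame(resp.json())

# Clean: drop rows missing the join key; cap extreme incomes (simple outlier rule).
customers = customers.dropna(subset=["customer_id"])
incomes["income"] = incomes["income"].clip(upper=incomes["income"].quantile(0.99))

# Integrate: combine the sources by customer ID.
data = customers.merge(incomes, on="customer_id", how="inner")

# Transform: z-score the income column, one-hot encode gender.
data["income"] = (data["income"] - data["income"].mean()) / data["income"].std()
data = pd.get_dummies(data, columns=["gender"])
# "data" is now ready to feed an ML pipeline.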
MCQ
1. You are retrieving customer data from a CRM and sales database.
The ‘Customer_ID’ format is inconsistent between systems. What is
your first step in integration?
A) Merge the tables as is
B) Perform inner join without format change
C) Normalize 'Customer_ID' format before joining
D) Remove Customer_ID field
2. During data cleansing, you find 30% of values missing in a ‘Location’
column. Which action is most appropriate?
A) Drop the entire column
B) Replace all with “Unknown”
C) Analyze missing pattern and impute where logical
D) Drop rows with missing values
3. You receive a dataset with categorical variables: “Low”, “Medium”,
and “High”. Which transformation is most appropriate for machine
learning input?
A) Replace with 1, 2, 3
B) Use one-hot encoding
C) Discard the column
D) Leave as is
4. You are combining financial datasets from different countries with
currency fields. What is the best integration approach?
A) Drop the currency field
B) Convert all currencies to a base (e.g., USD)
C) Concatenate all values as strings
D) Standardize only numerical fields
5. Which scenario would require a data transformation rather than
cleansing or integration?
A) Removing duplicate records
B) Changing a date format from “MM/DD/YYYY” to “YYYY-MM-DD”
C) Merging two datasets
D) Joining customer and transaction data
6. You want to perform sentiment analysis on customer feedback data
stored in multiple text files. What is the most efficient initial step?
A) Run a clustering algorithm
B) Integrate all files into a single corpus
C) Visualize using pie charts
D) Clean HTML tags from output
7. You find two columns: "DOB" and "Age". Which cleansing method is
most appropriate?
A) Delete one of them
B) Convert "DOB" to "Age" and check for consistency
C) Replace “DOB” with today's date
D) Ignore both
8. Which of the following is a sign that transformation is needed instead
of cleansing?
A) Null values
B) Irregular spelling of category names
C) Categorical data needs to be fed into a neural network
D) Duplicate rows
9. You are working with IoT sensor data that arrives every second. What
transformation would help make it suitable for daily summary reports?
A) One-hot encoding
B) Log transformation
C) Aggregation by date
D) Z-score normalization
10. During data integration, you notice two datasets contain the same
column "Customer_ID" but one has 20,000 rows and another has 12,000.
What should you investigate?
A) Which has more nulls
B) Use full outer join to preserve all data
C) Whether "Customer_ID" is a primary key in both
D) Drop the larger dataset
THANK YOU
