DATA SCIENCE PROCESS: RETRIEVING
DATA; CLEANSING, INTEGRATING AND
TRANSFORMING DATA
DR. M. ARIVUKARASI, PROFESSOR
SIMATS UNIVERSITY
INTRODUCTION TO DATA SCIENCE PROCESS
• Steps in Data Science:
• Problem definition
• Data retrieval
• Data preparation (cleansing, integration, transformation)
• Modeling
• Evaluation
• Deployment
WHY DATA PREPARATION IS IMPORTANT
• An estimated 80% of a data scientist's time is spent on data preparation
• Poor-quality data leads to misleading results
• Clean, integrated, and transformed data ensures accurate models
OVERVIEW OF DATA PREPARATION
• Retrieving Data
• Data Cleansing
• Data Integration
• Data Transformation
DATA RETRIEVAL
• Collecting data from various sources:
• Databases
• APIs
• Files (CSV, JSON, XML)
• Web scraping
• Important to ensure correct data formats and sources
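A minimal retrieval sketch in Python with Pandas and requests; the file name and API endpoint below are placeholders, not real resources:

import pandas as pd
import requests

# Read a local CSV file into a DataFrame (file name is a placeholder).
customers = pd.read_csv("customers.csv")

# Fetch JSON records from a hypothetical REST endpoint.
resp = requests.get("https://api.example.com/orders", timeout=10)
resp.raise_for_status()             # fail fast on HTTP errors
orders = pd.DataFrame(resp.json())  # list of JSON objects -> one row each

print(customers.shape, orders.shape)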
COMMON DATA RETRIEVAL TOOLS
• SQL
• Python (Pandas, requests)
• APIs (REST, GraphQL)
• Web scraping tools: BeautifulSoup, Scrapy
• Data warehousing tools: AWS Redshift, Google BigQuery
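For SQL sources, query results can be loaded straight into a DataFrame; a sketch using Python's built-in sqlite3, where the database file and table are assumptions for illustration:

import sqlite3
import pandas as pd

# Open a local SQLite database (file and table names are placeholders).
conn = sqlite3.connect("sales.db")
sales = pd.read_sql("SELECT customer_id, amount, sale_date FROM sales", conn)
conn.close()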
CHALLENGES IN DATA RETRIEVAL
• Inconsistent formats
• Missing data
• API limits
• Access control and permissions
• Data duplication
WHAT IS DATA CLEANSING?
• The process of detecting and correcting (or removing) corrupt or
inaccurate records
• Goals:
• Remove noise
• Handle missing values
• Correct inconsistencies
TECHNIQUES IN DATA CLEANSING
• Removing duplicates
• Handling missing values (imputation, deletion)
• Correcting typos and formatting
• Standardizing data (dates, case, units)
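A minimal Pandas sketch of these techniques on a toy table (all column names and values are made up):

import pandas as pd

# Toy data with duplicates, missing values, and messy formatting.
df = pd.DataFrame({
    "name": ["  alice ", "BOB", "BOB", None],
    "city": ["Chennai", None, None, "Delhi"],
    "signup_date": ["2024-05-01", "2024-05-02", "2024-05-02", "not a date"],
})

df = df.drop_duplicates()                        # remove exact duplicate rows
df["city"] = df["city"].fillna("Unknown")        # impute missing values
df = df.dropna(subset=["name"])                  # or delete rows missing a key field
df["name"] = df["name"].str.strip().str.title()  # standardize case and spacing
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")  # bad dates -> NaT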
TOOLS FOR DATA CLEANSING
• OpenRefine
• Trifacta Wrangler
• Python (Pandas, NumPy)
• R (dplyr, tidyr)
• Excel (Power Query)
DATA INTEGRATION – WHAT AND WHY?
• Combining data from multiple sources
• Objective:
• Create a unified view
• Remove redundancy
• Maintain consistency
METHODS OF DATA INTEGRATION
• Manual merging (Excel)
• ETL (Extract, Transform, Load)
• Data Warehousing
• API-based data consolidation
• Data virtualization
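A lightweight, ETL-style sketch with Pandas, assuming two hypothetical sources that share a customer_id key:

import pandas as pd

# Two small sources sharing a customer_id key (values are made up).
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
sales = pd.DataFrame({"customer_id": [2, 3, 3], "amount": [120.0, 80.0, 45.5]})

# Join on the shared key to build a unified view.
merged = crm.merge(sales, on="customer_id", how="inner")

# Remove redundancy: collapse to one row per customer.
summary = merged.groupby(["customer_id", "name"], as_index=False)["amount"].sum()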
TOOLS FOR DATA INTEGRATION
• Apache NiFi
• Talend
• Informatica
• Microsoft SSIS
• Apache Camel
CHALLENGES IN INTEGRATION
• Different formats and schema
• Duplicate records
• Data conflicts (inconsistent values)
• Semantic mismatch
WHAT IS DATA TRANSFORMATION?
• Changing the format, structure, or values of data
• Makes data compatible with analysis or modeling
• Includes:
• Normalization
• Aggregation
• Encoding
• Feature scaling
TYPES OF DATA TRANSFORMATION
• Structural: changing columns, merging/splitting
• Syntactic: formatting, date/time conversion
• Semantic: converting categories into meaningful groups
• Encoding: label encoding, one-hot encoding
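A short Pandas sketch of feature scaling plus both encoding styles, on made-up income and size columns:

import pandas as pd

df = pd.DataFrame({
    "income": [30000, 52000, 91000],
    "size": ["Low", "High", "Medium"],
})

# Feature scaling: min-max normalization of income to [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

# Label encoding preserves the order Low < Medium < High.
df["size_label"] = df["size"].map({"Low": 0, "Medium": 1, "High": 2})

# One-hot encoding is the alternative for unordered categories.
df = pd.get_dummies(df, columns=["size"])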
TOOLS FOR DATA TRANSFORMATION
• Pandas (Python)
• R (tidyverse)
• Talend
• Azure Data Factory
• Databricks
EXAMPLE – DATA PREP WORKFLOW
• Source: CSV + API
• Clean: Handle nulls, remove outliers
• Integrate: Combine by customer ID
• Transform: Normalize income, encode gender
• Output: Ready for ML pipeline
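Putting the steps together, a condensed sketch of this workflow; the file name, endpoint, and column names are all assumptions:

import pandas as pd
import requests

# Retrieve: a local CSV plus a hypothetical API (names are placeholders).
customers = pd.read_csv("customers.csv")
resp = requests.get("https://api.example.com/incomes", timeout=10)
incomes = pd.DataFrame(resp.json())

# Clean: drop rows missing the join key; cap extreme incomes (simple outlier rule).
customers = customers.dropna(subset=["customer_id"])
incomes["income"] = incomes["income"].clip(upper=incomes["income"].quantile(0.99))

# Integrate: combine the sources by customer ID.
data = customers.merge(incomes, on="customer_id", how="inner")

# Transform: z-score the income column, one-hot encode gender.
data["income"] = (data["income"] - data["income"].mean()) / data["income"].std()
data = pd.get_dummies(data, columns=["gender"])
# "data" is now ready to feed an ML pipeline.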
MCQ
1. You are retrieving customer data from a CRM and sales database.
The ‘Customer_ID’ format is inconsistent between systems. What is
your first step in integration?
A) Merge the tables as is
B) Perform inner join without format change
C) Normalize 'Customer_ID' format before joining
D) Remove Customer_ID field
2. During data cleansing, you find 30% of values missing in a ‘Location’
column. Which action is most appropriate?
A) Drop the entire column
B) Replace all with “Unknown”
C) Analyze missing pattern and impute where logical
D) Drop rows with missing values
3. You receive a dataset with categorical variables: “Low”, “Medium”,
and “High”. Which transformation is most appropriate for machine
learning input?
A) Replace with 1, 2, 3
B) Use one-hot encoding
C) Discard the column
D) Leave as is
4. You are combining financial datasets from different countries with
currency fields. What is the best integration approach?
A) Drop the currency field
B) Convert all currencies to a base (e.g., USD)
C) Concatenate all values as strings
D) Standardize only numerical fields
5. Which scenario would require a data transformation rather than
cleansing or integration?
A) Removing duplicate records
B) Changing a date format from “MM/DD/YYYY” to “YYYY-MM-DD”
C) Merging two datasets
D) Joining customer and transaction data
6. You want to perform sentiment analysis on customer feedback data
stored in multiple text files. What is the most efficient initial step?
A) Run a clustering algorithm
B) Integrate all files into a single corpus
C) Visualize using pie charts
D) Clean HTML tags from output
7. You find two columns: "DOB" and "Age". Which cleansing method is
most appropriate?
A) Delete one of them
B) Convert "DOB" to "Age" and check for consistency
C) Replace “DOB” with today's date
D) Ignore both
8. Which of the following is a sign that transformation is needed instead
of cleansing?
A) Null values
B) Irregular spelling of category names
C) Categorical data needs to be fed into a neural network
D) Duplicate rows
9. You are working with IoT sensor data that arrives every second. What
transformation would help make it suitable for daily summary reports?
A) One-hot encoding
B) Log transformation
C) Aggregation by date
D) Z-score normalization
10. During data integration, you notice two datasets contain the same
column "Customer_ID" but one has 20,000 rows and another has 12,000.
What should you investigate?
A) Which has more nulls
B) Use full outer join to preserve all data
C) Whether "Customer_ID" is a primary key in both
D) Drop the larger dataset
THANK YOU
