Data Science Process - Fundamentals of Data Science
DATA SCIENCE PROCESS: RETRIEVING DATA;
CLEANSING, INTEGRATING AND
TRANSFORMING DATA
DR. M. ARIVUKARASI, PROFESSOR
SIMATS UNIVERSITY
INTRODUCTION TO THE DATA SCIENCE PROCESS
• Steps in Data Science:
• Problem definition
• Data retrieval
• Data preparation (cleansing, integration, transformation)
• Modeling
• Evaluation
• Deployment
WHY DATA PREPARATION IS IMPORTANT
• An estimated 80% of a data scientist's time is spent on data preparation
• Poor-quality data leads to misleading results
• Clean, integrated, and transformed data ensures accurate models
OVERVIEW OF DATA PREPARATION
• Retrieving Data
• Data Cleansing
• Data Integration
• Data Transformation
DATA RETRIEVAL
• Collecting data from various sources:
• Databases
• APIs
• Files (CSV, JSON, XML)
• Web scraping
• Important to ensure correct data formats and sources
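As a minimal sketch of retrieving CSV and JSON sources with pandas — the inline strings below are hypothetical stand-ins for real files or API payloads, so the example is self-contained:

```python
import io
import pandas as pd

# Inline strings stand in for real CSV/JSON sources (hypothetical data).
csv_source = "customer_id,income\nC001,52000\nC002,61000\n"
json_source = ('[{"customer_id": "C001", "region": "South"},'
               ' {"customer_id": "C002", "region": "North"}]')

customers = pd.read_csv(io.StringIO(csv_source))   # CSV retrieval
regions = pd.read_json(io.StringIO(json_source))   # JSON retrieval
```

With real sources, the `io.StringIO` wrappers would simply be replaced by file paths or an HTTP response body.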
COMMON DATA RETRIEVAL TOOLS
• SQL
• Python (Pandas, requests)
• APIs (REST, GraphQL)
• Web scraping tools: BeautifulSoup, Scrapy
• Data warehousing tools: AWS Redshift, Google BigQuery
CHALLENGES IN DATA RETRIEVAL
• Inconsistent formats
• Missing data
• API limits
• Access control and permissions
• Data duplication
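API limits are commonly handled by paging through an endpoint rather than requesting everything at once. A hedged sketch of that pattern — the in-memory "API" and the `fetch_page`/`fetch_all` helpers are hypothetical, standing in for a real rate-limited REST endpoint:

```python
# A toy in-memory "API" stands in for a real paginated endpoint (hypothetical data).
RECORDS = [{"id": i} for i in range(1, 8)]

def fetch_page(page, page_size=3):
    """Return one page of records, mimicking a paginated REST API."""
    start = (page - 1) * page_size
    return RECORDS[start:start + page_size]

def fetch_all(page_size=3):
    """Page through the endpoint until an empty page signals the end."""
    results, page = [], 1
    while True:
        batch = fetch_page(page, page_size)
        if not batch:
            break
        results.extend(batch)
        page += 1
    return results

rows = fetch_all()
```

Against a real API, each `fetch_page` call would also respect the provider's rate limits (e.g., by sleeping between requests).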
WHAT IS DATA CLEANSING?
• The process of detecting and correcting (or removing) corrupt or
inaccurate records
• Goals:
• Remove noise
• Handle missing values
• Correct inconsistencies
TECHNIQUES IN DATA CLEANSING
• Removing duplicates
• Handling missing values (imputation, deletion)
• Correcting typos and formatting
• Standardizing data (dates, case, units)
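The techniques above can be sketched in pandas (pandas ≥ 2.0 for `format="mixed"`; the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical raw data with duplicates, mixed date formats, case noise, and a missing value.
raw = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob", "Carol"],
    "signup": ["01/02/2024", "01/02/2024", "2024-03-05", "2024-04-09"],
    "age": [34, 34, None, 29],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()              # standardize case/whitespace
clean["signup"] = pd.to_datetime(clean["signup"], format="mixed")  # standardize dates
clean = clean.drop_duplicates(subset=["name", "signup"])           # remove duplicates
clean["age"] = clean["age"].fillna(clean["age"].median())          # impute missing values
```

Note the order matters: standardizing first lets `drop_duplicates` catch records that only differ in formatting.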
TOOLS FOR DATA CLEANSING
• OpenRefine
• Trifacta Wrangler
• Python (Pandas, NumPy)
• R (dplyr, tidyr)
• Excel (Power Query)
DATA INTEGRATION – WHAT AND WHY?
• Combining data from multiple sources
• Objective:
• Create a unified view
• Remove redundancy
• Maintain consistency
METHODS OF DATA INTEGRATION
• Manual merging (Excel)
• ETL (Extract, Transform, Load)
• Data Warehousing
• API-based data consolidation
• Data virtualization
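At a small scale, integration often comes down to joining tables on a shared key. A minimal pandas sketch (the `crm` and `sales` tables are hypothetical):

```python
import pandas as pd

# Two hypothetical sources sharing a customer_id key.
crm = pd.DataFrame({"customer_id": ["C001", "C002"], "name": ["Alice", "Bob"]})
sales = pd.DataFrame({"customer_id": ["C001", "C002", "C001"],
                      "amount": [120.0, 75.0, 60.0]})

# Aggregate first so each customer appears once, then merge on the shared key.
totals = sales.groupby("customer_id", as_index=False)["amount"].sum()
unified = crm.merge(totals, on="customer_id", how="left")
```

Aggregating before the join avoids the redundancy the slide warns about: a raw join would duplicate each CRM row once per sale.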
TOOLS FOR DATA INTEGRATION
• Apache NiFi
• Talend
• Informatica
• Microsoft SSIS
• Apache Camel
CHALLENGES IN INTEGRATION
• Different formats and schemas
• Duplicate records
• Data conflicts (inconsistent values)
• Semantic mismatch
WHAT IS DATA TRANSFORMATION?
• Changing the format, structure, or values of data
• Makes data compatible with analysis or modeling
• Includes:
• Normalization
• Aggregation
• Encoding
• Feature scaling
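Normalization (one form of feature scaling) can be sketched in a few lines of pandas; the income values below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"income": [30000, 50000, 90000]})

# Min-max normalization rescales income into the [0, 1] range.
lo, hi = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - lo) / (hi - lo)
```

After this transformation all values fall in [0, 1], which keeps large-magnitude features from dominating distance-based models.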
TYPES OF DATA TRANSFORMATION
• Structural: changing columns, merging/splitting
• Syntactic: formatting, date/time conversion
• Semantic: converting categories into meaningful groups
• Encoding: label encoding, one-hot encoding
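The two encoding styles can be contrasted in pandas; the `risk` column and its Low/Medium/High ordering are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"risk": ["Low", "High", "Medium", "Low"]})

# One-hot encoding: one binary column per category.
onehot = pd.get_dummies(df["risk"], prefix="risk")

# Label (ordinal) encoding: map categories with a natural order to integers.
order = {"Low": 0, "Medium": 1, "High": 2}
df["risk_code"] = df["risk"].map(order)
```

One-hot encoding suits unordered categories; the integer mapping only makes sense when the categories really are ordered, as Low/Medium/High are here.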
TOOLS FOR DATA TRANSFORMATION
• Pandas (Python)
• R (tidyverse)
• Talend
• Azure Data Factory
• Databricks
EXAMPLE – DATA PREP WORKFLOW
• Source: CSV + API
• Clean: Handle nulls, remove outliers
• Integrate: Combine by customer ID
• Transform: Normalize income, encode gender
• Output: Ready for ML pipeline
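The workflow above can be sketched end to end in pandas; the DataFrames below are hypothetical stand-ins for the CSV and API sources:

```python
import pandas as pd

# Hypothetical source data standing in for the CSV file and API feed.
csv_data = pd.DataFrame({"customer_id": ["C1", "C2", "C3"],
                         "income": [40000.0, None, 80000.0]})
api_data = pd.DataFrame({"customer_id": ["C1", "C2", "C3"],
                         "gender": ["F", "M", "F"]})

# Clean: impute missing income with the median.
csv_data["income"] = csv_data["income"].fillna(csv_data["income"].median())

# Integrate: combine the two sources on customer_id.
df = csv_data.merge(api_data, on="customer_id")

# Transform: min-max normalize income, one-hot encode gender.
lo, hi = df["income"].min(), df["income"].max()
df["income"] = (df["income"] - lo) / (hi - lo)
df = pd.get_dummies(df, columns=["gender"])
```

The resulting frame is numeric throughout and ready to feed into an ML pipeline.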
1. You are retrieving customer data from a CRM and a sales database. The
‘Customer_ID’ format is inconsistent between the two systems. What is
your first step in integration?
A) Merge the tables as is
B) Perform inner join without format change
C) Normalize 'Customer_ID' format before joining
D) Remove Customer_ID field
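One way to carry out option C in practice — a minimal sketch, where the sample IDs, the hyphen separator, and the `normalize_id` helper are all hypothetical:

```python
import pandas as pd

# Hypothetical tables whose Customer_ID formats disagree.
crm = pd.DataFrame({"Customer_ID": ["c-001", "C-002"]})
sales = pd.DataFrame({"Customer_ID": ["C001", "C002"], "amount": [10, 20]})

def normalize_id(s):
    """Strip separators and upper-case so both systems share one key format."""
    return s.str.replace("-", "", regex=False).str.upper()

crm["Customer_ID"] = normalize_id(crm["Customer_ID"])
sales["Customer_ID"] = normalize_id(sales["Customer_ID"])
joined = crm.merge(sales, on="Customer_ID")
```

Without the normalization step, the merge would silently produce zero matching rows.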
2. During data cleansing, you find 30% of values missing in a ‘Location’
column. Which action is most appropriate?
A) Drop the entire column
B) Replace all with “Unknown”
C) Analyze missing pattern and impute where logical
D) Drop rows with missing values
3. You receive a dataset with categorical variables: “Low”, “Medium”,
and “High”. Which transformation is most appropriate for machine
learning input?
A) Replace with 1, 2, 3
B) Use one-hot encoding
C) Discard the column
D) Leave as is
4. You are combining financial datasets from different countries with
currency fields. What is the best integration approach?
A) Drop the currency field
B) Convert all currencies to a base (e.g., USD)
C) Concatenate all values as strings
D) Standardize only numerical fields
5. Which scenario would require a data transformation rather than
cleansing or integration?
A) Removing duplicate records
B) Changing a date format from “MM/DD/YYYY” to “YYYY-MM-DD”
C) Merging two datasets
D) Joining customer and transaction data
6. You want to perform sentiment analysis on customer feedback data
stored in multiple text files. What is the most efficient initial step?
A) Run a clustering algorithm
B) Integrate all files into a single corpus
C) Visualize using pie charts
D) Clean HTML tags from output
7. You find two columns: "DOB" and "Age". Which cleansing method is
most appropriate?
A) Delete one of them
B) Convert "DOB" to "Age" and check for consistency
C) Replace “DOB” with today's date
D) Ignore both
8. Which of the following is a sign that transformation is needed instead
of cleansing?
A) Null values
B) Irregular spelling of category names
C) Categorical data needs to be fed into a neural network
D) Duplicate rows
9. You are working with IoT sensor data that arrives every second. What
transformation would help make it suitable for daily summary reports?
A) One-hot encoding
B) Log transformation
C) Aggregation by date
D) Z-score normalization
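Aggregation by date (option C) can be sketched with pandas resampling; the sensor values and the 12-hour spacing below are hypothetical stand-ins for per-second readings:

```python
import pandas as pd

# Hypothetical sensor readings over two days (12-hour spacing keeps the sketch small).
idx = pd.date_range("2024-01-01", periods=4, freq="12h")
readings = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)

# Aggregation by date turns high-frequency data into a daily summary.
daily = readings.resample("D").mean()
```

`resample` groups the timestamped readings into calendar-day bins, so each output row is one day's mean — exactly the shape a daily report needs.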
10. During data integration, you notice two datasets contain the same
column "Customer_ID", but one has 20,000 rows and the other has 12,000.
What should you investigate?
A) Which has more nulls
B) Use full outer join to preserve all data
C) Whether "Customer_ID" is a primary key in both
D) Drop the larger dataset