CS 3351
FOUNDATIONS OF DATA SCIENCE
PREPARED BY : MR. K. DANIEL RAJ B.TECH.,M.E.,MISTE.,
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ( UG & PG )
UNIT 1: INTRODUCTION

Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting findings and building applications – Data Mining – Data Warehousing – Basic statistical descriptions of data
1.1 DATA SCIENCE
WHAT IS DATA SCIENCE?
Interdisciplinary field to extract knowledge and actionable insights from data.
Combines math, statistics, programming, AI, and subject expertise.
Predicts future trends using historical data and machine learning.
Used for business decisions, research, and automation of intelligence tasks.
LIFE CYCLE OF DATA SCIENCE
1.Capture: Data acquisition and extraction.
2.Maintain: Data warehousing, cleansing, and processing.
3.Process: Mining, clustering, classification, modeling.
4.Analyze: Reporting, visualization, BI, decision making.
5.Communicate: Predictive analysis, text mining, qualitative analysis.
1.1.1 BIG DATA
Large volumes of complex data generated rapidly from varied sources.
Cannot be processed effectively with traditional methods.
Characterized by the "3 V's" (Volume, Velocity, Variety), often extended to the "5 V's" with Veracity and Value.
1.1.2 Characteristics of Big Data

Characteristic | Description
Volume | Extremely large data amounts (terabytes to petabytes).
Velocity | Speed of data generation and processing (near real-time).
Variety | Different data types: structured, unstructured, streaming.
Veracity | Trustworthiness and credibility of data sources.
Value | Business value derived from data analysis.
1.1.3 DIFFERENCE BETWEEN DATA SCIENCE AND BIG DATA
Aspect | Data Science | Big Data
Definition | Extraction of insights and actionable knowledge from data. | Management, storage, and processing of very large and complex datasets.
Primary Focus | Analyzing data using algorithms, statistics, machine learning. | Handling high volume, velocity, and variety of data.
Goal | Create predictive models and support decision-making. | Efficient data storage, retrieval, and basic data processing.
Data Type | Cleaned, processed, and prepared data for modeling. | Raw, often unstructured and rapidly generated data.
Techniques Used | Statistical analysis, machine learning, data mining. | Distributed computing (e.g., Hadoop, Spark), NoSQL databases.
Tools | Python, R, SAS, TensorFlow, Scikit-learn. | Hadoop ecosystem, Spark, Cassandra, MongoDB, Apache Kafka.
Outcome | Actionable insights, forecasts, and data-driven decisions. | Large-scale data repositories and infrastructure to support analytics.
Skillsets Needed | Data science, statistics, programming, domain expertise. | Data engineering, distributed computing, database management.
Applications | Fraud detection, recommendation systems, predictive analytics. | Social media analytics, IoT data handling, real-time data streams.
1.1.4 COMPARISON BETWEEN CLOUD COMPUTING AND BIG DATA
Parameter | Big Data | Cloud Computing
Definition | Refers to very large, complex, and rapidly growing data sets that traditional tools can't handle. | On-demand availability of computing resources (servers, storage, software) over the internet.
Focus | Managing, storing, and analyzing massive volumes of structured, semi-structured, and unstructured data. | Providing scalable IT infrastructure and services remotely to users via the internet.
Core Characteristics | Volume, Velocity, Variety, Veracity, Value (the 5 V's). | On-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
Purpose | Extract useful information and insights to improve business decisions. | Provide flexible, scalable, and cost-effective computing resources without physical infrastructure setup.
Data Handling | Requires distributed computing platforms (e.g., Hadoop, Spark) to process large datasets efficiently. | Uses virtualized resources hosted on cloud vendors (AWS, Azure, Google Cloud).
Services Offered | Data storage, data mining, analysis, visualization. | Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS).
Usage Example | Analyzing social media data, stock market trends, sensor data. | Hosting apps, databases, backup solutions, real-time collaboration tools.
Challenges | Data integration, storage complexity, ensuring data quality, and real-time processing. | Vendor lock-in, security and privacy concerns, downtime risks, dependency on internet connectivity.
DATA SCIENCE VS BIG DATA
Data Science: Focus on extracting insights, predictions, and business intelligence from data.
Big Data: Refers to very large datasets that require specialized storage and processing technologies.
1.1.5 Benefits and Uses of Data Science
Anomaly detection (fraud, disease).
Classification (email sorting, background checks).
Forecasting (sales, customer retention).
Pattern detection (weather, financial markets).
Recognition (face, voice, text).
Recommendation systems.
Regression and optimization tasks.
1.1.6 Benefits and Uses of Big Data
Improved customer service and decision-making.
Reduced maintenance costs and operational efficiency.
Early risk identification and product marketing optimization.
Examples: Social media data, stock exchange trades, compliance reports.
1.2 FACETS OF DATA

Big data and data science generate very large amounts of data. This data comes in many types; the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
1.2.1 STRUCTURED DATA
Data organized in rows and columns, making it easy to store, search, and process.
Stored using database management systems.
Example: Excel tables, relational databases.
1.2.2 UNSTRUCTURED DATA
Data without a predefined format or structure.
Includes text, audio, video, images, and documents.
More than 80% of organizational data is unstructured.
Example: Email messages, customer feedback, photos.
1.2.3 NATURAL LANGUAGE
A subset of unstructured data consisting of human language text or speech.
Processed using Natural Language Processing (NLP) for recognizing, understanding,
and generating language.
Applications: Sentiment analysis, speech recognition, machine translation.
1.2.4 MACHINE-GENERATED DATA
Data created automatically by systems or devices without human input.
Includes logs, telemetry, API data, and sensor outputs.
Can be structured or unstructured.
Example: Web server logs, network event logs.
1.2.5 GRAPH-BASED DATA
Data representing entities (nodes) and their relationships (edges).
Used in social networks, fraud detection, and recommendation systems.
Stored in graph databases and queried via specialized languages like SPARQL.
Enables relationship pattern detection and complex network analysis.
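As a hedged illustration (not from the course material), here is a minimal sketch of graph-based data using the networkx library; the people and relationships are invented:

```python
# A minimal sketch of graph-based data: nodes are entities, edges are relationships.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                  ("Alice", "Dave"), ("Dave", "Carol")])

# Relationship pattern detection: suggest new connections that share
# common neighbors, as a recommendation system might.
for person in G.nodes:
    for candidate in nx.non_neighbors(G, person):
        common = list(nx.common_neighbors(G, person, candidate))
        if common and person < candidate:
            print(f"{person} - {candidate} share neighbors: {common}")
```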
1.2.6 AUDIO, IMAGE, AND VIDEO DATA
Multimedia data types carrying sound and visual information.
Requires advanced data science techniques for recognition and analysis.
Includes challenges like large file sizes, diverse formats, and integration.
Example: Photos, recorded videos, audio files.
1.2.7 STREAMING DATA
Continuous, real-time data generated by thousands of sources simultaneously.
Examples: Social media feeds, financial transactions, IoT telemetry.
1.2.8 DIFFERENCE BETWEEN STRUCTURED AND UNSTRUCTURED DATA
Aspect | Structured Data | Unstructured Data
Format | Organized tables (rows & columns) | No fixed format or organization
Example | Database records, Excel sheets | Emails, videos, social media posts
Ease of Processing | Easily processed with SQL queries | Requires advanced tools like NLP, image processing
Usage | Transactional, operational data | Media files, natural language, sensor data
1.3 DATA SCIENCE PROCESS
Overview

1. Discovery or setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
1.4 DISCOVERY OR DEFINING RESEARCH GOAL
This step involves understanding the business problem, framing clear research goals, and identifying what is needed to answer the business question.
1.Learning Business Domain - Understand the industry/domain context of the problem
2.Resources - Technology & Tools, Systems & Infrastructure, Data, Team/People
3.Frame the Problem - Define a clear problem statement
4.Identify Stakeholders - List those impacted, involved, or benefited
5.Interview Analytics Sponsor - Sponsor = funding / high-level requirement provider
6.Develop Initial Hypotheses - Form testable hypotheses for investigation
7.Identify Data Sources - Check type, volume, and timespan of data
1.5 RETRIEVING DATA
This is the collection of the data required for the project. It is the process of gaining a business understanding of the data you have and deciphering what each piece of data means.
1.Internal Data – Start with company-owned data stored in repositories.
2.Data Repositories – Structured systems like warehouses, lakes, marts for
storage.
3.External Data – Acquire missing data from third-party, public, or
government sources.
4.Data Quality Checks – Clean, validate, and verify data to ensure reliability.
1.6 DATA PREPARATION
Data can have many inconsistencies, like missing values, blank columns, and incorrect data formats, which need to be cleaned. We need to process, explore, and condition data before modeling. Clean data gives better predictions.
1.Data Cleaning
2.Outlier Detection
3.Handling Missing Values
4.Handling Noisy Data
5.Correct Errors Early
6.Combine Data
7.Transform Data
1.6.1 DATA CLEANING

Remove errors, duplicates, and inconsistencies (see the pandas sketch below).
Tasks:
Fill in missing values
Standardize formats (date/time, text case)
Convert nominal → numeric values
Detect & smooth noisy data
Fix issues like typos, whitespace errors, capitalization mismatches.
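A minimal pandas sketch of these tasks, on invented data (the column names are illustrative, not from the course):

```python
import pandas as pd

df = pd.DataFrame({
    "name":   [" alice ", "BOB", "bob", None],
    "city":   ["NY", "ny", "LA", "LA"],
    "signup": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})

df["name"]   = df["name"].str.strip().str.title()   # fix whitespace & capitalization
df["city"]   = df["city"].str.upper()               # standardize text case
df["signup"] = pd.to_datetime(df["signup"])         # standardize date/time format
df["name"]   = df["name"].fillna("Unknown")         # fill in missing values
df = df.drop_duplicates()                           # remove exact duplicate rows
print(df)
```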
1.6.2 OUTLIER DETECTION
Outlier = data point that deviates significantly from the norm.
Detection: plots, min-max values, statistical models.
Applications: fraud detection, intrusion detection, medical & industrial uses.
Outliers may indicate errors or special patterns.
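As one concrete statistical detection rule, here is a hedged sketch using interquartile-range (IQR) fences; the data is invented:

```python
# A minimal sketch of outlier detection with the 1.5*IQR rule.
import numpy as np

values = np.array([100, 102, 98, 101, 99, 103, 250])   # 250 deviates from the norm

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # min-max fences

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # -> [250]
```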
1.6.3 HANDLING MISSING VALUES

Missing values reduce accuracy & reliability.
Approaches (a pandas sketch follows the list):
1. Ignore tuple (if multiple attributes missing)
2. Fill manually (small datasets only)
3. Use constant (e.g., "Unknown")
4. Replace with mean/median/mode
5. Use class-based averages
6. Predict most probable value
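A hedged pandas sketch of approaches 3–5 above, on invented data:

```python
import pandas as pd

df = pd.DataFrame({"dept": ["A", "A", "B", "B"],
                   "salary": [100.0, None, 90.0, None]})

df["filled_const"] = df["salary"].fillna(-1)                      # 3. use a constant
df["filled_mean"]  = df["salary"].fillna(df["salary"].mean())     # 4. replace with mean
df["filled_class"] = (df.groupby("dept")["salary"]
                        .transform(lambda s: s.fillna(s.mean())))  # 5. class-based averages
print(df)
```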
1.6.3.1 HANDLING NOISY DATA
Noisy data = errors or random variations in data.
Techniques:
1.Smoothing & filtering
2.Aggregating values
3.Binning (grouping into ranges)
4.Statistical methods to remove noise
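A minimal sketch of one such technique, binning followed by smoothing with bin means; the values are invented:

```python
import pandas as pd

ages = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Group values into ranges (bins), then smooth each value by its bin mean.
bins = pd.cut(ages, bins=[0, 10, 20, 30, 40])
smoothed = ages.groupby(bins, observed=True).transform("mean")
print(pd.DataFrame({"raw": ages, "bin": bins, "smoothed": smoothed}))
```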
1.6.4 CORRECT ERRORS EARLY

Correcting errors in later stages = costly & time-consuming.
Benefits of early cleanup:
1. Avoids repeated cleansing in multiple projects
2. Prevents wrong business decisions
3. Detects root causes (defective equipment, process gaps, bugs in ETL).
1.6.5 COMBINE DATA
1.Joining Tables: Combine related data (using keys).
2.Appending Tables: Stack datasets →larger combined set.
3.Using Views: Create virtual combinations to avoid duplication and
storage issues.
[Figures: joining tables, appending tables, and using views.]
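A minimal pandas sketch of joining and appending on invented tables (views are normally created in the database itself, e.g. with CREATE VIEW, so they are not shown here):

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2], "name": ["Alice", "Bob"]})
orders    = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [50, 30, 20]})

joined = customers.merge(orders, on="cust_id", how="inner")   # joining tables on a key
print(joined)

new_rows = pd.DataFrame({"cust_id": [3], "name": ["Carol"]})
appended = pd.concat([customers, new_rows], ignore_index=True)  # appending (stacking) tables
print(appended)
```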
1.6.6 TRANSFORM DATA

Convert data into formats ready for modeling.
Methods (a sketch follows below):
1. Variable Reduction → remove less useful variables, retain useful ones.
2. Normalization → scale values for comparisons.
3. Dummy Variables → convert categorical → binary (0/1).
4. Use Euclidean distance for similarity measures (works best with fewer variables).

EUCLIDEAN DISTANCE = √((x₁ − x₂)² + (y₁ − y₂)²)
DUMMY VARIABLES: 0 OR 1
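A minimal sketch of dummy variables, min-max normalization, and the Euclidean distance formula above, on invented data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "x": [1.0, 5.0, 3.0],
                   "y": [2.0, 6.0, 4.0]})

# Dummy variables: categorical -> binary (0/1) columns.
df = pd.get_dummies(df, columns=["color"], dtype=int)

# Normalization: min-max scaling so values are comparable.
for col in ["x", "y"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# Euclidean distance between the first two rows, as in the formula above.
p, q = df.loc[0, ["x", "y"]], df.loc[1, ["x", "y"]]
print(df, np.sqrt(((p - q) ** 2).sum()))
```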
1.7 EXPLORATORY DATA ANALYSIS ( OR ) DATA EXPLORATION
Data exploration is about building a deeper understanding of the data: how variables interact with each other, how the data is distributed, and whether there are outliers. This step is also called Exploratory Data Analysis.
EDA = Process of analyzing datasets using summary statistics &
visualizations.
Goals: Understand data, detect patterns, spot anomalies, and guide
modeling.
Techniques: Univariate, Bivariate, Multivariate analysis.
1.7.1 PURPOSE OF EDA
EDA helps data scientists to:
1.Maximize insight into the dataset
2.Uncover underlying structure
3.Extract key variables
4.Identify outliers & anomalies
5.Test assumptions
6.Develop simple (parsimonious) models
7.Optimize factor settings
1.7.2 KEY FUNCTIONS OF EDA
1.Describe data (basic stats, types)
2.Explore data distributions (shape, spread, skewness)
3.Check relationships between variables
4.Detect unusual patterns/outliers
5.Group data and compare across categories
6.Note differences & dependencies
1.7.3 EDA METHODS
Univariate Analysis →Focus on one variable (PDF, CDF,
boxplots).
Bivariate Analysis →Relationship between two variables
(scatterplots, violin plots, boxplots).
Multivariate Analysis →Explore interactions among 3+
variables (heatmaps, correlation plots, PCA).
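A hedged sketch of one technique from each level, using pandas and matplotlib on invented data:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"sqft":  [800, 950, 1100, 1400, 1600],
                   "rooms": [2, 2, 3, 4, 4],
                   "price": [100, 120, 150, 200, 230]})

print(df["price"].describe())          # univariate: summary statistics of one variable
df.plot.scatter(x="sqft", y="price")   # bivariate: relationship between two variables
print(df.corr())                       # multivariate: correlation matrix (heatmap input)
plt.show()
```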
1.7.4 BOXPLOTS IN EDA
Boxplots = visualize distribution, spread, and outliers.
Components:
1.Minimum value
2.Lower quartile (25%)
3.Median (50%)
4.Upper quartile (75%)
5.Maximum value
6.Whiskers (variability outside quartiles)
7.Interquartile range (middle 50%)
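A minimal matplotlib sketch; the salary values are invented, and the point far from the rest appears beyond the whiskers as an outlier:

```python
import matplotlib.pyplot as plt

salaries = [95, 100, 105, 105, 105, 120, 180]   # 180 will show up as an outlier

plt.boxplot(salaries)   # box = quartiles & median, whiskers = variability, dot = outlier
plt.ylabel("Salary (USD)")
plt.title("Boxplot: distribution, spread, and outliers")
plt.show()
```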
1.8 BUILD THE MODELS / DATA MODELING
In this step, the actual model building starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification, and clustering are applied to the training dataset. The model, once prepared, is tested against the testing dataset.
1.Model & Variable Selection
2.Model Execution
3.Model Diagnostics & Comparison
Process is iterative – refine until the best model is found.
1.8.1 MODEL & VARIABLE SELECTION
Choose model type based on project goals.
Factors to consider:
Is the model production-ready & easy to implement?
How long will it stay relevant (maintenance)?
Does it need to be easily explainable?
Select right variables/features for prediction.
1.8.2 MODEL EXECUTION
Implement using languages & libraries:
Python (Scikit-learn, StatsModels)
R, SQL (MADlib), Octave, WEKA
Commercial tools: SAS, SPSS Modeler, Matlab, Alpine Miner
Evaluate results:
Model fit – R² / Adjusted R²
Coefficients – interpretation for linear models
Predictor significance – test influence of variables
Use appropriate technique: Regression (predict values) or Classification (predict categories).
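Since StatsModels is among the listed libraries, here is a hedged sketch of checking model fit, coefficients, and predictor significance on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))                 # two predictors
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()          # linear model with intercept
print(model.rsquared, model.rsquared_adj)            # model fit: R^2 / adjusted R^2
print(model.params)                                  # coefficients for interpretation
print(model.pvalues)                                 # predictor significance
```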
1.8.3 MODEL DIAGNOSTICS & COMPARISON
Build multiple models and compare performance.
Use holdout method:
Split data into training (70–80%) & testing (20–30%).
Train on one set, test on the other.
Helps check generalization ability of model.
Drawbacks of Holdout:
1.Needs extra dataset.
2.Single split may give misleading error estimate.
EXAMPLE – HOUSE PRICE PREDICTION
Input variables: Square footage + Number of rooms
Target variable: House price
Workflow:
Train model on 20 rows
Test on 10 rows
Measure prediction accuracy
Shows how regression models are used in practice.
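A hedged scikit-learn sketch of this workflow; the real course dataset is not given, so the 30 rows below are generated synthetically:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
sqft  = rng.uniform(500, 2500, 30)                    # 30 rows total
rooms = rng.integers(1, 6, 30)
price = 100 * sqft + 5000 * rooms + rng.normal(0, 10000, 30)
X = np.column_stack([sqft, rooms])

# Holdout: train on 20 rows, test on 10 rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=10, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
print("R^2 on held-out rows:", model.score(X_te, y_te))   # measure prediction accuracy
```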
1.9 PRESENTATION AND AUTOMATION
In this stage, deliver the final baselined model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.
1.Deliverables: Reports, briefings, technical docs, code
2.Pilot projects →Production deployment
3.Skills needed: Communication + Business alignment
4.Goal: Integrate models into tools & workflows
1.10 DATA MINING
Data mining refers to extracting or mining knowledge from large amounts of data. It is the process of discovering interesting patterns or knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
WHY DATA MINING?
1.Knowledge Discovery – Find hidden patterns & correlations.
2.Data Visualization – Present data in meaningful ways.
3.Data Correction – Identify & fix errors/inconsistencies.
1.10.1 FUNCTIONS OF DATA MINING
1.Characterization – Summarize key features (e.g., student profiles).
2.Association – Discover rules/patterns (e.g., market basket analysis).
3.Classification – Build models to categorize data.
4.Prediction – Estimate unknown/future values.
5.Clustering – Group similar objects (taxonomy formation).
6.Evolution Analysis – Study patterns over time.
MINING TASK TYPES

1.10.2 Predictive Tasks:
Supervised learning →classification, regression, time-series.
Uses historical data to forecast outcomes.
Applications: credit scoring, customer behavior, demand forecasting.
1.10.3 Descriptive Tasks:
Summarize & visualize past data.
Techniques: aggregation, reporting, BI dashboards.
Helps understand trends & influence future actions.
1.10.4 ARCHITECTURE OF A DATA MINING SYSTEM
Data Sources – Databases, warehouses, web & files.
Data Warehouse Server – Fetches & integrates relevant data.
Knowledge Base – Stores prior knowledge, user beliefs, rules.
Mining Engine – Core component (classification, clustering, prediction).
Pattern Evaluation Module – Filters interesting/important patterns.
Graphical User Interface – User-friendly interaction & result display.
1.10.5 CLASSIFICATION OF DATA MINING SYSTEMS
[Diagram: data mining draws on database systems, statistics, machine learning, visualization, information science, and other disciplines.]
MULTI-DIMENSIONAL VIEW OF DATA MINING CLASSIFICATION

Parameter | Description
Classification Algorithms | Different algorithms used for classification such as Decision Trees, SVM, Neural Networks, k-NN, Bayesian.
Learning Paradigm | Supervised (uses labeled data) vs unsupervised (e.g., clustering, no labeled data).
Classification Type | Binary, multi-class, multi-label, hierarchical (nested categories).
Approach | Discriminative (boundary-based) vs generative (distribution estimation).
Data Types | Structured (tables), unstructured (text, images), time-series, spatial, etc.
Application Domain | Domains like finance, healthcare, marketing, image recognition with domain-specific features and models.
Tools & Platforms | Software examples: Python (Scikit-learn), R, WEKA, SAS, SPSS, MATLAB, etc.
Model Complexity | Simple (linear) to complex (deep learning, ensemble methods).
Evaluation Techniques | Metrics such as accuracy, precision, recall, F1-score, ROC, cross-validation, holdout method.
1.11 DATA WAREHOUSING
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
What is a Data Warehouse?
Subject-oriented, integrated, time-variant, non-volatile collection of data.
Stores historical data to support management decisions.
Separate from operational databases (focus on analysis vs transactions).
1.11.1 KEY CHARACTERISTICS OF DATA WAREHOUSES

Characteristic | Description
Subject-Oriented | Organized by major business subjects (e.g., sales).
Integrated | Consistent naming, formats, and value representations.
Non-volatile | Data usually read-only, no deletions or real-time updates.
Time-variant | Stores historical data over months or years.
Goals of Data Warehousing
Facilitate reporting and analysis.
Maintain historical information.
Support strategic decision-making.
How Organizations Use Data Warehousing
Improve customer focus via buying pattern analysis.
Optimize product portfolios and reposition products.
Analyze operations for profit optimization.
Manage customer relationships and control costs.
DATA WAREHOUSE VS DATABASE
Feature | Database | Data Warehouse
Purpose | Supports day-to-day operations (transactions). | Stores historical data for analysis & reporting.
Data Type | Current, real-time data. | Time-variant, historical data.
Update Frequency | Frequent updates, inserts, and deletes. | Mostly read-only, periodic batch loads.
Structure | Application-oriented, normalized. | Subject-oriented, denormalized for analysis.
1.11.2 MULTI-TIER ARCHITECTURE OF DWH
1.Single-tier: Minimize data redundancy; limited scalability.
2.Two-tier: Separation of resources & warehouse; small scale.
3.Three-tier (Multi-tier):
Bottom tier: Warehouse database server (stores cleansed data).
Middle tier: OLAP server (arranges data for analysis using
ROLAP or MOLAP).
Top tier: Front-end tools for querying, reporting, visualization,
and analytics.
OLAP servers can interact with both relational databases and multidimensional databases, which lets them collect data based on broader parameters.
1.11.3 NEED FOR A DATA WAREHOUSE
Business users need summarized, easy-to-understand data.
Store historical, time-variant data for trend analysis.
Support strategic decision-making.
Ensure data consistency across sources.
Provide high-speed query response for ad hoc demands.
1.11.4 BENEFITS OF DATA WAREHOUSING
Better understanding of business trends for forecasting.
Efficient handling of large data volumes with good performance.
Simplified data structure for end-user navigation & queries.
Easier to build and maintain complex queries compared to normalized databases.
Efficiently manage high demand for information.
1.11.5 DIFFERENCE BETWEEN ODS AND DATA WAREHOUSE
Parameter | Operational Data Store (ODS) | Data Warehouse (DW)
Purpose | Supports day-to-day operational processes; provides a current, up-to-date data view. | Supports decision-making and analysis with historical data.
Data Content | Current, volatile data; frequently updated. | Historical, consolidated, and summarized data.
Query Complexity | Simple and fast queries on recent data. | Complex, ad hoc queries over large datasets.
Data Volatility | Highly volatile; data continuously overwritten and updated. | Non-volatile; data is read-only and stable over time.
Data Schema | Holds data as per source schema, smaller scope, often normalized. | Uses denormalized schemas like star or snowflake for analysis.
Data Integration | Integrates data from multiple sources in near real-time. | Integrates and consolidates data from various sources after ETL.
User Types | Operational staff needing up-to-date transactional info. | Analysts, managers, executives requiring comprehensive insights.
Performance Focus | Optimized for high-speed transactional queries and updates. | Optimized for read-heavy complex analysis and reporting.
Data Volume | Smaller volume focused on current operations. | Large volume, stores years of data for historical analysis.
1.11.6 METADATA IN DATA WAREHOUSING

Metadata is "data about data," and is crucial for managing a data warehouse. It acts as a roadmap, defining warehouse objects and contents. It helps developers understand the structure and users recognize data meaning. It includes data definitions, time stamps, and source information.
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
Importance of Basic Statistical Descriptions
1.Provide an overall picture of the data.
2.Identify noise, outliers, and data characteristics.
3.Understand data distribution using central tendency & dispersion measures.
For data preprocessing to be successful, it is essential to have an overall picture of
our data. Basic statistical descriptions can be used to identify properties of the data
and highlight which data values should be treated as noise or outliers.
3 MAJOR STATISTICAL DESCRIPTIONS
1. Measures of central tendency: mean, median, mode, and midrange.
2. Measures of data dispersion: quartiles, interquartile range (IQR), and variance.
3. Graphic displays of basic statistical descriptions.
1.12.1 MEASURES OF CENTRAL TENDENCY
Measures of central tendency represent the center or typical value of a dataset, summarizing data
using a single value.
EXAMPLE: MONTHLY SALARIES (USD)
Month Salary
January 105
February 95
March 105
April 105
May 100
Mean = (105 + 95 + 105 + 105 + 100)/5 = 102
Median = middle value when ordered (95, 100, 105, 105, 105) = 105
Mode = most frequent value = 105
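A minimal sketch verifying these values with Python's statistics module (midrange, listed earlier, is computed by hand):

```python
import statistics

salaries = [105, 95, 105, 105, 100]

print(statistics.mean(salaries))             # 102
print(statistics.median(salaries))           # 105
print(statistics.mode(salaries))             # 105
print((min(salaries) + max(salaries)) / 2)   # midrange = (95 + 105) / 2 = 100
```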
1.12.2 MEASURES OF DISPERSION

Measures of dispersion describe how spread out the data is around its center: quartiles, interquartile range (IQR), variance, and standard deviation. Variance is the average squared deviation from the mean; standard deviation is its square root, expressed in the same units as the data (a sketch follows below).
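A minimal sketch of these measures on the same salary data, using the statistics module and numpy:

```python
import statistics
import numpy as np

salaries = [105, 95, 105, 105, 100]

print(statistics.variance(salaries))    # sample variance = 20
print(statistics.stdev(salaries))       # sample standard deviation = sqrt(variance)
q1, q3 = np.percentile(salaries, [25, 75])
print(q3 - q1)                          # interquartile range (IQR)
```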
1.12.3 Graphic Displays of Basic Statistical Descriptions
1.Scatter Plot: Relationship between two variables;
identify correlations.
2.Histogram: Data distribution grouped in bins;
frequency of ranges.
3.Line Graph: Shows trends over continuous variables or
time.
4.Pie Chart: Shows proportions of categories; easy for
comparisons but limited for close values.
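A minimal matplotlib sketch of all four displays on invented data, before the detailed notes on each:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0][0].scatter([1, 2, 3, 4], [2, 4, 5, 8]);                   ax[0][0].set_title("Scatter plot")
ax[0][1].hist([1, 2, 2, 3, 3, 3, 4, 5], bins=5);                ax[0][1].set_title("Histogram")
ax[1][0].plot(["Jan", "Feb", "Mar", "Apr"], [30, 45, 40, 60]);  ax[1][0].set_title("Line graph")
ax[1][1].pie([40, 30, 20, 10], labels=["A", "B", "C", "D"]);    ax[1][1].set_title("Pie chart")
plt.tight_layout()
plt.show()
```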
1. SCATTER DIAGRAM
Also called scatter plot or X-Y graph.
Shows relationship between two variables
(e.g., height vs mass).
One variable on X-axis, another on Y-axis.
Reveals patterns but not causation directly.
Can indicate common causes or surrogate
variables.
2. HISTOGRAM
Summarizes discrete or continuous data.
Data grouped into intervals called bins (e.g.,
10-19, 20-29).
Bars represent frequency/percentage in each
bin.
Width proportional to bin size; height
proportional to frequency.
3. LINE GRAPHS
Also called stick graphs.
Show relationships over continuous variables or
time (e.g., monthly rainfall).
Useful for identifying patterns, trends, seasonal
effects, turning points.
Multiple data series can be compared on the same
graph.
Simple to interpret and visualize changes
(peaks/valleys).
4. PIE CHARTS
Circle divided into sectors representing proportions of whole.
Easy to read "pie slices" showing relative sizes.
Common uses: Business product success, school time allocation,
home expenses.
Simple to understand which category is largest.
LIMITATIONS OF PIE CHARTS:
1.It is difficult to tell the difference between estimates of similar size; error bars or confidence limits cannot be shown on a pie graph, and legends and labels are hard to align and read.
2.The human visual system is more efficient at perceiving and discriminating between lines and line lengths rather than two-dimensional areas and angles.
3.Pie graphs simply don't work when comparing data.

CS 3352 - UNIT 1 PPT INTRODUCTION DRGUPCE

  • 1.
    CS 3351 FOUNDATIONS OFDATA SCIENCE PREPARED BY : MR. K. DANIEL RAJ B.TECH.,M.E.,MISTE., DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ( UG & PG )
  • 2.
    Data Science: Benefitsand uses – facets of data - Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation Exploratory Data analysis – build the model presenting findings and building applications - Data Mining - Data Warehousing – Basic Statistical descriptions of Data INTRODUCTION UNIT 1 CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
  • 3.
    1.1 DATA SCIENCE CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING WHAT IS DATA SCIENCE? Interdisciplinary field to extract knowledge and actionable insights from data. Combines math, statistics, programming, AI, and subject expertise. Predicts future trends using historical data and machine learning. Used for business decisions, research, and automation of intelligence tasks. LIFE CYCLE OF DATA SCIENCE 1.Capture: Data acquisition and extraction. 2.Maintain: Data warehousing, cleansing, and processing. 3.Process: Mining, clustering, classification, modeling. 4.Analyze: Reporting, visualization, BI, decision making. 5.Communicate: Predictive analysis, text mining, qualitative analysis. UNIT 1 : INTRODUCTION
  • 4.
    UNIT 1 :INTRODUCTION CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.1.1 BIG DATA Large volumes of complex data generated rapidly from varied sources. Cannot be processed effectively with traditional methods. Characterized by the "3 V's": Volume, Velocity, Variety. 1.1.2 Characteristics of Big Data Characteristic Description Volume Extremely large data amounts (terabytes to petabytes). Velocity Speed of data generation and processing (near real-time). Variety Different data types: structured, unstructured, streaming. Veracity Trustworthiness and credibility of data sources. Value Business value derived from data analysis.
  • 5.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.1.3 DIFFERENCE BETWEEN DATA SCIENCE AND BIG DATA Aspect Data Science Big Data Definition Extraction of insights and actionable knowledge from data. Management, storage, and processing of very large and complex datasets. Primary Focus Analyzing data using algorithms, statistics, machine learning. Handling high volume, velocity, and variety of data. Goal Create predictive models and support decision- making. Efficient data storage, retrieval, and basic data processing. Data Type Cleaned, processed, and prepared data for modeling. Raw, often unstructured and rapidly generated data. Techniques Used Statistical analysis, machine learning, data mining. Distributed computing (e.g., Hadoop, Spark), NoSQL databases. Tools Python, R, SAS, TensorFlow, Scikit-learn. Hadoop ecosystem, Spark, Cassandra, MongoDB, Apache Kafka. Outcome Actionable insights, forecasts, and data-driven decisions. Large-scale data repositories and infrastructure to support analytics. Skillsets Needed Data science, statistics, programming, domain expertise. Data engineering, distributed computing, database management. Applications Fraud detection, recommendation systems, predictive analytics. Social media analytics, IoT data handling, real- time data streams. UNIT 1 : INTRODUCTION 1.1 DATA SCIENCE
  • 6.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.1.4 COMPARISON BETWEEN CLOUD COMPUTING AND BIG DATA Parameter Big Data Cloud Computing Definition Refers to very large, complex, and rapidly growing data sets that traditional tools can't handle. On-demand availability of computing resources (servers, storage, software) over the internet. Focus Managing, storing, and analyzing massive volumes of structured, semi-structured, and unstructured data. Providing scalable IT infrastructure and services remotely to users via the internet. Core Characteristics Volume, Velocity, Variety, Veracity, Value (the 5 V’s). On-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. Purpose Extract useful information and insights to improve business decisions. Provide flexible, scalable, and cost-effective computing resources without physical infrastructure setup. Data Handling Requires distributed computing platforms (e.g., Hadoop, Spark) to process large datasets efficiently. Uses virtualized resources hosted on cloud vendors (AWS, Azure, Google Cloud). Services Offered Data storage, data mining, analysis, visualization. Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS). Usage Example Analyzing social media data, stock market trends, sensor data. Hosting apps, databases, backup solutions, real-time collaboration tools. Challenges Data integration, storage complexity, ensuring data quality, and real-time processing. Vendor lock-in, security and privacy concerns, downtime risks, dependency on internet connectivity. UNIT 1 : INTRODUCTION 1.1 DATA SCIENCE
  • 7.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING DATA SCIENCE VS BIG DATA Data Science: Focus on extracting insights, predictions, and business intelligence from data. Big Data: Refers to very large datasets that require specialized storage and processing technologies. 1.1.5 Benefits and Uses of Data Science Anomaly detection (fraud, disease). Classification (email sorting, background checks). Forecasting (sales, customer retention). Pattern detection (weather, financial markets). Recognition (face, voice, text). Recommendation systems. Regression and optimization tasks. UNIT 1 : INTRODUCTION 1.1 DATA SCIENCE
  • 8.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING DATA SCIENCE VS BIG DATA Data Science: Focus on extracting insights, predictions, and business intelligence from data. Big Data: Refers to very large datasets that require specialized storage and processing technologies. 1.1.6 Benefits and Uses of Big Data Improved customer service and decision-making. Reduced maintenance costs and operational efficiency. Early risk identification and product marketing optimization. Examples: Social media data, stock exchange trades,compliance reports. UNIT 1 : INTRODUCTION 1.1 DATA SCIENCE
  • 9.
    Very large amountof data will generate in big data and data science. These data is various types and main categories of data are as follows: a) Structured b) Natural language c) Graph-based d) Streaming e) Unstructured f) Machine-generated g) Audio, video and images CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.2 FACETS OF DATA UNIT 1 : INTRODUCTION
  • 10.
    1.2.1 STRUCTURED DATA Dataorganized in rows and columns, making it easy to store, search, and process. Stored using database management systems. Example: Excel tables, relational databases. 1.2.2 UNSTRUCTURED DATA Data without a predefined format or structure. Includes text, audio, video, images, and documents. More than 80% of organizational data is unstructured. Example: Email messages, customer feedback, photos. CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.2 FACETS OF DATA UNIT 1 : INTRODUCTION
  • 11.
    1.2.3 NATURAL LANGUAGE Asubset of unstructured data consisting of human language text or speech. Processed using Natural Language Processing (NLP) for recognizing, understanding, and generating language. Applications: Sentiment analysis, speech recognition, machine translation. 1.2.4 MACHINE-GENERATED DATA Data created automatically by systems or devices without human input. Includes logs, telemetry, API data, and sensor outputs. Can be structured or unstructured. Example: Web server logs, network event logs. CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.2 FACETS OF DATA UNIT 1 : INTRODUCTION
  • 12.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.2 FACETS OF DATA 1.2.5 GRAPH-BASED DATA Data representing entities (nodes) and their relationships (edges). Used in social networks, fraud detection, and recommendation systems. Stored in graph databases and queried via specialized languages like SPARQL. Enables relationship pattern detection and complex network analysis. 1.2.6 AUDIO, IMAGE, AND VIDEO DATA Multimedia data types carrying sound and visual information. Requires advanced data science techniques for recognition and analysis. Includes challenges like large file sizes, diverse formats, and integration. Example: Photos, recorded videos, audio files. 1.2.7 STREAMING DATA Continuous, real-time data generated by thousands of sources simultaneously. Examples: Social media feeds, financial transactions, IoT telemetry. UNIT 1 : INTRODUCTION
  • 13.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.2 FACETS OF DATA UNIT 1 : INTRODUCTION
  • 14.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.2.8 DIFFERENCE BETWEEN STRUCTURED AND UNSTRUCTURED DATA Aspect Structured Data Unstructured Data Format Organized tables (rows & columns) No fixed format or organization Example Database records, Excel sheets Emails, videos, social media posts Ease of Processing Easily processed with SQL queries Requires advanced tools like NLP, image processing Usage Transactional, operational data Media files, natural language, sensor data UNIT 1 : INTRODUCTION 1.2 FACETS OF DATA
  • 15.
    1.3 DATA SCIENCEPROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING Discovery or Setting the research goal Retrieving data Data preparation Data exploration Data modeling Presentation and automation 4 5 6 1 2 3 Overview UNIT 1 : INTRODUCTION
  • 16.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.4 DISCOVERY OR DEFINING RESEARCH GOAL This step involves acquiring data from all the identified internal and external sources, which helps to answer the business question. 1.Learning Business Domain - Understand the industry/domain context of the problem 2.Resources - Technology & Tools,Systems & Infrastructure,Data ,Team/People 3.Frame the Problem - Define a clear problem statement 4.Identify Stakeholders - List those impacted, involved, or benefited 5.Interview Analytics Sponsor - Sponsor = funding / high-level requirement provider 6.Develop Initial Hypotheses - Form testable hypotheses for investigation 7.Identify Data Sources - Check type, volume, and timespan of data UNIT 1 : INTRODUCTION
  • 17.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.5 RETRIEVING DATA It collection of data which required for project. This is the process of gaining a business understanding of the data user have and deciphering what each piece of data means. 1.Internal Data – Start with company-owned data stored in repositories. 2.Data Repositories – Structured systems like warehouses, lakes, marts for storage. 3.External Data – Acquire missing data from third-party, public, or government sources. 4.Data Quality Checks – Clean, validate, and verify data to ensure reliability. UNIT 1 : INTRODUCTION
  • 18.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION Data can have many inconsistencies like missing values, blank columns, an incorrect data format, which needs to be cleaned. We need to process, explore and condition data before modeling. The cleandata, gives the better predictions. 1.Data Cleaning 2.Outlier Detection 3.Handling Missing Values 4.Handling Noisy Data 5.Correct Errors Early 6.Combine Data 7.Transform Data UNIT 1 : INTRODUCTION
  • 19.
    Remove errors, duplicates,and inconsistencies. Tasks: Fill in missing values Standardize formats (date/time, text case) Convert nominal →numeric values Detect & smooth noisy data Fix issues like typos, whitespace errors, capitalization mismatches. DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION 1.6.1 DATA CLEANING UNIT 1 : INTRODUCTION
  • 20.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION 1.6.2 OUTLIER DETECTION Outlier = data point that deviates significantly from the norm. Detection: plots, min-max values, statistical models. Applications: fraud detection, intrusion detection, medical & industrial uses. Outliers may indicate errors or special patterns. UNIT 1 : INTRODUCTION
  • 21.
    Missing values reduceaccuracy & reliability. Approaches: 1.Ignore tuple (if multiple attributes missing) 2.Fill manually (small datasets only) 3.Use constant (e.g., “Unknown”) 4.Replace with mean/median/mode 5.Use class-based averages 6.Predict most probable value DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION 1.6.3 HANDLING MISSING VALUES UNIT 1 : INTRODUCTION
  • 22.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION 1.6.3.1 HANDLING NOISY DATA Noisy data = errors or random variations in data. Techniques: 1.Smoothing & filtering 2.Aggregating values 3.Binning (grouping into ranges) 4.Statistical methods to remove noise UNIT 1 : INTRODUCTION
  • 23.
    Correcting errors inlater stages = costly & time-consuming. Benefits of early cleanup: 1.Avoids repeated cleansing in multiple projects 2.Prevents wrong business decisions 3.Detects root causes (defective equipment, process gaps, bugs in ETL). DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION 1.6.4 CORRECT ERRORS EARLY UNIT 1 : INTRODUCTION
  • 24.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION 1.6.5 COMBINE DATA 1.Joining Tables: Combine related data (using keys). 2.Appending Tables: Stack datasets →larger combined set. 3.Using Views: Create virtual combinations to avoid duplication and storage issues. JOINING TABLES APPENDING TABLES USING VIEWS UNIT 1 : INTRODUCTION
  • 25.
    Convert data intoformats ready for modeling. Methods: 1.Variable Reduction →remove less useful variables, retain useful ones. 2.Normalization →scale values for comparisons. 3.Dummy Variables →convert categorical →binary (0/1). 4.Use Euclidean distance for similarity measures (works best with fewer variables). DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.6 DATA PREPARATION 1.6.6 TRANSFORM DATA EUCLIDEAN DISTANCE = √(X -X ) + (Y -Y ) 1 2 2 1 2 2 DUMMY VARIABLES : 0 OR 1 S S UNIT 1 : INTRODUCTION
  • 26.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.7 EXPLORATORY DATA ANALYSIS ( OR ) DATA EXPLORATION Data exploration is related to deeper understanding of data. Try to understand how variables interact with each other, the distribution of the data and whether there are outliers. This steps is also called as Exploratory Data Analysis. EDA = Process of analyzing datasets using summary statistics & visualizations. Goals: Understand data, detect patterns, spot anomalies, and guide modeling. Techniques: Univariate, Bivariate, Multivariate analysis. UNIT 1 : INTRODUCTION
  • 27.
    1.7.1 PURPOSE OFEDA EDA helps data scientists to: 1.Maximize insight into the dataset 2.Uncover underlying structure 3.Extract key variables 4.Identify outliers & anomalies 5.Test assumptions 6.Develop simple (parsimonious) models 7.Optimize factor settings DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.7 EXPLORATORY DATA ANALYSIS ( OR ) DATA EXPLORATION UNIT 1 : INTRODUCTION
  • 28.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.7.2 KEY FUNCTIONS OF EDA 1.Describe data (basic stats, types) 2.Explore data distributions (shape, spread, skewness) 3.Check relationships between variables 4.Detect unusual patterns/outliers 5.Group data and compare across categories 6.Note differences & dependencies UNIT 1 : INTRODUCTION 1.7 EXPLORATORY DATA ANALYSIS ( OR ) DATA EXPLORATION
  • 29.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.7.3 EDA METHODS Univariate Analysis →Focus on one variable (PDF, CDF, boxplots). Bivariate Analysis →Relationship between two variables (scatterplots, violin plots, boxplots). Multivariate Analysis →Explore interactions among 3+ variables (heatmaps, correlation plots, PCA). UNIT 1 : INTRODUCTION 1.7 EXPLORATORY DATA ANALYSIS ( OR ) DATA EXPLORATION
  • 30.
    1.7.4 BOXPLOTS INEDA Boxplots = visualize distribution, spread, and outliers. Components: 1.Minimum value 2.Lower quartile (25%) 3.Median (50%) 4.Upper quartile (75%) 5.Maximum value 6.Whiskers (variability outside quartiles) 7.Interquartile range (middle 50%) DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING UNIT 1 : INTRODUCTION 1.7 EXPLORATORY DATA ANALYSIS ( OR ) DATA EXPLORATION
  • 31.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING UNIT 1 : INTRODUCTION 1.7 EXPLORATORY DATA ANALYSIS ( OR ) DATA EXPLORATION
  • 32.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.8 BUILD THE MODELS / DATA MODELING In this step, the actual model building process starts. Here, Data scientist distributes datasets for training and testing. Techniques like association, classification and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset. 1.Model & Variable Selection 2.Model Execution 3.Model Diagnostics & Comparison 8Process is iterative – refine until best model is found. UNIT 1 : INTRODUCTION
  • 33.
    1.8.1 MODEL &VARIABLE SELECTION Choose model type based on project goals. Factors to consider: Is the model production-ready & easy to implement? How long will it stay relevant (maintenance)? Does it need to be easily explainable? Select right variables/features for prediction. DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.8 BUILD THE MODELS / DATA MODELING UNIT 1 : INTRODUCTION
  • 34.
    1.8.2 MODEL EXECUTION Implementusing languages & libraries: Python (Scikit-learn, StatsModels) R, SQL (MADlib), Octave, WEKA Commercial tools: SAS, SPSS Modeler, Matlab, Alpine Miner Evaluate results: Model fit – R² / Adjusted R² Coefficients – interpretation for linear models Predictor significance – test influence of variables Use appropriate technique: Regression (predict values) or Classification (predict categories). DATA SCIENCE PROCESS CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING UNIT 1 : INTRODUCTION 1.8 BUILD THE MODELS / DATA MODELING
  • 35.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.8.3 MODEL DIAGNOSTICS & COMPARISON Build multiple models and compare performance. Use holdout method: Split data into training (70–80%) & testing (20–30%). Train on one set, test on the other. Helps check generalization ability of model. Drawbacks of Holdout: 1.Needs extra dataset. 2.Single split may give misleading error estimate. UNIT 1 : INTRODUCTION 1.8 BUILD THE MODELS / DATA MODELING
  • 36.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING EXAMPLE – HOUSE PRICE PREDICTION Input variables: Square footage + Number of rooms Target variable: House price Workflow: Train model on 20 rows Test on 10 rows Measure prediction accuracy Shows how regression models are used in practice. UNIT 1 : INTRODUCTION 1.8 BUILD THE MODELS / DATA MODELING
  • 37.
    DATA SCIENCE PROCESS CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.9 PRESENTATION AND AUTOMATION Deliver the final baselined model with reports, code and technical documents in this stage. Model is deployed into a real-time production environment after thorough testing. 1.Deliverables: Reports, briefings, technical docs, code 2.Pilot projects →Production deployment 3.Skills needed: Communication + Business alignment 4.Goal: Integrate models into tools & workflows UNIT 1 : INTRODUCTION
  • 38.
    1.10 DATA MINING CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING Data mining refers to extracting or mining knowledge from large amounts of data. It is a process of discovering interesting patterns or Knowledge from a large amount of data stored either in databases, data warehouses or other information repositories. WHY DATA MINING? 1.Knowledge Discovery – Find hidden patterns & correlations. 2.Data Visualization – Present data in meaningful ways. 3.Data Correction – Identify & fix errors/inconsistencies. 1.10.1 FUNCTIONS OF DATA MINING 1.Characterization – Summarize key features (e.g., student profiles). 2.Association – Discover rules/patterns (e.g., market basket analysis). 3.Classification – Build models to categorize data. 4.Prediction – Estimate unknown/future values. 5.Clustering – Group similar objects (taxonomy formation). 6.Evolution Analysis – Study patterns over time. UNIT 1 : INTRODUCTION
  • 39.
    1.10.2 Predictive Tasks: Supervisedlearning →classification, regression, time-series. Uses historical data to forecast outcomes. Applications: credit scoring, customer behavior, demand forecasting. 1.10.3 Descriptive Tasks: Summarize & visualize past data. Techniques: aggregation, reporting, BI dashboards. Helps understand trends & influence future actions. 1.10 DATA MINING CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING MINING TASK TYPES UNIT 1 : INTRODUCTION
  • 40.
    1.10 DATA MINING CS3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING MINING TASK TYPES UNIT 1 : INTRODUCTION
  • 41.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING 1.10.4 ARCHITECTURE OF A DATA MINING SYSTEM UNIT 1 : INTRODUCTION 1.10 DATA MINING
  • 42.
    CS 3351 FOUNDATIONSOF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING Data Sources – Databases, warehouses, web & files. Data Warehouse Server – Fetches & integrates relevant data. Knowledge Base – Stores prior knowledge, user beliefs, rules. Mining Engine – Core component (classification, clustering, prediction). Pattern Evaluation Module – Filters interesting/important patterns. Graphical User Interface – User-friendly interaction & result display. UNIT 1 : INTRODUCTION 1.10.4 ARCHITECTURE OF A DATA MINING SYSTEM 1.10 DATA MINING
1.10 DATA MINING CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.10.5 CLASSIFICATION OF DATA MINING SYSTEMS
[Figure: data mining at the confluence of statistics, machine learning, database technology, information science, visualization, and other disciplines]
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.10 DATA MINING
MULTI-DIMENSIONAL VIEW OF DATA MINING CLASSIFICATION
Parameter Description
Classification Algorithms Different algorithms used for classification, such as Decision Trees, SVM, Neural Networks, k-NN, Bayesian classifiers.
Learning Paradigm Supervised (uses labeled data) vs unsupervised (e.g., clustering, no labeled data).
Classification Type Binary, multi-class, multi-label, hierarchical (nested categories).
Approach Discriminative (boundary-based) vs generative (distribution estimation).
Data Types Structured (tables), unstructured (text, images), time-series, spatial, etc.
Application Domain Domains such as finance, healthcare, marketing, and image recognition, with domain-specific features and models.
Tools & Platforms Software examples: Python (Scikit-learn), R, WEKA, SAS, SPSS, MATLAB, etc.
Model Complexity Simple (linear) to complex (deep learning, ensemble methods).
Evaluation Techniques Metrics such as accuracy, precision, recall, F1-score, ROC, cross-validation, holdout method.
A short metric-computation sketch follows this table.
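To ground the evaluation row, here is a short Python sketch computing the listed metrics with scikit-learn; the true and predicted labels are invented for illustration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (invented)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # a model's predictions (invented)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))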
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is built by integrating data from multiple heterogeneous sources to support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
What is a Data Warehouse?
A subject-oriented, integrated, time-variant, non-volatile collection of data.
Stores historical data to support management decisions.
Kept separate from operational databases (focus on analysis vs transactions).
Its key characteristics are summarized in 1.11.1 below.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
1.11.1 KEY CHARACTERISTICS OF DATA WAREHOUSES
Characteristic Description
Subject-Oriented Organized by major business subjects (e.g., sales).
Integrated Consistent naming, formats, and value representations.
Non-volatile Data is usually read-only, with no deletions or real-time updates.
Time-variant Stores historical data over months or years.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
Goals of Data Warehousing
Facilitate reporting and analysis.
Maintain historical information.
Support strategic decision-making.
How Organizations Use Data Warehousing
Improve customer focus via buying-pattern analysis.
Optimize product portfolios and reposition products.
Analyze operations for profit optimization.
Manage customer relationships and control costs.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
DATA WAREHOUSE VS DATABASE
Feature Database Data Warehouse
Purpose Supports day-to-day operations (transactions). Stores historical data for analysis and reporting.
Data Type Current, real-time data. Time-variant, historical data.
Update Frequency Frequent updates, inserts, and deletes. Mostly read-only, periodic batch loads.
Structure Application-oriented, normalized for transactions. Subject-oriented, denormalized for analysis.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
1.11.2 MULTI-TIER ARCHITECTURE OF A DATA WAREHOUSE
1.Single-tier: minimizes data redundancy; limited scalability.
2.Two-tier: separates the source systems from the warehouse; suits small-scale deployments.
3.Three-tier (multi-tier):
Bottom tier: warehouse database server (stores cleansed data).
Middle tier: OLAP server (arranges data for analysis using ROLAP or MOLAP).
Top tier: front-end tools for querying, reporting, visualization, and analytics.
OLAP servers can interact with both relational and multidimensional databases, which lets them collect data based on broader parameters.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
[Figure: multi-tier data warehouse architecture]
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
1.11.3 NEEDS OF A DATA WAREHOUSE
Business users need summarized, easy-to-understand data.
Store historical, time-variant data for trend analysis.
Support strategic decision-making.
Ensure data consistency across sources.
Provide high-speed query response for ad hoc demands.
1.11.4 BENEFITS OF DATA WAREHOUSING
Better understanding of business trends for forecasting.
Efficient handling of large data volumes with good performance.
Simplified data structure for end-user navigation and queries.
Complex queries are easier to build and maintain than against normalized databases.
Efficiently manages high demand for information.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
1.11.5 DIFFERENCE BETWEEN ODS AND DATA WAREHOUSE
Parameter Operational Data Store (ODS) Data Warehouse (DW)
Purpose Supports day-to-day operational processes; provides a current, up-to-date view of data. Supports decision-making and analysis with historical data.
Data Content Current, volatile data; frequently updated. Historical, consolidated, and summarized data.
Query Complexity Simple, fast queries on recent data. Complex, ad hoc queries over large datasets.
Data Volatility Highly volatile; data is continuously overwritten and updated. Non-volatile; data is read-only and stable over time.
Data Schema Holds data as per the source schema; smaller scope, often normalized. Uses denormalized schemas such as star or snowflake for analysis.
Data Integration Integrates data from multiple sources in near real time. Integrates and consolidates data from various sources after ETL.
User Types Operational staff needing up-to-date transactional information. Analysts, managers, and executives requiring comprehensive insights.
Performance Focus Optimized for high-speed transactional queries and updates. Optimized for read-heavy, complex analysis and reporting.
Data Volume Smaller volume, focused on current operations. Large volume; stores years of data for historical analysis.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.11 DATA WAREHOUSING
1.11.6 METADATA IN DATA WAREHOUSING
Metadata is "data about data" and is crucial for managing a data warehouse.
Acts as a roadmap, defining warehouse objects and their contents.
Helps developers understand the structure and users recognize what the data means.
Includes data definitions, timestamps, and source information.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
For data preprocessing to be successful, it is essential to have an overall picture of our data. Basic statistical descriptions can be used to identify properties of the data and to highlight which data values should be treated as noise or outliers.
Importance of Basic Statistical Descriptions
1.Provide an overall picture of the data.
2.Identify noise, outliers, and other data characteristics.
3.Understand the data distribution using measures of central tendency and dispersion.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
3 MAJOR STATISTICAL DESCRIPTIONS
1.12.1 Measures of central tendency: mean, median, mode, and midrange.
1.12.2 Measures of data dispersion: quartiles, interquartile range (IQR), and variance.
1.12.3 Graphic displays of basic statistical descriptions.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.1 MEASURES OF CENTRAL TENDENCY
Measures of central tendency represent the center, or typical value, of a dataset, summarizing the data with a single value.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.1 MEASURES OF CENTRAL TENDENCY
EXAMPLE: MONTHLY SALARIES (USD)
Month Salary
January 105
February 95
March 105
April 105
May 100
Mean = (105 + 95 + 105 + 105 + 100) / 5 = 102
Median = middle value when ordered (95, 100, 105, 105, 105) = 105
Mode = most frequent value = 105
The same figures can be verified in Python, as sketched below.
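A quick check of these three values (plus the midrange mentioned in 1.12.1) using Python's standard statistics module:

import statistics

salaries = [105, 95, 105, 105, 100]   # Jan–May from the slide

print("mean   :", statistics.mean(salaries))    # 102
print("median :", statistics.median(salaries))  # 105
print("mode   :", statistics.mode(salaries))    # 105
midrange = (min(salaries) + max(salaries)) / 2  # (95 + 105) / 2 = 100
print("midrange:", midrange)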
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.2 MEASURES OF DISPERSION
Measures of dispersion describe how spread out the data values are; they include quartiles, the interquartile range (IQR), and variance. A short computational sketch follows.
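A small NumPy sketch of these dispersion measures, reusing the salary data from the previous example; using ddof=0 treats the five salaries as a whole population rather than a sample.

import numpy as np

salaries = np.array([105, 95, 105, 105, 100])

# Quartiles and interquartile range.
q1, q3 = np.percentile(salaries, [25, 75])
print("Q1:", q1, "Q3:", q3, "IQR:", q3 - q1)

# Population variance and standard deviation.
print("variance:", salaries.var(ddof=0))
print("std dev :", salaries.std(ddof=0))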
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
DIFFERENCE BETWEEN STANDARD DEVIATION AND VARIANCE
Variance is the average of the squared deviations from the mean; standard deviation is its square root, so it is expressed in the same units as the data.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.3 GRAPHIC DISPLAYS OF BASIC STATISTICAL DESCRIPTIONS
1.Scatter plot: shows the relationship between two variables; used to identify correlations.
2.Histogram: shows the data distribution grouped into bins; frequency of value ranges.
3.Line graph: shows trends over continuous variables or time.
4.Pie chart: shows proportions of categories; easy for comparisons but limited when values are close.
A combined plotting sketch follows this list.
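A compact matplotlib sketch drawing all four displays on one figure; every dataset in it is synthetic and chosen only to make each chart type visible.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, ax = plt.subplots(2, 2, figsize=(8, 6))

# Scatter plot: relationship between two (here, random) variables.
ax[0, 0].scatter(rng.uniform(0, 10, 30), rng.uniform(0, 10, 30))
ax[0, 0].set_title("Scatter plot")

# Histogram: 200 values grouped into 10 bins.
ax[0, 1].hist(rng.normal(50, 10, 200), bins=10)
ax[0, 1].set_title("Histogram")

# Line graph: a value tracked over 12 time steps.
ax[1, 0].plot(range(12), rng.integers(10, 30, 12))
ax[1, 0].set_title("Line graph")

# Pie chart: proportions of four categories.
ax[1, 1].pie([40, 30, 20, 10], labels=["A", "B", "C", "D"])
ax[1, 1].set_title("Pie chart")

plt.tight_layout()
plt.show()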
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.3 Graphic Displays of Basic Statistical Descriptions
1. SCATTER DIAGRAM
Also called a scatter plot or X-Y graph.
Shows the relationship between two variables (e.g., height vs mass).
One variable is placed on the X-axis, the other on the Y-axis.
Reveals patterns but does not directly establish causation.
Can indicate common causes or surrogate variables.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.3 Graphic Displays of Basic Statistical Descriptions
2. HISTOGRAM
Summarizes discrete or continuous data.
Data is grouped into intervals called bins (e.g., 10–19, 20–29).
Bars represent the frequency or percentage of values in each bin.
Bar width is proportional to bin size; bar height is proportional to frequency.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.3 Graphic Displays of Basic Statistical Descriptions
3. LINE GRAPHS
Also called stick graphs.
Show relationships over continuous variables or time (e.g., monthly rainfall).
Useful for identifying patterns, trends, seasonal effects, and turning points.
Multiple data series can be compared on the same graph.
Simple to interpret; changes (peaks and valleys) are easy to visualize.
CS 3351 FOUNDATIONS OF DATA SCIENCE DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
1.12.3 Graphic Displays of Basic Statistical Descriptions
4. PIE CHARTS
A circle divided into sectors representing proportions of a whole.
Easy-to-read "pie slices" show relative sizes.
Common uses: business product success, school time allocation, home expenses.
Simple to see which category is largest.
LIMITATIONS OF PIE CHARTS:
1.It is difficult to tell the difference between estimates of similar size; error bars or confidence limits cannot be shown, and legends and labels are hard to align and read.
2.The human visual system is more efficient at perceiving and discriminating between lines and line lengths than between two-dimensional areas and angles.
3.Pie charts simply don't work when comparing data.