CS 3351
FOUNDATIONS OF DATA SCIENCE
PREPARED BY : MR. K. DANIEL RAJ B.TECH., M.E., MISTE.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING ( UG & PG )
DR.G.U.POPE COLLEGE OF ENGINEERING
UNIT 1 : INTRODUCTION

Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research goals – Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting findings and building applications – Data Mining – Data Warehousing – Basic statistical descriptions of data.
1.1 DATA SCIENCE
WHAT IS DATA SCIENCE?
Interdisciplinary field to extract knowledge and actionable insights from data.
Combines math, statistics, programming, AI, and subject expertise.
Predicts future trends using historical data and machine learning.
Used for business decisions, research, and automation of intelligence tasks.
LIFE CYCLE OF DATA SCIENCE
1. Capture: Data acquisition and extraction.
2. Maintain: Data warehousing, cleansing, and processing.
3. Process: Mining, clustering, classification, modeling.
4. Analyze: Predictive analysis, text mining, qualitative analysis.
5. Communicate: Reporting, visualization, BI, decision making.
1.1.1 BIG DATA
Large volumes of complex data generated rapidly from varied sources.
Cannot be processed effectively with traditional methods.
Characterized by the "3 V's" – Volume, Velocity, Variety – often extended to 5 V's with Veracity and Value.
1.1.2 CHARACTERISTICS OF BIG DATA
Volume: Extremely large data amounts (terabytes to petabytes).
Velocity: Speed of data generation and processing (near real-time).
Variety: Different data types: structured, unstructured, streaming.
Veracity: Trustworthiness and credibility of data sources.
Value: Business value derived from data analysis.
1.1.3 DIFFERENCE BETWEEN DATA SCIENCE AND BIG DATA
Definition: Data Science – extraction of insights and actionable knowledge from data. Big Data – management, storage, and processing of very large and complex datasets.
Primary Focus: Data Science – analyzing data using algorithms, statistics, machine learning. Big Data – handling high volume, velocity, and variety of data.
Goal: Data Science – create predictive models and support decision-making. Big Data – efficient data storage, retrieval, and basic data processing.
Data Type: Data Science – cleaned, processed, and prepared data for modeling. Big Data – raw, often unstructured and rapidly generated data.
Techniques Used: Data Science – statistical analysis, machine learning, data mining. Big Data – distributed computing (e.g., Hadoop, Spark), NoSQL databases.
Tools: Data Science – Python, R, SAS, TensorFlow, Scikit-learn. Big Data – Hadoop ecosystem, Spark, Cassandra, MongoDB, Apache Kafka.
Outcome: Data Science – actionable insights, forecasts, and data-driven decisions. Big Data – large-scale data repositories and infrastructure to support analytics.
Skillsets Needed: Data Science – statistics, programming, domain expertise. Big Data – data engineering, distributed computing, database management.
Applications: Data Science – fraud detection, recommendation systems, predictive analytics. Big Data – social media analytics, IoT data handling, real-time data streams.
1.1.4 COMPARISON BETWEEN CLOUD COMPUTING AND BIG DATA
Definition: Big Data – very large, complex, and rapidly growing data sets that traditional tools can't handle. Cloud Computing – on-demand availability of computing resources (servers, storage, software) over the internet.
Focus: Big Data – managing, storing, and analyzing massive volumes of structured, semi-structured, and unstructured data. Cloud Computing – providing scalable IT infrastructure and services remotely to users via the internet.
Core Characteristics: Big Data – Volume, Velocity, Variety, Veracity, Value (the 5 V's). Cloud Computing – on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
Purpose: Big Data – extract useful information and insights to improve business decisions. Cloud Computing – provide flexible, scalable, and cost-effective computing resources without physical infrastructure setup.
Data Handling: Big Data – requires distributed computing platforms (e.g., Hadoop, Spark) to process large datasets efficiently. Cloud Computing – uses virtualized resources hosted by cloud vendors (AWS, Azure, Google Cloud).
Services Offered: Big Data – data storage, data mining, analysis, visualization. Cloud Computing – Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS).
Usage Example: Big Data – analyzing social media data, stock market trends, sensor data. Cloud Computing – hosting apps, databases, backup solutions, real-time collaboration tools.
Challenges: Big Data – data integration, storage complexity, ensuring data quality, and real-time processing. Cloud Computing – vendor lock-in, security and privacy concerns, downtime risks, dependency on internet connectivity.
DATA SCIENCE VS BIG DATA
Data Science: Focus on extracting insights, predictions, and business intelligence from data.
Big Data: Refers to very large datasets that require specialized storage and processing technologies.
1.1.5 BENEFITS AND USES OF DATA SCIENCE
Anomaly detection (fraud, disease).
Classification (email sorting, background checks).
Forecasting (sales, customer retention).
Pattern detection (weather, financial markets).
Recognition (face, voice, text).
Recommendation systems.
Regression and optimization tasks.
1.1.6 BENEFITS AND USES OF BIG DATA
Improved customer service and decision-making.
Reduced maintenance costs and operational efficiency.
Early risk identification and product marketing optimization.
Examples: Social media data, stock exchange trades, compliance reports.
1.2 FACETS OF DATA
Big data and data science generate very large amounts of data. These data come in various types; the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
1.2.1 STRUCTURED DATA
Data organized in rows and columns, making it easy to store, search, and process.
Stored using database management systems.
Example: Excel tables, relational databases.
1.2.2 UNSTRUCTURED DATA
Data without a predefined format or structure.
Includes text, audio, video, images, and documents.
More than 80% of organizational data is unstructured.
Example: Email messages, customer feedback, photos.
1.2.3 NATURAL LANGUAGE
A subset of unstructured data consisting of human language text or speech.
Processed using Natural Language Processing (NLP) for recognizing, understanding, and generating language.
Applications: Sentiment analysis, speech recognition, machine translation.
1.2.4 MACHINE-GENERATED DATA
Data created automatically by systems or devices without human input.
Includes logs, telemetry, API data, and sensor outputs.
Can be structured or unstructured.
Example: Web server logs, network event logs.
1.2.5 GRAPH-BASED DATA
Data representing entities (nodes) and their relationships (edges).
Used in social networks, fraud detection, and recommendation systems.
Stored in graph databases and queried via specialized languages like SPARQL.
Enables relationship pattern detection and complex network analysis.
1.2.6 AUDIO, IMAGE, AND VIDEO DATA
Multimedia data types carrying sound and visual information.
Requires advanced data science techniques for recognition and analysis.
Includes challenges like large file sizes, diverse formats, and integration.
Example: Photos, recorded videos, audio files.
1.2.7 STREAMING DATA
Continuous, real-time data generated by thousands of sources simultaneously.
Examples: Social media feeds, financial transactions, IoT telemetry.
1.2.8 DIFFERENCE BETWEEN STRUCTURED AND UNSTRUCTURED DATA
Format: Structured – organized tables (rows & columns). Unstructured – no fixed format or organization.
Example: Structured – database records, Excel sheets. Unstructured – emails, videos, social media posts.
Ease of Processing: Structured – easily processed with SQL queries. Unstructured – requires advanced tools like NLP, image processing.
Usage: Structured – transactional, operational data. Unstructured – media files, natural language, sensor data.
1.3 DATA SCIENCE PROCESS
Overview – the data science process has six steps:
1. Discovery or setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation
1.4 DISCOVERY OR DEFINING RESEARCH GOAL
This step involves understanding the business context, framing the problem, and defining the research goals that will answer the business question.
1. Learning Business Domain – understand the industry/domain context of the problem
2. Resources – technology & tools, systems & infrastructure, data, team/people
3. Frame the Problem – define a clear problem statement
4. Identify Stakeholders – list those impacted, involved, or benefited
5. Interview Analytics Sponsor – sponsor = funding / high-level requirement provider
6. Develop Initial Hypotheses – form testable hypotheses for investigation
7. Identify Data Sources – check type, volume, and timespan of data
1.5 RETRIEVING DATA
This step is the collection of the data required for the project. It is the process of gaining a business understanding of the data you have and deciphering what each piece of data means.
1. Internal Data – start with company-owned data stored in repositories.
2. Data Repositories – structured systems like warehouses, lakes, and marts for storage.
3. External Data – acquire missing data from third-party, public, or government sources.
4. Data Quality Checks – clean, validate, and verify data to ensure reliability.
1.6 DATA PREPARATION
Data can have many inconsistencies, like missing values, blank columns, and incorrect data formats, which need to be cleaned. We need to process, explore, and condition data before modeling. Clean data gives better predictions.
1. Data Cleaning
2. Outlier Detection
3. Handling Missing Values
4. Handling Noisy Data
5. Correct Errors Early
6. Combine Data
7. Transform Data
1.6.1 DATA CLEANING
Remove errors, duplicates, and inconsistencies. A short pandas sketch of these tasks follows the list.
Tasks:
Fill in missing values
Standardize formats (date/time, text case)
Convert nominal → numeric values
Detect & smooth noisy data
Fix issues like typos, whitespace errors, capitalization mismatches
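A minimal sketch of these cleaning tasks using pandas; the table and column names ("city", "date", "sales") are invented for illustration:

import pandas as pd

# Toy data with typical quality issues: whitespace, case mismatches,
# a missing value, and a duplicate row
df = pd.DataFrame({
    "city":  [" chennai", "Chennai", "Chennai", "Madurai"],
    "date":  ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
    "sales": [100.0, 100.0, 100.0, None],
})

df["city"] = df["city"].str.strip().str.title()       # fix whitespace and capitalization
df["date"] = pd.to_datetime(df["date"])               # standardize the date format
df["sales"] = df["sales"].fillna(df["sales"].mean())  # fill in missing values
df = df.drop_duplicates()                             # remove duplicate rows
print(df)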
1.6.2 OUTLIER DETECTION
Outlier = data point that deviates significantly from the norm.
Detection: plots, min-max values, statistical models.
Applications: fraud detection, intrusion detection, medical & industrial uses.
Outliers may indicate errors or special patterns.
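One common statistical approach, sketched with numpy, flags points lying more than 1.5×IQR beyond the quartiles (the sample values are made up):

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 12, 10, 13])   # 95 deviates from the norm
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                  # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # -> [95]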
1.6.3 HANDLING MISSING VALUES
Missing values reduce accuracy & reliability. A short pandas sketch follows the list.
Approaches:
1. Ignore the tuple (if multiple attributes are missing)
2. Fill in manually (small datasets only)
3. Use a constant (e.g., "Unknown")
4. Replace with mean/median/mode
5. Use class-based averages
6. Predict the most probable value
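A brief pandas sketch of approaches 3 and 4; the columns ("age", "dept") are hypothetical:

import pandas as pd

df = pd.DataFrame({"age":  [25, None, 31, 28, None],
                   "dept": ["CSE", "ECE", None, "CSE", "ECE"]})

df["age"] = df["age"].fillna(df["age"].mean())   # replace with the mean
df["dept"] = df["dept"].fillna("Unknown")        # use a constant
# df["age"].fillna(df["age"].median()) would use the median instead
print(df)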
1.6.3.1 HANDLING NOISY DATA
Noisy data = errors or random variations in data. A binning sketch follows the list.
Techniques:
1. Smoothing & filtering
2. Aggregating values
3. Binning (grouping into ranges)
4. Statistical methods to remove noise
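A short sketch of binning with pandas, smoothing each value to its bin mean (the price list is a textbook-style toy example):

import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.cut(prices, bins=3)                      # group into 3 equal-width ranges
smoothed = prices.groupby(bins).transform("mean")  # replace each value with its bin mean
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))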
1.6.4 CORRECT ERRORS EARLY
Correcting errors in later stages = costly & time-consuming.
Benefits of early cleanup:
1. Avoids repeated cleansing in multiple projects
2. Prevents wrong business decisions
3. Detects root causes (defective equipment, process gaps, bugs in ETL)
1.6.5 COMBINE DATA
1. Joining Tables: Combine related data (using keys).
2. Appending Tables: Stack datasets → a larger combined set.
3. Using Views: Create virtual combinations to avoid duplication and storage issues.
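A compact pandas sketch of joining and appending; the tables and the key column "id" are invented. (Views are a database feature: in SQL, CREATE VIEW stores a query rather than a copy of the data, so they have no direct pandas analogue.)

import pandas as pd

students = pd.DataFrame({"id": [1, 2], "name": ["Asha", "Ravi"]})
marks    = pd.DataFrame({"id": [1, 2], "mark": [88, 92]})

joined = students.merge(marks, on="id")                      # joining tables on a key
more = pd.DataFrame({"id": [3], "name": ["Mia"]})
appended = pd.concat([students, more], ignore_index=True)    # appending (stacking) tables
print(joined)
print(appended)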
1.6.6 TRANSFORM DATA
Convert data into formats ready for modeling. A sketch of these methods follows the list.
Methods:
1. Variable Reduction → remove less useful variables, retain useful ones.
2. Normalization → scale values for comparisons.
3. Dummy Variables → convert categorical values to binary (0/1).
4. Use Euclidean distance for similarity measures (works best with fewer variables):
Euclidean distance = √((X₁ − X₂)² + (Y₁ − Y₂)²)
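A minimal sketch of normalization, dummy variables, and Euclidean distance; the toy columns ("height", "city") are assumptions for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({"height": [150, 160, 170], "city": ["Chennai", "Madurai", "Chennai"]})

# Normalization: min-max scale a column to [0, 1]
df["height_norm"] = (df["height"] - df["height"].min()) / (df["height"].max() - df["height"].min())

# Dummy variables: one binary 0/1 column per category
df = pd.get_dummies(df, columns=["city"], dtype=int)

# Euclidean distance between points (X1, Y1) = (1, 2) and (X2, Y2) = (4, 6)
p1, p2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.sqrt(np.sum((p1 - p2) ** 2)))   # sqrt(9 + 16) = 5.0
print(df)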
1.7 EXPLORATORY DATA ANALYSIS (OR) DATA EXPLORATION
Data exploration is about gaining a deeper understanding of the data: try to understand how variables interact with each other, the distribution of the data, and whether there are outliers. This step is also called Exploratory Data Analysis (EDA).
EDA = process of analyzing datasets using summary statistics & visualizations.
Goals: Understand data, detect patterns, spot anomalies, and guide modeling.
Techniques: Univariate, Bivariate, Multivariate analysis.
1.7.1 PURPOSE OF EDA
EDA helps data scientists to:
1. Maximize insight into the dataset
2. Uncover underlying structure
3. Extract key variables
4. Identify outliers & anomalies
5. Test assumptions
6. Develop simple (parsimonious) models
7. Optimize factor settings
1.7.2 KEY FUNCTIONS OF EDA
1. Describe data (basic stats, types)
2. Explore data distributions (shape, spread, skewness)
3. Check relationships between variables
4. Detect unusual patterns/outliers
5. Group data and compare across categories
6. Note differences & dependencies
1.7.3 EDA METHODS
Univariate Analysis → focus on one variable (PDF, CDF, boxplots).
Bivariate Analysis → relationship between two variables (scatterplots, violin plots, boxplots).
Multivariate Analysis → interactions among 3+ variables (heatmaps, correlation plots, PCA).
A small sketch follows.
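A small pandas sketch of univariate summaries and a bivariate correlation check; the dataset ("hours" studied vs "marks") is invented:

import pandas as pd

df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "marks": [35, 48, 55, 70, 78]})
print(df["hours"].describe())   # univariate: count, mean, std, quartiles
print(df.corr())                # bivariate: correlation between variables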
1.7.4 BOXPLOTS IN EDA
Boxplots = visualize distribution, spread, and outliers. A sketch follows the list.
Components:
1. Minimum value
2. Lower quartile (25%)
3. Median (50%)
4. Upper quartile (75%)
5. Maximum value
6. Whiskers (variability outside quartiles)
7. Interquartile range (middle 50%)
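A minimal matplotlib sketch drawing a boxplot and printing its quartiles (sample data made up):

import numpy as np
import matplotlib.pyplot as plt

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
print(np.percentile(data, [25, 50, 75]))   # lower quartile, median, upper quartile

plt.boxplot(data)                          # whiskers and outliers drawn automatically
plt.title("Distribution, spread, and outliers")
plt.show()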
1.8 BUILD THE MODELS / DATA MODELING
In this step, the actual model building process starts. The data scientist splits the dataset into training and testing sets. Techniques like association, classification, and clustering are applied to the training dataset. The model, once prepared, is tested against the "testing" dataset.
1. Model & Variable Selection
2. Model Execution
3. Model Diagnostics & Comparison
The process is iterative – refine until the best model is found.
1.8.1 MODEL & VARIABLE SELECTION
Choose the model type based on project goals.
Factors to consider:
Is the model production-ready & easy to implement?
How long will it stay relevant (maintenance)?
Does it need to be easily explainable?
Select the right variables/features for prediction.
1.8.2 MODEL EXECUTION
Implement using languages & libraries:
Python (Scikit-learn, StatsModels)
R, SQL (MADlib), Octave, WEKA
Commercial tools: SAS, SPSS Modeler, MATLAB, Alpine Miner
Evaluate results:
Model fit – R² / adjusted R²
Coefficients – interpretation for linear models
Predictor significance – test the influence of variables
Use the appropriate technique: regression (predict values) or classification (predict categories).
1.8.3 MODEL DIAGNOSTICS & COMPARISON
Build multiple models and compare performance.
Use the holdout method (sketched below):
Split data into training (70–80%) & testing (20–30%).
Train on one set, test on the other.
Helps check the generalization ability of the model.
Drawbacks of holdout:
1. Needs an extra dataset.
2. A single split may give a misleading error estimate.
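A minimal scikit-learn sketch of a holdout split; the arrays are arbitrary toy data:

from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]

# 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # -> 7 3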
EXAMPLE – HOUSE PRICE PREDICTION
Input variables: square footage + number of rooms
Target variable: house price
Workflow (sketched below):
Train the model on 20 rows
Test on 10 rows
Measure prediction accuracy
Shows how regression models are used in practice.
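A compact regression sketch of this workflow; the data is synthetic (generated on the spot), not real housing prices:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, 30)
rooms = rng.integers(1, 6, 30).astype(float)
price = 50 * sqft + 10000 * rooms + rng.normal(0, 5000, 30)   # synthetic target

X = np.column_stack([sqft, rooms])
X_train, X_test, y_train, y_test = train_test_split(X, price, train_size=20, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # train on 20 rows
print(model.score(X_test, y_test))                 # R² on the 10 held-out rows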
1.9 PRESENTATION AND AUTOMATION
Deliver the final baselined model with reports, code, and technical documents in this stage. The model is deployed into a real-time production environment after thorough testing.
1. Deliverables: Reports, briefings, technical docs, code
2. Pilot projects → production deployment
3. Skills needed: Communication + business alignment
4. Goal: Integrate models into tools & workflows
1.10 DATA MINING
Data mining refers to extracting or mining knowledge from large amounts of data. It is the process of discovering interesting patterns or knowledge from a large amount of data stored in databases, data warehouses, or other information repositories.
WHY DATA MINING?
1. Knowledge Discovery – Find hidden patterns & correlations.
2. Data Visualization – Present data in meaningful ways.
3. Data Correction – Identify & fix errors/inconsistencies.
1.10.1 FUNCTIONS OF DATA MINING
1. Characterization – Summarize key features (e.g., student profiles).
2. Association – Discover rules/patterns (e.g., market basket analysis).
3. Classification – Build models to categorize data.
4. Prediction – Estimate unknown/future values.
5. Clustering – Group similar objects (taxonomy formation).
6. Evolution Analysis – Study patterns over time.
MINING TASK TYPES
1.10.2 Predictive Tasks:
Supervised learning → classification, regression, time-series.
Uses historical data to forecast outcomes.
Applications: credit scoring, customer behavior, demand forecasting.
1.10.3 Descriptive Tasks:
Summarize & visualize past data.
Techniques: aggregation, reporting, BI dashboards.
Helps understand trends & influence future actions.
1.10.4 ARCHITECTURE OF A DATA MINING SYSTEM
Data Sources – Databases, warehouses, web & files.
Data Warehouse Server – Fetches & integrates relevant data.
Knowledge Base – Stores prior knowledge, user beliefs, rules.
Mining Engine – Core component (classification, clustering, prediction).
Pattern Evaluation Module – Filters interesting/important patterns.
Graphical User Interface – User-friendly interaction & result display.
1.10.5 CLASSIFICATION OF DATA MINING SYSTEMS
Data mining sits at the confluence of several disciplines: database systems, statistics, machine learning, information science, visualization, and other disciplines.
MULTI-DIMENSIONAL VIEW OF DATA MINING CLASSIFICATION
Classification Algorithms: Different algorithms used for classification, such as Decision Trees, SVM, Neural Networks, k-NN, Bayesian.
Learning Paradigm: Supervised (uses labeled data) vs unsupervised (e.g., clustering; no labeled data).
Classification Type: Binary, multi-class, multi-label, hierarchical (nested categories).
Approach: Discriminative (boundary-based) vs generative (distribution estimation).
Data Types: Structured (tables), unstructured (text, images), time-series, spatial, etc.
Application Domain: Domains like finance, healthcare, marketing, image recognition, with domain-specific features and models.
Tools & Platforms: Software examples: Python (Scikit-learn), R, WEKA, SAS, SPSS, MATLAB, etc.
Model Complexity: Simple (linear) to complex (deep learning, ensemble methods).
Evaluation Techniques: Metrics such as accuracy, precision, recall, F1-score, ROC, cross-validation, holdout method.
1.11 DATA WAREHOUSING
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources; it supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
What is a Data Warehouse?
Subject-oriented, integrated, time-variant, non-volatile collection of data.
Stores historical data to support management decisions.
Separate from operational databases (focus on analysis vs transactions).
1.11.1 KEY CHARACTERISTICS OF DATA WAREHOUSES
Subject-Oriented: Organized by major business subjects (e.g., sales).
Integrated: Consistent naming, formats, and value representations.
Non-volatile: Data usually read-only; no deletions or real-time updates.
Time-variant: Stores historical data over months or years.
Goals of Data Warehousing
Facilitate reporting and analysis.
Maintain historical information.
Support strategic decision-making.
How Organizations Use Data Warehousing
Improve customer focus via buying pattern analysis.
Optimize product portfolios and reposition products.
Analyze operations for profit optimization.
Manage customer relationships and control costs.
DATA WAREHOUSE VS DATABASE
Purpose: Database – supports day-to-day operations (transactions). Data Warehouse – stores historical data for analysis & reporting.
Data Type: Database – current, real-time data. Data Warehouse – time-variant, historical data.
Update Frequency: Database – frequent updates, inserts, and deletes. Data Warehouse – mostly read-only, periodic batch loads.
Structure: Database – application-oriented, normalized for transactions. Data Warehouse – subject-oriented, denormalized for analysis.
1.11.2 MULTI-TIER ARCHITECTURE OF A DATA WAREHOUSE
1. Single-tier: Minimizes data redundancy; limited scalability.
2. Two-tier: Separates resources & warehouse; small scale.
3. Three-tier (multi-tier):
Bottom tier: Warehouse database server (stores cleansed data).
Middle tier: OLAP server (arranges data for analysis using ROLAP or MOLAP).
Top tier: Front-end tools for querying, reporting, visualization, and analytics.
OLAP servers can interact with both relational and multidimensional databases, which lets them collect data based on broader parameters.
1.11.3 NEED FOR A DATA WAREHOUSE
Business users need summarized, easy-to-understand data.
Store historical, time-variant data for trend analysis.
Support strategic decision-making.
Ensure data consistency across sources.
Provide high-speed query response for ad hoc demands.
1.11.4 BENEFITS OF DATA WAREHOUSING
Better understanding of business trends for forecasting.
Efficient handling of large data volumes with good performance.
Simplified data structure for end-user navigation & queries.
Easier to build and maintain complex queries compared to normalized databases.
Efficiently manage high demand for information.
1.11.5 DIFFERENCE BETWEEN ODS AND DATA WAREHOUSE
Purpose: ODS – supports day-to-day operational processes; provides a current, up-to-date data view. DW – supports decision-making and analysis with historical data.
Data Content: ODS – current, volatile data; frequently updated. DW – historical, consolidated, and summarized data.
Query Complexity: ODS – simple and fast queries on recent data. DW – complex, ad hoc queries over large datasets.
Data Volatility: ODS – highly volatile; data continuously overwritten and updated. DW – non-volatile; data is read-only and stable over time.
Data Schema: ODS – holds data as per the source schema; smaller scope, often normalized. DW – uses denormalized schemas like star or snowflake for analysis.
Data Integration: ODS – integrates data from multiple sources in near real-time. DW – integrates and consolidates data from various sources after ETL.
User Types: ODS – operational staff needing up-to-date transactional info. DW – analysts, managers, executives requiring comprehensive insights.
Performance Focus: ODS – optimized for high-speed transactional queries and updates. DW – optimized for read-heavy complex analysis and reporting.
Data Volume: ODS – smaller volume focused on current operations. DW – large volume; stores years of data for historical analysis.
1.11.6 METADATA IN DATA WAREHOUSING
Metadata = "data about data," crucial for managing a data warehouse. It acts as a roadmap, defining warehouse objects and contents. It helps developers understand the structure and users recognize the meaning of the data. Includes data definitions, time stamps, and source info.
1.12 BASIC STATISTICAL DESCRIPTIONS OF DATA
Importance of Basic Statistical Descriptions:
1. Provide an overall picture of the data.
2. Identify noise, outliers, and data characteristics.
3. Understand data distribution using central tendency & dispersion measures.
For data preprocessing to be successful, it is essential to have an overall picture of our data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
3 MAJOR STATISTICAL DESCRIPTIONS
1.12.1 Measures of central tendency: mean, median, mode, and midrange.
1.12.2 Measures of data dispersion: quartiles, interquartile range (IQR), and variance.
1.12.3 Graphic displays of basic statistical descriptions.
1.12.1 MEASURES OF CENTRAL TENDENCY
Measures of central tendency represent the center or typical value of a dataset, summarizing the data with a single value.
EXAMPLE: MONTHLY SALARIES (USD)
Month – Salary
January – 105
February – 95
March – 105
April – 105
May – 100
Mean = (105 + 95 + 105 + 105 + 100) / 5 = 102
Median = middle value when ordered (95, 100, 105, 105, 105) = 105
Mode = most frequent value = 105
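The same computations using Python's built-in statistics module:

import statistics

salaries = [105, 95, 105, 105, 100]
print(statistics.mean(salaries))     # 102
print(statistics.median(salaries))   # 105 (middle of 95, 100, 105, 105, 105)
print(statistics.mode(salaries))     # 105 (most frequent)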
1.12.2 MEASURES OF DISPERSION
Measures of data dispersion describe the spread of the data; they include quartiles, the interquartile range (IQR), and variance.
DIFFERENCE BETWEEN STANDARD DEVIATION AND VARIANCE
Variance is the average of the squared deviations from the mean; standard deviation is the square root of the variance, expressed in the same units as the data.
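A quick numeric check of that relationship with numpy (the sample values are arbitrary):

import numpy as np

data = [95, 100, 105, 105, 105]                # mean = 102
var = np.var(data)                             # population variance -> 16.0
std = np.std(data)                             # standard deviation  -> 4.0
print(var, std, np.isclose(std ** 2, var))     # std squared equals the variance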
1.12.3 GRAPHIC DISPLAYS OF BASIC STATISTICAL DESCRIPTIONS
1. Scatter Plot: Relationship between two variables; identifies correlations.
2. Histogram: Data distribution grouped in bins; frequency of ranges.
3. Line Graph: Shows trends over continuous variables or time.
4. Pie Chart: Shows proportions of categories; easy for comparisons but limited for close values.
1. SCATTER DIAGRAM
Also called a scatter plot or X-Y graph.
Shows the relationship between two variables (e.g., height vs mass).
One variable on the X-axis, another on the Y-axis.
Reveals patterns but not causation directly.
Can indicate common causes or surrogate variables.
2. HISTOGRAM
Summarizes discrete or continuous data.
Data grouped into intervals called bins (e.g., 10–19, 20–29).
Bars represent the frequency/percentage in each bin.
Width proportional to bin size; height proportional to frequency. A minimal sketch follows.
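A minimal matplotlib sketch grouping made-up marks into the kinds of bins described above:

import matplotlib.pyplot as plt

marks = [12, 18, 22, 25, 27, 33, 35, 38, 41, 45, 47, 52]
plt.hist(marks, bins=[10, 20, 30, 40, 50, 60], edgecolor="black")   # bins 10-19, 20-29, ...
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.show()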
3. LINE GRAPHS
Also called stick graphs.
Show relationships over continuous variables or time (e.g., monthly rainfall).
Useful for identifying patterns, trends, seasonal effects, turning points.
Multiple data series can be compared on the same graph.
Simple to interpret and visualize changes (peaks/valleys).
4. PIE CHARTS
A circle divided into sectors representing proportions of a whole.
Easy-to-read "pie slices" show relative sizes.
Common uses: business product success, school time allocation, home expenses.
Simple to understand which category is largest.
LIMITATIONS OF PIE CHARTS:
1. It is difficult to tell the difference between estimates of similar size. Error bars or confidence limits cannot be shown on a pie graph, and legends and labels are hard to align and read.
2. The human visual system is more efficient at perceiving and discriminating between lines and line lengths than between two-dimensional areas and angles.
3. Pie graphs simply don't work well when comparing data.