SlideShare a Scribd company logo
1 of 57
DEPT & SEM :
SUBJECT NAME :
COURSE CODE :
UNIT :
PREPARED BY :
COURSE: FDS UNIT: 3 Pg. 1
MCA & V SEM
FUNDAMENTALS OF DATA SCIENCE
FDS
IV
Dr K HEMALATHA
COURSE: FDS UNIT: 3 Pg. 2
OUTLINE
• Data Objects and Types
• Measuring the center of Tendency
• Measuring the dispersion of the data
• Five Number summary
• Data Warehouse
COURSE: FDS UNIT: 3 Pg. 3
Data Objects and Attribute Types:
Data Objects:
A data object represents an entity.
sales database – customers, store items, and sales
medical database- patients
university database-students, professors, and courses
Data objects are typically described by attributes.
If the data objects are stored in a database, they are data tuples.
the rows of a database - data objects
the columns - the attributes.
COURSE: FDS UNIT: 3 Pg. 4
What Is an Attribute?
An attribute is a data field, representing a characteristic or feature of a data object.
The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature.
The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term
feature, while statisticians prefer the term variable.
Data mining and database professionals commonly use the term attribute.
COURSE: FDS UNIT: 3 Pg. 5
Attributes describing a customer object can include, for example, customer ID, name, and address.
The type of an attribute is determined by the set of possible values—nominal, binary, ordinal, or numeric—the
attribute can have.
COURSE: FDS UNIT: 3 Pg. 6
Nominal Attributes:
Nominal means “relating to names.”
The values of a nominal attribute are symbols or names of things.
Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as
categorical.
Example:
Suppose that hair color and marital status are two attributes describing person objects.
Possible values for hair color are black, brown, blond, red, auburn, gray, and white.
The attribute marital status can take on the values single, married, divorced, and widowed.
Both hair color and marital status are nominal attributes.
Another example of a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and so on.
COURSE: FDS UNIT: 3 Pg. 7
do not have any meaningful order about them and are not quantitative
Hence, no sense to find the mean (average) value or median (middle) value for such an attribute, given a set of
objects.
One thing that is of interest, however, is the attribute’s most commonly occurring value.
This value, known as the mode, is one of the measures of central tendency.
COURSE: FDS UNIT: 3 Pg. 8
Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where
0 typically means that the attribute is absent, and 1 means that it is present.
Binary attributes are referred to as Boolean if the two states correspond to true and false.
COURSE: FDS UNIT: 3 Pg. 9
Examples:
Smoker: 1 indicates that the patient smokes,
0 indicates that the patient does not.
medical test: 1 means the result positive
0 means the result is negative.
COURSE: FDS UNIT: 3 Pg. 10
A binary attribute is symmetric if :
both of its states are equally valuable and carry the same weight;
that is, there is no preference on which outcome should be coded as 0 or 1
One such example could be the attribute gender having the states male and female.
A binary attribute is asymmetric if:
the outcomes of the states are not equally important
By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by
0 (e.g., HIV negative).
COURSE: FDS UNIT: 3 Pg. 11
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but
the magnitude between successive values is not known.
Examples: The examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and professional rank.
Professional ranks can be enumerated in a sequential order: for example, assistant, associate, and full for professors,
and private, private first class, specialist, corporal, and sergeant for army ranks.
COURSE: FDS UNIT: 3 Pg. 12
Customer satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral,
3: satisfied, and 4: very satisfied.
The central tendency of an ordinal attribute can be represented by its mode and its median (the middle value in an
ordered sequence), but the mean cannot be defined.
COURSE: FDS UNIT: 3 Pg. 13
Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
Numeric attributes can be
interval-scaled
ratio-scaled
COURSE: FDS UNIT: 3 Pg. 14
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units.
The values of interval-scaled attributes have order and can be positive, 0, or negative.
Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference
between values
COURSE: FDS UNIT: 3 Pg. 15
A temperature attribute is interval-scaled.
Suppose that we have the outdoor temperature value for a number of different days, where each day is an object.
By ordering the values, we obtain a ranking of the objects with respect to temperature.
In addition, we can quantify the difference between values. For example, a temperature of 20◦C is five degrees
higher than a temperature of 15◦C.
Calendar dates are another example. For instance, the years 2002 and 2010 are eight years apart.
COURSE: FDS UNIT: 3 Pg. 16
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point.
That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and
mode.
COURSE: FDS UNIT: 3 Pg. 17
Examples. The examples of ratio-scaled attributes include count attributes such as years of experience (e.g., the
objects are employees) and number of words (e.g., the objects are documents).
Additional examples include attributes to measure weight, height, latitude and longitude coordinates.
COURSE: FDS UNIT: 3 Pg. 18
Discrete versus Continuous Attributes
Classification algorithms developed from the field of machine learning often talk of attributes as being either
discrete or continuous.
Each type may be processed differently.
A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers.
The attributes hair color, smoker, medical test, and drink size each have a finite number of values, and so are
discrete.
Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes or, the values 0 to 110
for the attribute age.
COURSE: FDS UNIT: 3 Pg. 19
An attribute is countably infinite if the set of possible values is infinite but the values can be put in a one-to-
one correspondence with natural numbers.
For example, the attribute customer ID is countably infinite. The number of customers can grow to infinity, but
in reality, the actual set of values is countable (where the values can be put in one-to-one correspondence with
the set of integers).
Zip codes are another example. If an attribute is not discrete, it is continuous. The terms numeric attribute and
continuous attribute are often used interchangeably in the literature.
In practice, real values are represented using a finite number of digits. Continuous attributes are typically
represented as floating-point variables.
COURSE: FDS UNIT: 3 Pg. 20
Basic Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have an overall picture of your data.
Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be
treated as noise or outliers.
COURSE: FDS UNIT: 3 Pg. 21
Measuring the Central Tendency: Mean, Median, and Mode
Central Tendency : If we were plot the observations, where would most of the values fall?
Measures of central tendency:
Mean
Median
Mode
Midrange.
COURSE: FDS UNIT: 3 Pg. 22
Mean
The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean.
corresponds to the built-in aggregate function, average (avg() in SQL)
COURSE: FDS UNIT: 3 Pg. 23
Example: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, and 110.
Thus, the mean salary is $58,000.
COURSE: FDS UNIT: 3 Pg. 24
weighted arithmetic mean or the weighted average.
COURSE: FDS UNIT: 3 Pg. 25
The single most useful quantity for describing a dataset
Not always the best way of measuring the center of the data.
A major problem - sensitivity to extreme (e.g., outlier) values.
A small number of extreme values can corrupt the mean.
The mean salary at a company may be substantially pushed up by that of a few highly paid managers.
Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very low scores.
COURSE: FDS UNIT: 3 Pg. 26
To offset the effect, we can instead use the trimmed mean
which is the mean obtained after chopping off values at the high and low extremes.
For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the
mean.
We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable
information.
For skewed (asymmetric) data, a better measure of the center of data is the median
COURSE: FDS UNIT: 3 Pg. 27
Median
In probability and statistics, the median generally applies to numeric data;
however, we may extend the concept to ordinal data.
Suppose that a given data set of N values for an attribute X is sorted in increasing order.
If N is odd, then the median is the middle value of the ordered set.
If N is even, then the median is not unique; it is the two middlemost values and any value in between.
If X is a numeric attribute in this case, by convention, the median is taken as the average of the two middlemost
values.
COURSE: FDS UNIT: 3 Pg. 28
The median is expensive to compute when we have a large number of observations.
For numeric attributes, however, we can easily approximate the value.
Assume that data are grouped in intervals according to their xi data values and that the frequency (i.e., number of
data values) of each interval is known.
For example, employees may be grouped according to their annual salary in intervals such as $10–20,000, $20–
30,000, and so on.
COURSE: FDS UNIT: 3 Pg. 29
•where L1 is the lower boundary of the median interval,
N is the number of values in the entire data set,
is the sum of the frequencies of all of the intervals that are lower than
the median interval,
freqmedian is the frequency of the median interval,
width is the width of the median interval.
COURSE: FDS UNIT: 3 Pg. 30
Mode
The mode is another measure of central tendency.
The mode for a set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative attributes.
It is possible for the greatest frequency to correspond to several different values, which results in more than one
mode.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
COURSE: FDS UNIT: 3 Pg. 31
In general, a dataset with two or more modes is multimodal. At the other extreme, if each data value occurs only
once, then there is no mode.
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical relation:
mean−mode≈3×(mean−median).
This implies that the mode for unimodal frequency curves that are moderately skewed can easily be approximated if
the mean and median values are known.
COURSE: FDS UNIT: 3 Pg. 32
Midrange
The midrange can also be used to assess the central tendency of a numeric data set.
It is the average of the largest and smallest values in the set.
This measure is easy to compute using the SQL aggregate functions, max() and min().
Example: The midrange of the data is 30,000+110,000 / 2 =$70,000.
COURSE: FDS UNIT: 3 Pg. 33
COURSE: FDS UNIT: 3 Pg. 34
Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range:
The measures to assess the dispersion or spread of numeric data include range, quantiles, quartiles, percentiles,
and the interquartile range.
The five-number summary, which can be displayed as a boxplot, is useful in identifying outliers.
Variance and standard deviation also indicate the spread of a data distribution.
COURSE: FDS UNIT: 3 Pg. 35
Range, Quartiles, and Interquartile Range
The range of the set is the difference between the largest (max()) and smallest (min()) values.
Imagine that we can pick certain data points so as to split the data distribution into equal-size consecutive sets.
These data points are called quantiles.
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal size
consecutive sets.
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the
median.
COURSE: FDS UNIT: 3 Pg. 36
The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents
one-fourth of the data distribution. They are more commonly referred to as quartiles.
The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into100 equal-
sized consecutive sets. The median, quartiles, and percentiles are the most widely used forms of quantiles.
COURSE: FDS UNIT: 3 Pg. 37
COURSE: FDS UNIT: 3 Pg. 38
The quartiles give an indication of a distribution’s center, spread, and shape.
The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data.
The third quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or highest 25%) of the data.
The second quartile is the 50th percentile. As the median, it gives the center of the data distribution.
COURSE: FDS UNIT: 3 Pg. 39
The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the
middle half of the data.
This distance is called the inter quartile range (IQR) and is defined as IQR=Q3−Q1.
COURSE: FDS UNIT: 3 Pg. 40
Five-Number Summary, Boxplots, and Outliers
No single numeric measure of spread (e.g., IQR) is very useful for describing skewed distributions.
In the symmetric distribution, the median (and other measures of central tendency) splits the data into equal-
size halves. This does not occur for skewed distributions.
Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median.
A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5×IQR above
the third quartile or below the first quartile.
COURSE: FDS UNIT: 3 Pg. 41
Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a
fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as
well. This is known as the five-number summary.
The five-number summary of a distribution consists of
the median (Q2),
the quartiles Q1 and Q3,
and the smallest and largest individual observations
written in the order of Minimum, Q1, Median, Q3, Maximum.
COURSE: FDS UNIT: 3 Pg. 42
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles so that the box length is the interquartile range.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
COURSE: FDS UNIT: 3 Pg. 43
COURSE: FDS UNIT: 3 Pg. 44
Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion.
They indicate how spread out a data distribution is.
A low standard deviation means that the data observations tend to be very close to the mean, while a high standard
deviation indicates that the data are spread out over a large range of values.
COURSE: FDS UNIT: 3 Pg. 45
The variance of N observations, x1,x2,...,xN, for a numeric attribute X is
where 𝑥 is the mean value of the observations
The standard deviation, σ,
the variance, σ2.
COURSE: FDS UNIT: 3 Pg. 46
The computation of the variance and standard deviation is scalable in large databases
COURSE: FDS UNIT: 3 Pg. 47
CS 336 47
Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
 Different interfaces
 Different data representations
 Duplicate and inconsistent information
Personal
Databases
Digital Libraries
Scientific Databases
World
Wide
Web
COURSE: FDS UNIT: 3 Pg. 48
COURSE: FDS UNIT: 3 Pg. 49
COURSE: FDS UNIT: 3 Pg. 50
COURSE: FDS UNIT: 3 Pg. 51
COURSE: FDS UNIT: 3 Pg. 52
Differences between Operational Database Systems and Data Warehouses
S.No. Data Warehouse (OLAP) Operational Database(OLTP)
1 It involves historical processing of information. It involves day-to-day processing.
2 OLAP systems are used by knowledge workers such as executives,
managers, and analysts.
OLTP systems are used by clerks, DBAs, or database professionals.
3 It is used to analyze the business. It is used to run the business.
4 It focuses on Information out. It focuses on Data in.
5 It is based on Star Schema, Snowflake Schema, and Fact Constellation
Schema.
It is based on Entity Relationship Model.
6 It focuses on Information out. It is application oriented.
7 It contains historical data. It contains current data.
8 It provides summarized and consolidated data. It provides primitive and highly detailed data.
9 It provides summarized and multidimensional view of data. It provides detailed and flat relational view of data.
10 The number of users is in hundreds. The number of users is in thousands.
11 The number of records accessed is in millions. The number of records accessed is in tens.
12 The database size is from 100GB to 100 TB. The database size is from 100 MB to 100 GB.
13 These are highly flexible. It provides high performance.
COURSE: FDS UNIT: 3 Pg. 53
Data Warehousing: A Multitiered Architecture
•Bottom Tier − The bottom tier of the architecture is
the data warehouse database server. It is the
relational database system. We use the back end
tools and utilities to feed data into the bottom tier.
These back end tools and utilities perform the Extract,
Clean, Load, and refresh functions.
•Middle Tier − In the middle tier, we have the OLAP
Server that can be implemented in either of the
following ways.
• By Relational OLAP (ROLAP), which is an
extended relational database management
system. The ROLAP maps the operations on
multidimensional data to standard relational
operations.
• By Multidimensional OLAP (MOLAP) model,
which directly implements the
multidimensional data and operations.
•Top-Tier − This tier is the front-end client layer. This
layer holds the query tools and reporting tools,
analysis tools and data mining tools.
COURSE: FDS UNIT: 3 Pg. 54
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual
Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse.
It is easy to build a virtual warehouse.
Building a virtual warehouse requires excess capacity on operational database servers.
COURSE: FDS UNIT: 3 Pg. 55
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are
confined to subjects.
Points to remember about data marts −
Window-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-
cost servers.
The implementation data mart cycles is measured in short periods of time, i.e., in weeks rather than months or
years.
The life cycle of a data mart may be complex in long run, if its planning and design are not organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is departmentally structured data warehouse.
Data mart are flexible.
COURSE: FDS UNIT: 3 Pg. 56
Enterprise Warehouse
An enterprise warehouse collects all the information and the subjects spanning an entire organization
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
COURSE: FDS UNIT: 3 Pg. 57
Extraction, Transformation, and Loading
Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and
partitions. Refresh, which propagates the updates from the data sources to the warehouse.
Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems usually provide a
good set of data warehouse management tools.
Data cleaning and data transformation are important steps in improving the data quality and, subsequently,
the data mining results.

More Related Content

Similar to FDS PPT_Unit-5.pptx fundamentals of data science

Desigining of Database - ER Model
Desigining of Database - ER ModelDesigining of Database - ER Model
Desigining of Database - ER ModelAjay Chhimpa
 
Statistics.pptx
Statistics.pptxStatistics.pptx
Statistics.pptxXadDax1
 
MMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptxMMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptxPETTIROSETALISIC
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1Ashwini Mathur
 
3. Chapter Three.pdf
3. Chapter Three.pdf3. Chapter Three.pdf
3. Chapter Three.pdffikadumola
 
Entity relationship diagram (erd)
Entity relationship  diagram (erd)Entity relationship  diagram (erd)
Entity relationship diagram (erd)Shahariar Alam
 
Fundamentals of database system - Data Modeling Using the Entity-Relationshi...
Fundamentals of database system  - Data Modeling Using the Entity-Relationshi...Fundamentals of database system  - Data Modeling Using the Entity-Relationshi...
Fundamentals of database system - Data Modeling Using the Entity-Relationshi...Mustafa Kamel Mohammadi
 
Introduction to Data (1).pptx
Introduction to Data (1).pptxIntroduction to Data (1).pptx
Introduction to Data (1).pptxSubhamitaKanungo
 
Entity Relationship Diagram
Entity Relationship DiagramEntity Relationship Diagram
Entity Relationship DiagramRakhi Mukherji
 
Jobs manager vs supervisor.pptx
Jobs manager vs supervisor.pptxJobs manager vs supervisor.pptx
Jobs manager vs supervisor.pptxprosofts1
 
Statistics: Chapter One
Statistics: Chapter OneStatistics: Chapter One
Statistics: Chapter OneSaed Jama
 
Download different material from slide share
Download different material from slide shareDownload different material from slide share
Download different material from slide sharefanta teferi
 

Similar to FDS PPT_Unit-5.pptx fundamentals of data science (20)

Desigining of Database - ER Model
Desigining of Database - ER ModelDesigining of Database - ER Model
Desigining of Database - ER Model
 
Statistics.pptx
Statistics.pptxStatistics.pptx
Statistics.pptx
 
MMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptxMMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptx
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1
 
3. Chapter Three.pdf
3. Chapter Three.pdf3. Chapter Three.pdf
3. Chapter Three.pdf
 
Types of data
Types of dataTypes of data
Types of data
 
Entity relationship diagram (erd)
Entity relationship  diagram (erd)Entity relationship  diagram (erd)
Entity relationship diagram (erd)
 
Fundamentals of database system - Data Modeling Using the Entity-Relationshi...
Fundamentals of database system  - Data Modeling Using the Entity-Relationshi...Fundamentals of database system  - Data Modeling Using the Entity-Relationshi...
Fundamentals of database system - Data Modeling Using the Entity-Relationshi...
 
Data analysis copy
Data analysis   copyData analysis   copy
Data analysis copy
 
Basic math
Basic mathBasic math
Basic math
 
Introduction to Data (1).pptx
Introduction to Data (1).pptxIntroduction to Data (1).pptx
Introduction to Data (1).pptx
 
Entity Relationship Diagram
Entity Relationship DiagramEntity Relationship Diagram
Entity Relationship Diagram
 
Jobs manager vs supervisor.pptx
Jobs manager vs supervisor.pptxJobs manager vs supervisor.pptx
Jobs manager vs supervisor.pptx
 
SIMPLE START TO BIOSTATISTICS PART 1.0.pdf
SIMPLE START TO BIOSTATISTICS PART 1.0.pdfSIMPLE START TO BIOSTATISTICS PART 1.0.pdf
SIMPLE START TO BIOSTATISTICS PART 1.0.pdf
 
Machine learning session1
Machine learning   session1Machine learning   session1
Machine learning session1
 
UNIT II DBMS.pptx
UNIT II DBMS.pptxUNIT II DBMS.pptx
UNIT II DBMS.pptx
 
Er model
Er modelEr model
Er model
 
Statistics: Chapter One
Statistics: Chapter OneStatistics: Chapter One
Statistics: Chapter One
 
Download different material from slide share
Download different material from slide shareDownload different material from slide share
Download different material from slide share
 
ER Model in DBMS
ER Model in DBMSER Model in DBMS
ER Model in DBMS
 

More from JyoReddy9

operating system over view.ppt operating sysyems
operating system over view.ppt operating sysyemsoperating system over view.ppt operating sysyems
operating system over view.ppt operating sysyemsJyoReddy9
 
Module 2ppt.pptx divid and conquer method
Module 2ppt.pptx divid and conquer methodModule 2ppt.pptx divid and conquer method
Module 2ppt.pptx divid and conquer methodJyoReddy9
 
UNIT-V.pdf daa unit material 5 th unit ppt
UNIT-V.pdf daa unit material 5 th unit pptUNIT-V.pdf daa unit material 5 th unit ppt
UNIT-V.pdf daa unit material 5 th unit pptJyoReddy9
 
UNIT-II.pptx
UNIT-II.pptxUNIT-II.pptx
UNIT-II.pptxJyoReddy9
 
DISJOINT SETS.pptx
DISJOINT SETS.pptxDISJOINT SETS.pptx
DISJOINT SETS.pptxJyoReddy9
 
os storage mass.ppt
os storage mass.pptos storage mass.ppt
os storage mass.pptJyoReddy9
 

More from JyoReddy9 (6)

operating system over view.ppt operating sysyems
operating system over view.ppt operating sysyemsoperating system over view.ppt operating sysyems
operating system over view.ppt operating sysyems
 
Module 2ppt.pptx divid and conquer method
Module 2ppt.pptx divid and conquer methodModule 2ppt.pptx divid and conquer method
Module 2ppt.pptx divid and conquer method
 
UNIT-V.pdf daa unit material 5 th unit ppt
UNIT-V.pdf daa unit material 5 th unit pptUNIT-V.pdf daa unit material 5 th unit ppt
UNIT-V.pdf daa unit material 5 th unit ppt
 
UNIT-II.pptx
UNIT-II.pptxUNIT-II.pptx
UNIT-II.pptx
 
DISJOINT SETS.pptx
DISJOINT SETS.pptxDISJOINT SETS.pptx
DISJOINT SETS.pptx
 
os storage mass.ppt
os storage mass.pptos storage mass.ppt
os storage mass.ppt
 

Recently uploaded

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 

Recently uploaded (20)

Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 

FDS PPT_Unit-5.pptx fundamentals of data science

  • 1. DEPT & SEM : SUBJECT NAME : COURSE CODE : UNIT : PREPARED BY : COURSE: FDS UNIT: 3 Pg. 1 MCA & V SEM FUNDAMENTALS OF DATA SCIENCE FDS IV Dr K HEMALATHA
  • 2. COURSE: FDS UNIT: 3 Pg. 2 OUTLINE • Data Objects and Types • Measuring the center of Tendency • Measuring the dispersion of the data • Five Number summary • Data Warehouse
  • 3. COURSE: FDS UNIT: 3 Pg. 3 Data Objects and Attribute Types: Data Objects: A data object represents an entity. sales database – customers, store items, and sales medical database- patients university database-students, professors, and courses Data objects are typically described by attributes. If the data objects are stored in a database, they are data tuples. the rows of a database - data objects the columns - the attributes.
  • 4. COURSE: FDS UNIT: 3 Pg. 4 What Is an Attribute? An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature. The term dimension is commonly used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable. Data mining and database professionals commonly use the term attribute.
  • 5. COURSE: FDS UNIT: 3 Pg. 5 Attributes describing a customer object can include, for example, customer ID, name, and address. The type of an attribute is determined by the set of possible values—nominal, binary, ordinal, or numeric—the attribute can have.
  • 6. COURSE: FDS UNIT: 3 Pg. 6 Nominal Attributes: Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. Example: Suppose that hair color and marital status are two attributes describing person objects. Possible values for hair color are black, brown, blond, red, auburn, gray, and white. The attribute marital status can take on the values single, married, divorced, and widowed. Both hair color and marital status are nominal attributes. Another example of a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and so on.
  • 7. COURSE: FDS UNIT: 3 Pg. 7 do not have any meaningful order about them and are not quantitative Hence, no sense to find the mean (average) value or median (middle) value for such an attribute, given a set of objects. One thing that is of interest, however, is the attribute’s most commonly occurring value. This value, known as the mode, is one of the measures of central tendency.
  • 8. COURSE: FDS UNIT: 3 Pg. 8 Binary Attributes A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.
  • 9. COURSE: FDS UNIT: 3 Pg. 9 Examples: Smoker: 1 indicates that the patient smokes, 0 indicates that the patient does not. medical test: 1 means the result positive 0 means the result is negative.
  • 10. COURSE: FDS UNIT: 3 Pg. 10 A binary attribute is symmetric if : both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1 One such example could be the attribute gender having the states male and female. A binary attribute is asymmetric if: the outcomes of the states are not equally important By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).
  • 11. COURSE: FDS UNIT: 3 Pg. 11 Ordinal Attributes An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known. Examples: The examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and professional rank. Professional ranks can be enumerated in a sequential order: for example, assistant, associate, and full for professors, and private, private first class, specialist, corporal, and sergeant for army ranks.
  • 12. COURSE: FDS UNIT: 3 Pg. 12 Customer satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied. The central tendency of an ordinal attribute can be represented by its mode and its median (the middle value in an ordered sequence), but the mean cannot be defined.
  • 13. COURSE: FDS UNIT: 3 Pg. 13 Numeric Attributes A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled ratio-scaled
  • 14. COURSE: FDS UNIT: 3 Pg. 14 Interval-Scaled Attributes Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values
  • 15. COURSE: FDS UNIT: 3 Pg. 15 A temperature attribute is interval-scaled. Suppose that we have the outdoor temperature value for a number of different days, where each day is an object. By ordering the values, we obtain a ranking of the objects with respect to temperature. In addition, we can quantify the difference between values. For example, a temperature of 20◦C is five degrees higher than a temperature of 15◦C. Calendar dates are another example. For instance, the years 2002 and 2010 are eight years apart.
  • 16. COURSE: FDS UNIT: 3 Pg. 16 Ratio-Scaled Attributes A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.
  • 17. COURSE: FDS UNIT: 3 Pg. 17 Examples. The examples of ratio-scaled attributes include count attributes such as years of experience (e.g., the objects are employees) and number of words (e.g., the objects are documents). Additional examples include attributes to measure weight, height, latitude and longitude coordinates.
  • 18. COURSE: FDS UNIT: 3 Pg. 18 Discrete versus Continuous Attributes Classification algorithms developed from the field of machine learning often talk of attributes as being either discrete or continuous. Each type may be processed differently. A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. The attributes hair color, smoker, medical test, and drink size each have a finite number of values, and so are discrete. Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes or, the values 0 to 110 for the attribute age.
  • 19. COURSE: FDS UNIT: 3 Pg. 19 An attribute is countably infinite if the set of possible values is infinite but the values can be put in a one-to- one correspondence with natural numbers. For example, the attribute customer ID is countably infinite. The number of customers can grow to infinity, but in reality, the actual set of values is countable (where the values can be put in one-to-one correspondence with the set of integers). Zip codes are another example. If an attribute is not discrete, it is continuous. The terms numeric attribute and continuous attribute are often used interchangeably in the literature. In practice, real values are represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
  • 20. COURSE: FDS UNIT: 3 Pg. 20 Basic Statistical Descriptions of Data For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic statistical descriptions can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
  • 21. COURSE: FDS UNIT: 3 Pg. 21 Measuring the Central Tendency: Mean, Median, and Mode Central Tendency : If we were plot the observations, where would most of the values fall? Measures of central tendency: Mean Median Mode Midrange.
  • 22. COURSE: FDS UNIT: 3 Pg. 22 Mean The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean. corresponds to the built-in aggregate function, average (avg() in SQL)
  • 23. COURSE: FDS UNIT: 3 Pg. 23 Example: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, and 110. Thus, the mean salary is $58,000.
  • 24. COURSE: FDS UNIT: 3 Pg. 24 weighted arithmetic mean or the weighted average.
  • 25. COURSE: FDS UNIT: 3 Pg. 25 The single most useful quantity for describing a dataset Not always the best way of measuring the center of the data. A major problem - sensitivity to extreme (e.g., outlier) values. A small number of extreme values can corrupt the mean. The mean salary at a company may be substantially pushed up by that of a few highly paid managers. Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very low scores.
  • 26. COURSE: FDS UNIT: 3 Pg. 26 To offset the effect, we can instead use the trimmed mean which is the mean obtained after chopping off values at the high and low extremes. For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information. For skewed (asymmetric) data, a better measure of the center of data is the median
  • 27. COURSE: FDS UNIT: 3 Pg. 27 Median In probability and statistics, the median generally applies to numeric data; however, we may extend the concept to ordinal data. Suppose that a given data set of N values for an attribute X is sorted in increasing order. If N is odd, then the median is the middle value of the ordered set. If N is even, then the median is not unique; it is the two middlemost values and any value in between. If X is a numeric attribute in this case, by convention, the median is taken as the average of the two middlemost values.
  • 28. COURSE: FDS UNIT: 3 Pg. 28 The median is expensive to compute when we have a large number of observations. For numeric attributes, however, we can easily approximate the value. Assume that data are grouped in intervals according to their xi data values and that the frequency (i.e., number of data values) of each interval is known. For example, employees may be grouped according to their annual salary in intervals such as $10–20,000, $20– 30,000, and so on.
  • 29. COURSE: FDS UNIT: 3 Pg. 29 •where L1 is the lower boundary of the median interval, N is the number of values in the entire data set, is the sum of the frequencies of all of the intervals that are lower than the median interval, freqmedian is the frequency of the median interval, width is the width of the median interval.
  • 30. COURSE: FDS UNIT: 3 Pg. 30 Mode The mode is another measure of central tendency. The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
  • 31. COURSE: FDS UNIT: 3 Pg. 31 In general, a dataset with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode. For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical relation: mean−mode≈3×(mean−median). This implies that the mode for unimodal frequency curves that are moderately skewed can easily be approximated if the mean and median values are known.
  • 32. COURSE: FDS UNIT: 3 Pg. 32 Midrange The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the largest and smallest values in the set. This measure is easy to compute using the SQL aggregate functions, max() and min(). Example: The midrange of the data is 30,000+110,000 / 2 =$70,000.
  • 33. COURSE: FDS UNIT: 3 Pg. 33
  • 34. COURSE: FDS UNIT: 3 Pg. 34 Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range: The measures to assess the dispersion or spread of numeric data include range, quantiles, quartiles, percentiles, and the interquartile range. The five-number summary, which can be displayed as a boxplot, is useful in identifying outliers. Variance and standard deviation also indicate the spread of a data distribution.
  • 35. COURSE: FDS UNIT: 3 Pg. 35 Range, Quartiles, and Interquartile Range The range of the set is the difference between the largest (max()) and smallest (min()) values. Imagine that we can pick certain data points so as to split the data distribution into equal-size consecutive sets. These data points are called quantiles. Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal size consecutive sets. The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median.
  • 36. COURSE: FDS UNIT: 3 Pg. 36 The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles. The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into100 equal- sized consecutive sets. The median, quartiles, and percentiles are the most widely used forms of quantiles.
  • 37. COURSE: FDS UNIT: 3 Pg. 37
  • 38. COURSE: FDS UNIT: 3 Pg. 38 The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the 50th percentile. As the median, it gives the center of the data distribution.
  • 39. COURSE: FDS UNIT: 3 Pg. 39 The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the inter quartile range (IQR) and is defined as IQR=Q3−Q1.
  • 40. COURSE: FDS UNIT: 3 Pg. 40 Five-Number Summary, Boxplots, and Outliers No single numeric measure of spread (e.g., IQR) is very useful for describing skewed distributions. In the symmetric distribution, the median (and other measures of central tendency) splits the data into equal- size halves. This does not occur for skewed distributions. Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median. A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5×IQR above the third quartile or below the first quartile.
  • 41. COURSE: FDS UNIT: 3 Pg. 41 Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well. This is known as the five-number summary. The five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest and largest individual observations written in the order of Minimum, Q1, Median, Q3, Maximum.
  • 42. COURSE: FDS UNIT: 3 Pg. 42 Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows: Typically, the ends of the box are at the quartiles so that the box length is the interquartile range. The median is marked by a line within the box. Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations.
  • 43. COURSE: FDS UNIT: 3 Pg. 43
  • 44. COURSE: FDS UNIT: 3 Pg. 44 Variance and Standard Deviation Variance and standard deviation are measures of data dispersion. They indicate how spread out a data distribution is. A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
  • 45. COURSE: FDS UNIT: 3 Pg. 45 The variance of N observations, x1,x2,...,xN, for a numeric attribute X is where 𝑥 is the mean value of the observations The standard deviation, σ, the variance, σ2.
  • 46. COURSE: FDS UNIT: 3 Pg. 46 The computation of the variance and standard deviation is scalable in large databases
  • 47. COURSE: FDS UNIT: 3 Pg. 47 CS 336 47 Problem: Heterogeneous Information Sources “Heterogeneities are everywhere”  Different interfaces  Different data representations  Duplicate and inconsistent information Personal Databases Digital Libraries Scientific Databases World Wide Web
  • 48. COURSE: FDS UNIT: 3 Pg. 48
  • 49. COURSE: FDS UNIT: 3 Pg. 49
  • 50. COURSE: FDS UNIT: 3 Pg. 50
  • 51. COURSE: FDS UNIT: 3 Pg. 51
  • 52. COURSE: FDS UNIT: 3 Pg. 52 Differences between Operational Database Systems and Data Warehouses S.No. Data Warehouse (OLAP) Operational Database(OLTP) 1 It involves historical processing of information. It involves day-to-day processing. 2 OLAP systems are used by knowledge workers such as executives, managers, and analysts. OLTP systems are used by clerks, DBAs, or database professionals. 3 It is used to analyze the business. It is used to run the business. 4 It focuses on Information out. It focuses on Data in. 5 It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. It is based on Entity Relationship Model. 6 It focuses on Information out. It is application oriented. 7 It contains historical data. It contains current data. 8 It provides summarized and consolidated data. It provides primitive and highly detailed data. 9 It provides summarized and multidimensional view of data. It provides detailed and flat relational view of data. 10 The number of users is in hundreds. The number of users is in thousands. 11 The number of records accessed is in millions. The number of records accessed is in tens. 12 The database size is from 100GB to 100 TB. The database size is from 100 MB to 100 GB. 13 These are highly flexible. It provides high performance.
  • 53. COURSE: FDS UNIT: 3 Pg. 53 Data Warehousing: A Multitiered Architecture •Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use the back end tools and utilities to feed data into the bottom tier. These back end tools and utilities perform the Extract, Clean, Load, and refresh functions. •Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either of the following ways. • By Relational OLAP (ROLAP), which is an extended relational database management system. The ROLAP maps the operations on multidimensional data to standard relational operations. • By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional data and operations. •Top-Tier − This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.
  • 54. COURSE: FDS UNIT: 3 Pg. 54 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse Virtual Warehouse The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
  • 55. COURSE: FDS UNIT: 3 Pg. 55 Data Mart Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization. In other words, we can claim that data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects. Points to remember about data marts − Window-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low- cost servers. The implementation data mart cycles is measured in short periods of time, i.e., in weeks rather than months or years. The life cycle of a data mart may be complex in long run, if its planning and design are not organization-wide. Data marts are small in size. Data marts are customized by department. The source of a data mart is departmentally structured data warehouse. Data mart are flexible.
  • 56. COURSE: FDS UNIT: 3 Pg. 56 Enterprise Warehouse An enterprise warehouse collects all the information and the subjects spanning an entire organization It provides us enterprise-wide data integration. The data is integrated from operational systems and external information providers. This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
  • 57. COURSE: FDS UNIT: 3 Pg. 57 Extraction, Transformation, and Loading Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools and utilities include the following functions: Data extraction, which typically gathers data from multiple, heterogeneous, and external sources. Data cleaning, which detects errors in the data and rectifies them when possible. Data transformation, which converts data from legacy or host format to warehouse format. Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions. Refresh, which propagates the updates from the data sources to the warehouse. Besides cleaning, loading, refreshing, and metadata definition tools, data warehouse systems usually provide a good set of data warehouse management tools. Data cleaning and data transformation are important steps in improving the data quality and, subsequently, the data mining results.