A. D. Patel Institute Of Technology
Data Mining And Business Intelligence (2170715): A. Y. 2019-20
Data Compression – Numerosity Reduction
Prepared By :
Dhruv V. Shah (160010116053)
B.E. (IT) Sem - VII
Guided By :
Prof. Ravi D. Patel
(Dept Of IT , ADIT)
Department Of Information Technology
A.D. Patel Institute Of Technology (ADIT)
New Vallabh Vidyanagar , Anand , Gujarat
Outline
 Introduction
 Data Reduction Strategies
 Numerosity Reduction
 Numerosity Reduction Methods
1) Parametric Methods
1.1) Regression
1.2) Log-Linear Model
2) Non-Parametric Methods
2.1) Histograms
2.2) Clustering
2.3) Sampling
2.4) Data Cube Aggregation.
 References
Introduction
 Why do we need data reduction?
 A database/data warehouse may store terabytes of data.
 Complex data analysis/mining may take a very long time to run on the complete data set.
 Data Reduction:
 Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data.
 That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.
Data Reduction Strategies
 Data cube aggregation
 Attribute Subset Selection
 Numerosity reduction — e.g., fit data into models
 Dimensionality reduction - Data Compression
 Discretization and concept hierarchy generation
Numerosity Reduction
 What is Numerosity Reduction?
 These techniques replace the original data volume by alternative, smaller forms of data
representation.
 There are two types of numerosity reduction methods.
1) Parametric
2) Non-Parametric
Numerosity Reduction Methods
1) Parametric Methods :
 A model is used to estimate the data, so that only the model parameters need to be stored and
not the actual data.
 These methods assume that the data fits some model and estimate the model parameters.
 Regression and log-linear models are used for creating such models.
 Regression :
 Regression can be simple linear regression or multiple linear regression.
 When there is only a single independent attribute, the model is called simple linear
regression; when there are multiple independent attributes, it is called multiple linear regression.
 In linear regression, the data are modeled to fit a straight line.
 For example,
a random variable y can be modeled as a linear function of another random variable x with the
equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept of the
line, respectively.
In multiple linear regression, y is modeled as a linear function of two or more
predictor (independent) variables.
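A minimal sketch of how simple linear regression reduces data: instead of keeping every (x, y) pair, only the two coefficients a and b are kept and y is re-estimated from x when needed. The function name `fit_line` and the sample values are assumptions made for illustration; the fit is ordinary least squares.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns the coefficients (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data lying roughly on the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)
# Only a and b need to be stored; the y values are approximated on demand.
approx_ys = [a * x + b for x in xs]
```

The approximation is only as good as the linear assumption, which is why this counts as a lossy, parametric reduction.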
 Log-Linear Model :
 A log-linear model can be used to estimate the probability of each data point in a
multidimensional space for a set of discretized attributes, based on a smaller subset of
dimensional combinations.
 This allows a higher-dimensional data space to be constructed from lower-dimensional
spaces.
 Regression and log-linear models can both be used on sparse data, although their application
may be limited.
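As a rough illustration, the simplest log-linear model is the independence model, in which log p(a, b) = log p(a) + log p(b). The sketch below stores only the two one-dimensional marginals and rebuilds an estimate of each two-dimensional cell from them. The attribute names and tuples are hypothetical, and real log-linear models may also include interaction terms that this sketch leaves out.

```python
from collections import Counter

# Hypothetical tuples over two discretized attributes: (age_group, income_level)
rows = [
    ("young", "low"), ("young", "low"), ("young", "high"), ("young", "low"),
    ("old", "low"), ("old", "high"), ("old", "high"), ("old", "high"),
]

n = len(rows)
# One-dimensional marginal distributions: the only values that are stored
p_age = {k: c / n for k, c in Counter(age for age, _ in rows).items()}
p_income = {k: c / n for k, c in Counter(inc for _, inc in rows).items()}


def estimate(age, income):
    """Estimated cell probability under the independence model,
    i.e. log p(age, income) = log p(age) + log p(income)."""
    return p_age[age] * p_income[income]


estimated = estimate("young", "low")                  # 0.5 * 0.5
observed = rows.count(("young", "low")) / n           # 3 / 8
```

The gap between `estimated` and `observed` shows the price of the reduction: the lower-dimensional marginals do not capture the dependence between the two attributes.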
2) Non-Parametric Methods :
 Do not assume a model for the data.
 Non-parametric methods for storing reduced representations of the data include histograms,
clustering, sampling and data cube aggregation.
1) Histograms :
 Divide the data into buckets and store the average (or sum) for each bucket.
 Partitioning rules:
1) Equal-width:
The width of each bucket range is the same.
2) Equal-frequency (or equal-depth) :
Each bucket holds roughly the same number of values; binning is used to approximate the data distribution.
 Binning Method :
 Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
 Partition into equal-frequency bins of four values, then smooth by bin means (rounded to the nearest dollar):
Bin 1: 9, 9, 9, 9 (4 + 8 + 9 + 15)/4 = 9
Bin 2: 23, 23, 23, 23 (21 + 21 + 24 + 25)/4 = 22.75
Bin 3: 29, 29, 29, 29 (26 + 28 + 29 + 34)/4 = 29.25
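The binning example can be reproduced with a short sketch. It assumes equal-frequency bins of four values each and keeps, per bin, only the mean and the count; the function name `smooth_by_bin_means` is made up for this illustration.

```python
def smooth_by_bin_means(sorted_values, bin_size):
    """Split sorted values into equal-frequency bins and return (mean, count) per bin."""
    bins = [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]
    return [(sum(b) / len(b), len(b)) for b in bins]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
summary = smooth_by_bin_means(prices, bin_size=4)
# Twelve prices are replaced by three (mean, count) pairs.
```

The unrounded means are 9.0, 22.75 and 29.25, which round to the 9, 23 and 29 used on the slide.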
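3) V-optimal:
Of all possible histograms for a given number of buckets, the one with the least histogram variance
(a weighted sum of the variances of the original values that each bucket represents, where each bucket's weight is the number of values it holds).
4) MaxDiff:
Consider the difference between each pair of adjacent values. A bucket boundary is placed between
the pairs having the β - 1 largest differences, where β is the number of buckets.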
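The MaxDiff rule can be sketched in a few lines. The function name `maxdiff_buckets` and the sample values are assumptions for illustration; ties between equal gaps are broken in favour of the earlier pair, which is a choice made here and not something the slide specifies.

```python
def maxdiff_buckets(sorted_values, num_buckets):
    """Place a boundary at each of the (num_buckets - 1) largest adjacent gaps."""
    gaps = [(sorted_values[i + 1] - sorted_values[i], i)
            for i in range(len(sorted_values) - 1)]
    # Largest gaps first; on ties prefer the earlier position
    largest = sorted(gaps, key=lambda g: (-g[0], g[1]))[:num_buckets - 1]
    cut_after = sorted(i for _, i in largest)

    buckets, start = [], 0
    for i in cut_after:
        buckets.append(sorted_values[start:i + 1])
        start = i + 1
    buckets.append(sorted_values[start:])
    return buckets


# Hypothetical sorted values with two obvious gaps (3 -> 10 and 12 -> 30)
values = [1, 2, 3, 10, 11, 12, 30, 31]
buckets = maxdiff_buckets(values, num_buckets=3)
```

With three buckets the boundaries fall at the two largest gaps, giving the groups 1 to 3, 10 to 12 and 30 to 31.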
 Multi-dimensional histogram
Fig. Histogram with Singleton buckets
Fig. Equal-width Histogram
 List of prices:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
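For the price list above, both histogram types named in the figure captions can be computed directly. This is a sketch: the bucket width of 10 (ranges 1-10, 11-20, 21-30) is an assumption, since the figures themselves did not survive in the text.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Singleton buckets: one bucket per distinct price
singleton = Counter(prices)

# Equal-width buckets; width 10 is an assumed value
width = 10
counts = Counter((p - 1) // width for p in prices)
equal_width = {f"{k * width + 1}-{(k + 1) * width}": c
               for k, c in sorted(counts.items())}
```

The 52 prices are reduced to 13 value/count pairs with singleton buckets and to just three counts with equal-width buckets.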
2) Clustering :
 Clustering partitions the whole data set into groups (clusters) of similar objects.
 In data reduction, the cluster representations of the data are used to replace the actual data.
 It also helps to detect outliers in the data.
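One way to picture clustering as data reduction is to keep only each cluster's centroid and size. The sketch below uses plain k-means, which is just one possible clustering algorithm and is not named in the slides; the points and starting centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points; returns (centroids, cluster sizes)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance)
            nearest = min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep the old one if empty)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, [len(c) for c in clusters]


# Hypothetical data: three well-separated groups of three points each
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9), (1, 8)]
centroids, sizes = kmeans(points, centroids=[(0, 0), (10, 10), (0, 10)])
# Nine points are now represented by three centroids and their sizes.
```

A point that lies far from every centroid would stand out as a candidate outlier.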
Fig. Clustering (clusters C1, C2 and C3)
3) Sampling :
 Sampling: obtaining a small sample s to represent the whole data set N.
 Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the
data.
 Choose a representative subset of the data:
 Simple random sampling may perform very poorly in the presence of skew.
 Develop adaptive sampling methods:
 Stratified sampling
 Approximates the percentage of each class (or subpopulation of interest) in the
overall database.
 Used in conjunction with skewed data.
 Sampling may not reduce database I/Os, since data is read a page at a time.
Sampling Techniques :
 Simple Random Sample Without Replacement (SRSWOR)
 Simple Random Sample With Replacement (SRSWR)
 Cluster Sample
 Stratified Sample
Simple Random Sample with or without Replacement
Fig. SRSWOR & SRSWR (samples drawn from the raw data)
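The difference between the two simple random sampling schemes can be sketched with Python's standard `random` module. The data set, sample size and seed below are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
data = list(range(1, 101))       # hypothetical data set of N = 100 tuples
s = 10                           # sample size

# SRSWOR: a tuple that has been drawn cannot be drawn again
srswor = rng.sample(data, s)

# SRSWR: each drawn tuple is placed back, so it may be drawn again
srswr = rng.choices(data, k=s)
```

`srswor` never contains duplicates, whereas `srswr` may.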
Cluster Sample
 Tuples are grouped into M mutually disjoint clusters.
 An SRS of m clusters is taken, where m < M.
 Tuples in a database are usually retrieved a page at a time.
 Each page can be considered a cluster.
 A cluster sample is obtained by applying SRSWOR to the pages.
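A small sketch of cluster sampling with pages as clusters. The page size, the number of pages drawn and the function name `cluster_sample` are assumptions for illustration.

```python
import random


def cluster_sample(tuples, page_size, m, seed=0):
    """Treat each page as a cluster and draw m whole pages without replacement."""
    pages = [tuples[i:i + page_size] for i in range(0, len(tuples), page_size)]
    rng = random.Random(seed)
    chosen = rng.sample(pages, m)      # SRSWOR over the M pages; requires m <= M
    return [t for page in chosen for t in page]


tuples = list(range(100))              # hypothetical table of 100 tuples
sample = cluster_sample(tuples, page_size=10, m=3)
# M = 10 pages, m = 3 pages drawn, so the sample holds 30 tuples.
```

Because whole pages are read, the number of page I/Os is m, not the number of sampled tuples.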
Stratified Sample
 The data is divided into mutually disjoint parts called strata.
 An SRS is taken at each stratum.
 Representative samples are ensured even in the presence of skewed data.
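A sketch of stratified sampling on skewed data. The strata, the sampling fraction and the function name `stratified_sample` are hypothetical; drawing at least one tuple per stratum is a choice made here so that small strata are never lost.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw an SRSWOR of roughly `fraction` of the rows from every stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))   # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical skewed data: 90 "young" customers and only 10 "senior" customers
rows = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.1)
```

With a plain SRS of 10 tuples the small "senior" group could be missed entirely; here it is always represented.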
Cluster and Stratified Sampling
Fig. Cluster & Stratified Sampling
Features of Sampling :
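 The cost of obtaining a sample depends on the size of the sample, not the size of the data.
 Complexity is potentially sub-linear in the size of the data.
 Complexity grows linearly with the number of dimensions.
 Often used to estimate the answer to an aggregate query.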
4) Data Cube Aggregation :
 A data cube is generally used to easily interpret data. It is especially useful when representing
data together with dimensions as certain measures of business requirements.
 Every dimension of a cube represents a certain characteristic of the database.
 Data Cubes store multidimensional aggregated information.
 Data cubes provide fast access to precomputed, summarized data, thereby benefiting online
analytical processing (OLAP) as well as data mining.
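As a rough sketch of aggregation as data reduction, quarterly sales facts can be rolled up to yearly totals by summing the measure over the dimensions that are dropped. The fact table, dimension names and the function `aggregate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, branch, amount)
sales = [
    (2018, "Q1", "A", 200), (2018, "Q2", "A", 400),
    (2018, "Q3", "A", 350), (2018, "Q4", "A", 550),
    (2018, "Q1", "B", 100), (2018, "Q2", "B", 150),
    (2019, "Q1", "A", 300), (2019, "Q2", "B", 500),
]


def aggregate(facts, dims):
    """Sum the amount over every dimension that is not listed in dims."""
    cube = defaultdict(int)
    for year, quarter, branch, amount in facts:
        record = {"year": year, "quarter": quarter, "branch": branch}
        cube[tuple(record[d] for d in dims)] += amount
    return dict(cube)


by_year = aggregate(sales, ["year"])                   # roll up quarter and branch
by_year_branch = aggregate(sales, ["year", "branch"])  # roll up quarter only
```

Eight fact rows shrink to two yearly totals, and a query on annual sales can be answered from the smaller table without touching the detailed data.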
Categories of Data Cube :
 Dimensions:
 Represents categories of data such as time or location.
 Each dimension includes different levels of categories.
 Example :
 Measures:
 These are the actual data values that occupy the cells as defined by the dimensions selected.
 Measures include facts or variables typically stored as numerical fields.
 Example :
References
 https://en.wikipedia.org/wiki/Data_cube
 https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
 http://www.lastnightstudy.com/Show?id=44/Data-Reduction-In-Data-Mining