Unit 2 : Data Preprocessing
Presented By : Tekendra Nath Yogi
Tekendranath@gmail.com
College Of Applied Business And Technology
Contd…
• Outline:
– Data types and attribute types
– Data preprocessing
– OLAP
– Characteristics of OLAP Systems
– Multidimensional views and data cubes
– Data cube implementations
– Data cube operations
– Guidelines for OLAP Implementation.
6/6/2019 By: Tekendra Nath Yogi
Introduction
• A wide variety of data and of data mining tools exists.
• We need to understand the data in order to select a proper data mining tool.
• So, take a closer look at attributes and data values.
Contd….
• What is Data?
• Collection of data objects and their
attributes
• An attribute is a property or characteristic
of an object
– Examples: eye color of a person,
temperature, etc.
– Attribute is also known as variable,
field, characteristic, dimension, or
feature
• A collection of attributes describes an
object
– Object is also known as record, point,
case, sample, entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Columns are attributes/dimensions; rows are objects.)
Contd….
• Attribute Values:
– Attribute values are numbers or symbols assigned to an attribute for a
particular object.
– Distinction between attributes and attribute values
• Same attribute can be mapped to different attribute values
– Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
– Example: Attribute values for ID and age are integers
– But properties of attribute values can be different
Contd….
• Types of Attributes :
– The type of attribute is determined by the set of possible values the attribute
can have.
– There are different types of attributes:
• Nominal attributes
• Binary attributes
• Ordinal attributes
• Numeric attributes
– Interval-scaled attributes
– Ratio-scaled attributes
Contd..
• Nominal Attributes:
– Nominal means “relating to names.”
– The values of a nominal attribute are symbols or names of
things.
– Each value represents some kind of category, code, or state, and
so nominal attributes are also referred to as categorical.
– The values do not have any meaningful order.
– E.g.: Hair_color = {black, brown, grey, red, white, etc.}
Marital_status = {single, married, divorced}
Contd..
• Binary Attributes:
– A binary attribute is a nominal attribute with only two categories or states:
0 or 1, where 0 typically means that the attribute is absent, and 1 means
that it is present.
– Symmetric binary: both outcomes equally important
• e.g., gender = {male, female}
– Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to the most important outcome, usually the rarer one
(e.g., HIV positive), and 0 to the other (e.g., HIV negative)
Contd..
• Ordinal Attribute:
– An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
– E.g.: suppose that drink size corresponds to the size of drinks available at
a fast-food restaurant. This ordinal attribute has three possible values:
small, medium, and large. The values have a meaningful sequence (which
corresponds to increasing drink size); however, we cannot tell from the
values how much bigger, say, a large is than a medium.
Contd..
• Numeric Attributes:
– A numeric attribute is a measurable quantity, represented in integer or real values.
– Numeric attributes can be interval-scaled or ratio-scaled.
– Interval-scaled :
• Interval-scaled attributes are measured on a scale of equal-size units.
• E.g.: calendar dates: For instance, the years 2002 and 2010 are eight years apart.
– Ratio-scaled:
• Ratio-scaled attributes have an inherent zero point, so a value can be
described as being a multiple (or ratio) of another value.
• Examples of ratio-scaled attributes include count attributes such as years
of experience (e.g., when the objects are employees).
Discrete vs. Continuous Attributes
• There are many ways to organize attribute types. The types are not
mutually exclusive.
• Discrete Attribute
– Has only a finite or countably infinite set of values
• E.g., roll number, hair_colour, or the set of words in a collection of
documents
– Sometimes, represented as integer variables
– Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
– Has real numbers as attribute values
• E.g., temperature, Speed, etc
– Practically, real values can only be measured and represented using a finite
number of digits
– Continuous attributes are typically represented as floating-point variables
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Social or information networks
• Ordered
– Video data
– Temporal Data
– Sequential Data
Important Characteristics of Data
– Dimensionality (number of attributes)
• High dimensional data brings a number of challenges
– Sparsity
• Only presence counts
– Resolution
• Patterns depend on the scale
– Size
• Type of analysis may depend on size of data
Record Data
• Data that consists of a collection of records, each of which consists of a
fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional space,
where each dimension represents a distinct attribute
• Such data set can be represented by an m by n matrix, where there
are m rows, one for each object, and n columns, one for each attribute
Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
• Each document becomes a 'term' vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
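The term-vector idea can be sketched in a few lines of Python. This is a minimal illustration, not the method used to build the slide's table: the function name term_vector, the five-term vocabulary, and the sample sentence are all assumptions made up for the example, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return how often each vocabulary term occurs in text, in vocabulary order."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "Team play beat solo play and the team kept the ball"

vector = term_vector(document, vocabulary)
print(vector)  # [2, 0, 2, 1, 0]
```

Each position in the vector is one attribute (a term), and the value is the term's frequency in the document.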
Transaction Data
• A special type of record data, where
– Each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitutes a transaction, while
the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
• Examples:
– Generic graph
– World-wide web
– Social or information networks
Ordered Data
• Video data: sequence of images
• Temporal data: time-series
• Sequential Data: transaction sequences
Data Preprocessing
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data e.g., occupation=“ ”
– noisy: containing errors or outliers e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names e.g., Age=“62”
Birthday=“03/07/1997”
• No quality data, no quality mining results!
• So, data should be preprocessed to make it ready for quality mining
Contd….
• Major Tasks in Data Preprocessing:
– Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
– Data integration
• Integration of multiple databases, data cubes, or files
– Data transformation
• Normalization and aggregation
– Data reduction
• Obtains a representation that is reduced in volume but produces the same or
similar analytical results
– Data discretization
• The raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth,
adult, senior).
1. Data Cleaning
• If data is dirty (incomplete, noisy, inconsistent), then:
– It can cause confusion for the data mining procedure, resulting in
unreliable output.
– Users cannot trust any results of data mining.
• So, data cleaning is required.
• To clean data the following data cleaning tasks are performed:
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
Contd…
• Missing Data:
– Data may not always be available
• E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
– Missing data may be due to:
• equipment malfunction
• data that was inconsistent with other recorded data and was therefore deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
– Missing data may need to be inferred.
Contd….
• How to Handle Missing Data?
– Ignore the tuple:
• usually done when class label is missing
– Fill in the missing value manually:
• tedious and often infeasible for large data sets
– Use a global constant to fill in the missing value:
• e.g., “unknown”
– Use the attribute mean to fill in the missing value
– Use the most probable value to fill in the missing value
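Two of the strategies above, filling with a global constant and filling with the attribute mean, can be sketched as follows. This is a minimal sketch under stated assumptions: missing values are represented by None, the function names are hypothetical, and the income figures are invented for illustration.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace every missing (None) entry with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing (None) entry with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

# Hypothetical taxable incomes in thousands; None marks a missing value.
incomes = [125, None, 70, 120, None, 60]

print(fill_with_mean(incomes))      # [125, 93.75, 70, 120, 93.75, 60]
print(fill_with_constant(incomes))  # [125, 'unknown', 70, 120, 'unknown', 60]
```

Mean filling keeps the attribute numeric, whereas a constant such as "unknown" may be treated by a mining algorithm as a value in its own right.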
Contd…
• Noise:
– Random error or variance in a measured variable
– Noisy (incorrect) attribute values may be due to:
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
Contd….
• How to Handle Noisy Data?
– Binning method:
• first sort data and partition into bins
• then smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
– Clustering
• detect and remove outliers
– Combined computer and human inspection
• detect doubtful values automatically and have a human check them
– Regression
• smooth by fitting the data to regression functions
Binning
• Three step process:
– Sort the data
– Make the bins by partitioning
– Smooth the data in each bin
Contd…
• Partitioning techniques to make bins:
– Equal-width (distance) partitioning
– Equal-depth (frequency) partitioning
Contd…
– Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B-A)/N.
– Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately
the same number of samples
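The two partitioning techniques can be sketched as below. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, the equal-width version assumes the highest value is strictly greater than the lowest, and it places the highest value in the last bin.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes high > low
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The highest value would land in bin N, so clamp it into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (approximately) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print(equal_width_bins(prices, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With A = 4, B = 34 and N = 3 the width is 10, so the equal-width bins hold different numbers of values, while the equal-depth bins hold four values each.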
Example: Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
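The smoothing steps of this example can be reproduced with a short sketch. The function names are hypothetical, bin means are rounded to the nearest integer as on the slide, and the choice to send a value that is equally close to both boundaries to the lower one is an assumption (no such tie occurs in this data).

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption for this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```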
2. Data Integration
• Data integration combines data from multiple sources into a
coherent store (e.g., a data warehouse).
• Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
• This can help improve the accuracy and speed of the subsequent
data mining process.
Contd..
• Entity identification problem:
• The same real-world entity may be described by different attribute names
or values in different sources.
• So how can we tell that two attributes refer to the same real-world
entity?
• For example: how can the data analyst or the computer be sure that
customer_id in one database and cust_number in another refer to the
same attribute?
• And how should one attribute be mapped to the other during
integration?
• Solution: metadata can be used to guide the transformation of data.
3. Data Transformation
• In this preprocessing step, the data are transformed or consolidated so that
the resulting mining process may be more efficient, and the patterns found
may be easier to understand.
• Data Transformation Strategies:
– Aggregation: For example, the daily sales data may be aggregated so
as to compute monthly and annual total amounts.
– Normalization: where the attribute data are scaled so as to fall within
a smaller range, such as: -1.0 to 1.0, or 0.0 to 1.0.
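One common way to scale an attribute into such a range is min-max normalization, which maps the old minimum to the new minimum and the old maximum to the new maximum. The slide does not name a specific method, so this choice, the function name, and the sample values are assumptions made for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumes the values are not all identical
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]

# Hypothetical incomes in thousands.
incomes = [60, 70, 100, 120, 220]

print(min_max_normalize(incomes))             # [0.0, 0.0625, 0.25, 0.375, 1.0]
print(min_max_normalize(incomes, -1.0, 1.0))  # [-1.0, -0.875, -0.5, -0.25, 1.0]
```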
4. Data Reduction
• A data warehouse may store terabytes of data:
– Complex data mining may take a very long time to run on the complete
data set.
• Data reduction
– Obtains a reduced representation of the data set that is much smaller in
volume yet produces the same (or almost the same) analytical results.
Data Reduction Strategies
• Data reduction strategies
– Data cube aggregation
– Dimensionality reduction
– Histograms
– Clustering
– Sampling
Contd..
• Data Cube Aggregation:
– Suppose the data consist of sales per quarter for the years 2008 to 2010.
– Suppose the analyst is interested in the annual sales (total per year), rather
than the total per quarter.
– Thus, the data can be aggregated so that the resulting data summarize the
total sales per year instead of per quarter.
– The resulting data set is smaller in volume, without loss of information
necessary for the analysis task.
– This aggregation is illustrated in the figure below.
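The quarter-to-year aggregation can be sketched as follows. The sales figures are hypothetical (the slide's figure is not reproduced here), and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical quarterly sales, keyed by (year, quarter).
quarterly_sales = {
    (2008, "Q1"): 200, (2008, "Q2"): 400, (2008, "Q3"): 350, (2008, "Q4"): 550,
    (2009, "Q1"): 300, (2009, "Q2"): 400, (2009, "Q3"): 350, (2009, "Q4"): 650,
    (2010, "Q1"): 310, (2010, "Q2"): 450, (2010, "Q3"): 400, (2010, "Q4"): 700,
}

def aggregate_by_year(quarterly):
    """Sum the quarterly amounts into one total per year."""
    annual = defaultdict(int)
    for (year, _quarter), amount in quarterly.items():
        annual[year] += amount
    return dict(annual)

annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)  # {2008: 1500, 2009: 1700, 2010: 1860}
```

Twelve stored values shrink to three, and nothing needed for a per-year analysis is lost.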
Contd..
– Dimensionality Reduction:
• To reduce dimensionality, perform feature selection (i.e., attribute
subset selection).
• That is, create a data set containing only the attributes relevant to the
current analysis.
• This reduces the number of patterns in the result of data mining, so
the result becomes easier to understand.
Contd..
For example:
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree that tests A4 at the root and A1 and A6 below it, with leaves labelled Class 1 and Class 2]
=> Reduced attribute set: {A1, A4, A6}
Histograms
• A popular data reduction technique
• Divide data into buckets and store average (sum) for each bucket
[Figure: histogram with horizontal-axis ticks from 10000 to 90000 and vertical-axis ticks from 0 to 40]
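A bucketed summary of this kind can be sketched as below. This is a minimal sketch assuming equal-width buckets over a known range; the function name, the bucket boundaries, and the data values are hypothetical, and each bucket stores a count and a sum (from which the average follows).

```python
def equal_width_histogram(values, low, high, n_buckets):
    """Summarize values as (count, total) pairs for equal-width buckets."""
    width = (high - low) / n_buckets
    buckets = [[0, 0] for _ in range(n_buckets)]
    for v in values:
        # Clamp so that a value equal to `high` falls in the last bucket.
        i = min(int((v - low) / width), n_buckets - 1)
        buckets[i][0] += 1  # count
        buckets[i][1] += v  # running sum
    return [tuple(b) for b in buckets]

# Hypothetical values in the range 0 to 100000.
values = [12000, 18000, 25000, 31000, 36000, 52000, 58000, 95000]

histogram = equal_width_histogram(values, 0, 100000, 5)
print(histogram)
# [(2, 30000), (3, 92000), (2, 110000), (0, 0), (1, 95000)]
```

Only five (count, sum) pairs are kept in place of the raw values.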
Clustering
• Partition the data set into clusters; then only the cluster
representations need to be stored.
Sampling
• Sampling is the main technique employed for data reduction.
– It is often used for both the preliminary investigation of the data
and the final data analysis.
• Statisticians often sample because obtaining the entire set of
data of interest is too expensive or time consuming.
• Sampling is typically used in data mining because processing
the entire set of data of interest is too expensive or time
consuming.
Contd…
• The key principle for effective sampling is the following:
– Using a sample will work almost as well as using the entire data set, if
the sample is representative
– A sample is representative if it has approximately the same properties
(of interest) as the original set of data
Data Warehouse: A Three-Tiered Architecture
[Figure: three-tier data warehouse architecture]
– Data sources: operational DBs and other sources, which are extracted, transformed, loaded, and refreshed into the warehouse (with a monitor & integrator and metadata)
– Data storage: the data warehouse and data marts, which serve the OLAP server
– OLAP engine: the OLAP server
– Front-end tools: analysis, query, reports, and data mining
OLAP
• OLAP is a software technology concerned with fast analysis of enterprise
information.
• Often OLAP systems are data warehouse front-end software tools that make
aggregate data available efficiently to an enterprise's decision makers
(analysts, managers and executives).
• Major OLAP applications are trend analysis over a number of time periods;
slicing, dicing, drill-down and roll-up to look at the data at different levels
of detail; and pivoting or rotating to obtain a new multidimensional view.
Characteristics of OLAP Systems
• Users:
– OLAP systems are designed for decision makers. Therefore, an OLAP
system is likely to be accessed only by a selected group of managers and
may be used by dozens of users.
• Functions:
– OLAP systems are management-critical: they support an enterprise's
decision-support functions through analytical investigation.
Contd….
• Nature:
– Nature of usage of OLTP system is repetitive
– Nature of usage of OLAP system is mostly ad hoc
• Design:
– OLAP systems are designed to be subject-oriented.
– OLAP systems view enterprise information as multidimensional.
Contd….
• Data:
– OLAP systems require historical data over several years, since trends are
often important in decision making.
• Kinds of use:
– OLAP systems normally do not update the data but refresh the data.
FASMI Characteristics of OLAP systems
• The FASMI characteristics of OLAP systems, the name
derived from the first letters of the characteristics, are:
– Fast
– Analytic
– Shared
– Multidimensional
– Information
Contd..
• Fast:
– OLAP queries should be answered very quickly, perhaps
within seconds.
– To achieve such performance:
• the data structure must be efficient and the hardware must be
powerful.
• Full pre-computation of aggregates gives the fastest responses but is often
impractical.
• A practical compromise is to pre-compute the most commonly queried aggregates.
Contd..
• Analytic:
– An OLAP system must provide rich analytic functionality and it is
expected that most OLAP queries can be answered without any
programming.
– The system should be able to cope with any relevant queries for the
application and the user.
Contd..
• Shared:
– An OLAP system is a shared resource although it is unlikely to be
shared by hundreds of users.
– An OLAP system is likely to be accessed only by a selected group of
managers and may be used by mere dozens of users.
– Being a shared system, an OLAP system should provide adequate
security for confidentiality as well as integrity.
Contd..
• Multidimensional:
– This is the basic requirement.
– Whatever OLAP software is being used, it must provide a
multidimensional conceptual view of the data.
Contd..
• Information:
– OLAP systems usually obtain information from a data warehouse.
– The system should be able to handle a large amount of input data.
– The capacity of an OLAP system to handle information and its
integration with the data warehouse may be critical.
Codd’s OLAP characteristics
• The most important characteristics of OLAP systems given by
Codd are as follows:
– Multidimensional conceptual view
– Accessibility (OLAP as a mediator)
– Batch extraction vs interpretive
– Multi-user support
– Storing OLAP results
– Extraction of missing values
– Treatment of missing values
– Uniform reporting performance
– Generic dimensionality
– Unlimited dimensions and aggregation levels
Contd..
• Multidimensional conceptual view:
– By requiring a multidimensional view, it is possible to carry out
operations like slice and dice.
• Accessibility (OLAP as a mediator):
– The OLAP software should sit between the data sources (e.g., a data
warehouse) and an OLAP front end.
Contd..
• Batch extraction versus interpretive:
– An OLAP system should provide multidimensional data staging plus
partial pre-calculation of aggregates in large multidimensional
databases.
• Multi-user support:
– Since the OLAP system is shared, the OLAP software should provide
many normal database operations including retrieval, update,
concurrency control, integrity and security.
Contd..
• Storing OLAP results:
– OLAP results data should be kept separate from source data.
• Extraction of missing values:
– The OLAP system should distinguish missing values from zero values.
– A large data cube may have a large number of zeros as well as some
missing values.
– If a distinction is not made between zero values and missing values, the
aggregates are likely to be computed incorrectly.
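The effect on an aggregate can be seen in a small sketch. Missing cells are represented by None, and the function name and the sales figures are hypothetical.

```python
def average(cells, missing_as_zero=False):
    """Average a list of cube cells, where None marks a missing value."""
    if missing_as_zero:
        # Incorrect treatment: a missing cell is counted as a real zero.
        filled = [0 if c is None else c for c in cells]
        return sum(filled) / len(filled)
    # Correct treatment: missing cells are left out, genuine zeros are kept.
    known = [c for c in cells if c is not None]
    return sum(known) / len(known)

# Hypothetical monthly sales: one genuine zero and one missing value.
monthly_sales = [10, 0, None, 20]

print(average(monthly_sales))                        # 10.0
print(average(monthly_sales, missing_as_zero=True))  # 7.5
```

Treating the missing cell as zero pulls the average down from 10.0 to 7.5, which is the kind of incorrect aggregate the requirement is meant to prevent.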
Contd..
• Treatment of missing values:
– An OLAP system should ignore all missing values regardless of their
source.
– Correct aggregate values will be computed once the missing values are
ignored.
• Uniform reporting performance:
– Increasing the number of dimensions or database size should not
significantly degrade the reporting performance of the OLAP system.
– This is a good objective, although it may be difficult to achieve in
practice.
Contd..
• Generic dimensionality:
– An OLAP system should treat each dimension as equivalent in both its
structure and operational capabilities.
• Unlimited dimensions and aggregation levels:
– An OLAP system should allow unlimited dimensions and
aggregation levels.
– In practice, however, this is undesirable.
Multidimensional data model
• Data warehouses and OLAP tools are based on a multidimensional data
model.
• This model views data in the form of a data cube (which models n-dimensional
data).
• What is a data cube?
– The data cube is a metaphor for multidimensional data storage.
• A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts.
– Geometric cubes are 3-D structures, but in data warehousing
the data cube is n-dimensional and does not confine data to 3-D.
Contd…
• Dimensions:
– Dimensions are the perspectives or entities with respect to which an
organization wants to keep records.
– For example: in a sales data warehouse for a store, the dimensions can be time,
item, branch, and location.
– These dimensions allow the store to keep track of things like monthly sales
of items and the branches and locations at which the items were sold.
• Dimension table:
– Each dimension may have a table associated with it, called a dimension table,
which further describes the dimension.
– E.g., item (item_name, brand, type), or time (day, week, month, quarter, year)
Contd…
• Fact and Fact table:
– A multidimensional data model is usually organized around a central
theme (e.g., sales).
– Numeric measures on this theme are called facts, and they are used to
analyze the relationships between the dimensions.
– The fact table contains the names of the facts, or measures (such as
dollars_sold) , as well as keys to each of the related dimension tables.
Contd..
• A simple 2-D data cube: a table or spreadsheet
• E.g.,
Contd..
• 3-D data cube: a set of similarly structured 2-D tables stacked on top of one
another.
• E.g.,
Contd..
• The 3-D data in the table, shown there as a series of 2-D tables, can also be represented
as a 3-D data cube, as in the figure below.
• Fig: A 3-D data cube representation of the data in the table on the previous slide, according to
time, item, and location.
Contd..
• 4-D cubes: a 4-D cube is a series of 3-D cubes, as shown in the figure below.
• In this way, we may display any n-dimensional data as a series of
(n-1)-dimensional "cubes."
Data Cube implementation
• Data warehouses contain huge volumes of data. OLAP servers demand that
decision support queries be answered in the order of seconds. It is therefore crucial for
data warehouse systems to support highly efficient cube computation
techniques, access methods and query processing techniques.
• Efficient data cube computation:
– No Materialization
– Full Materialization
– Partial Materialization
• Access methods: how OLAP data can be indexed (bitmap and join indices)
• Query processing technique
• OLAP server types
– ROLAP
– MOLAP
– HOLAP
Contd…..
• Cube: A Lattice of Cuboids
– Given a set of dimensions, we can generate a cuboid for each of the
possible subsets of the given dimensions.
– The result would form a lattice of cuboids, each showing the data at a
different level of summarization (or group-by/aggregation).
– In data warehousing literature, the most detailed part of the cube is called
a base cuboid. The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid. The lattice of cuboids forms a
data cube.
Contd…
• Example:
– Suppose that you would like to create a data cube for a sales store, with
city, item, and year as the dimensions of the data cube and
sales_in_dollars as the measure.
– The possible group-bys are the following:
• {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ( )}
– These group-bys form a lattice of cuboids for the data cube, as shown
in the figure below.
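The list of group-bys can be generated mechanically as all subsets of the dimension set. The sketch below uses Python's standard itertools.combinations; the function name cuboids is an assumption.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every subset of dimensions, from the base cuboid down to the apex."""
    return [
        combo
        for k in range(len(dimensions), -1, -1)  # n-D first, 0-D last
        for combo in combinations(dimensions, k)
    ]

lattice = cuboids(["city", "item", "year"])
for group_by in lattice:
    print(group_by)
# ('city', 'item', 'year')
# ('city', 'item')
# ('city', 'year')
# ('item', 'year')
# ('city',)
# ('item',)
# ('year',)
# ()
```

For n dimensions without concept hierarchies this yields 2^n cuboids, here 2^3 = 8.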
Contd..
[Figure: lattice of cuboids for the dimensions city, item, and year]
Contd…
• Two special types of cuboids:
– Base Cuboid:
• The base cuboid contains all the data for any combination of the given
n dimensions.
• So, the most detailed part of the cube is called a base cuboid.
• The base cuboid is the least generalized(most specific) of the cuboids.
– Apex cuboid:
• The top most 0-D cuboid, which holds the highest level of
summarization is called the apex cuboid.
• The apex cuboid is the most general (least specific) of the cuboids and
is often denoted as all.
Contd…
• Curse of Dimensionality:
– OLAP may need to access different cuboids for different queries.
– A good idea: compute all, or at least some, of the cuboids in a data cube in advance.
– Pre-computation leads to fast response time and avoids some redundant computation.
– But the required storage space may explode (because of pre-computing all cuboids, a large number of dimensions, and a large number of concept hierarchies on the dimensions).
– This problem is referred to as the curse of dimensionality.
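How quickly the number of cuboids grows can be estimated with a formula commonly given in data warehousing texts (it is not stated on the slide): if dimension i has L_i hierarchy levels, not counting the virtual top level "all", the total number of cuboids is the product of (L_i + 1) over all dimensions. A minimal sketch, with a hypothetical function name:

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] levels
    (excluding the virtual top level 'all', which accounts for the +1)."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions without hierarchies: 2**3 cuboids.
print(total_cuboids([1, 1, 1]))  # 8

# Ten dimensions with four levels each: 5**10 cuboids.
print(total_cuboids([4] * 10))   # 9765625
```

Ten modest dimensions already give nearly ten million cuboids, which is why materializing all of them is usually unrealistic.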
Data Cube Materialization
• There are three choices for data cube materialization (computation of
cuboids) given a base cuboid:
1. No Materialization
2. Full Materialization
3. Partial Materialization
Contd..
• No Materialization
– Do not pre-compute any of the "non-base" cuboids.
– This leads to computing expensive multidimensional aggregates on the fly,
which can be extremely slow.
Contd..
• Full Materialization
– Pre-compute all of the cuboids.
– The resulting lattice of computed cuboids is referred to as the full cube.
– This choice typically requires huge amounts of memory space in order to
store all of the pre-computed cuboids.
Contd..
• Partial Materialization
– Selectively compute a proper subset of the whole set of possible cuboids.
– It represents an interesting trade-off between storage space and response
time.
– The partial materialization of cuboids or sub-cubes should consider three
factors:
• Identify the subset of cuboids or sub-cubes to materialize
• Exploit the materialized cuboids or sub-cubes during query
processing
• Efficiently update the materialized cuboids or sub-cubes during load
and refresh.
Indexing OLAP Data
• Indexing facilitates efficient data access and further speeds up query
processing.
• The two most commonly used methods are:
• the bitmap indexing method and
• the join indexing method
Contd..
• Bitmap Indexing:
• In the bitmap index for a given attribute, there is a distinct bit vector, Bv,
for each value v in the domain of the attribute.
• If the attribute has the value v for a given row in the data table, then the bit
representing that value is set to 1 in the corresponding row of the bitmap
index. All other bits for that row are set to 0.
• Bitmap indexing reduces join, aggregation, and comparison operations to
bit arithmetic.
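A bitmap index of this kind can be sketched with plain Python lists. The base table below is hypothetical (four rows with abbreviated item and city codes), as are the function and variable names; real systems store the vectors as packed bits rather than lists.

```python
def bitmap_index(rows, attribute):
    """Build one bit vector per distinct value of the attribute."""
    values = sorted({row[attribute] for row in rows})
    return {v: [1 if row[attribute] == v else 0 for row in rows] for v in values}

# Hypothetical base table with the dimensions item and city.
base_table = [
    {"rid": "R1", "item": "H", "city": "V"},
    {"rid": "R2", "item": "C", "city": "V"},
    {"rid": "R3", "item": "P", "city": "T"},
    {"rid": "R4", "item": "H", "city": "T"},
]

item_index = bitmap_index(base_table, "item")
city_index = bitmap_index(base_table, "city")
print(item_index["H"])  # [1, 0, 0, 1]
print(city_index["T"])  # [0, 0, 1, 1]

# The selection item = "H" AND city = "T" becomes a bitwise AND of two vectors.
both = [a & b for a, b in zip(item_index["H"], city_index["T"])]
matches = [row["rid"] for row, bit in zip(base_table, both) if bit]
print(matches)  # ['R4']
```

The comparison on two attributes is answered without scanning the attribute values again, only by combining bits.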
Contd..
The figure below shows a base (data) table containing the dimensions item and city,
and its mapping to bitmap index tables for each of the dimensions.
Contd..
• Join indexing:
• The join indexing method gained popularity from its use in
relational database query processing.
• Join indexing registers the joinable rows of two relations from a
relational database.
• Hence, the join index records can identify joinable tuples
without performing costly join operations.
Contd..
Example: the join index relationship between the sales fact table and the location and
item dimension tables is shown in the figure below.
Contd..
Here, the “Main Street” value in the location dimension table joins with tuples T57,
T238, and T884 of the sales fact table.
Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and
T459 of the sales fact table.
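The join indices for this example can be sketched as dictionaries that map a dimension value to the fact-table tuples it joins with. The tuple identifiers and the "Main Street" and "Sony-TV" values come from the example; the other item and location values ("Item-A", "Item-C", "Location-B") are hypothetical placeholders, as is the function name.

```python
from collections import defaultdict

def join_index(fact_rows, key):
    """Map each value of a dimension key to the fact tuples that join with it."""
    index = defaultdict(list)
    for tid, row in fact_rows.items():
        index[row[key]].append(tid)
    return dict(index)

# Sales fact table; values other than "Main Street" and "Sony-TV" are placeholders.
sales = {
    "T57":  {"location": "Main Street", "item": "Sony-TV"},
    "T238": {"location": "Main Street", "item": "Item-A"},
    "T459": {"location": "Location-B",  "item": "Sony-TV"},
    "T884": {"location": "Main Street", "item": "Item-C"},
}

location_index = join_index(sales, "location")
item_index = join_index(sales, "item")
print(location_index["Main Street"])  # ['T57', 'T238', 'T884']
print(item_index["Sony-TV"])          # ['T57', 'T459']
```

Once built, the index answers "which sales tuples belong to Main Street?" by a lookup, without performing the join again.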
The corresponding join index tables are shown in the figure below.
Efficient Processing of OLAP Queries
• The purpose of materializing cuboids and constructing OLAP
index structures is to speed up query processing in data cubes.
• Given materialized views, query processing should proceed as
follows:
1. Determine which operations should be performed on the available
cuboids.
2. Determine to which materialized cuboid(s) the relevant operations
should be applied.
Types of OLAP Servers
• OLAP servers present business users with multidimensional data from data
warehouses, without concerns regarding how or where the data are stored.
• However, the physical architecture and implementation of OLAP servers
must consider data storage issues.
• Implementations of a warehouse server for OLAP processing include the
following:
– Relational OLAP (ROLAP)
– Multidimensional OLAP (MOLAP)
– Hybrid OLAP (HOLAP)
Contd..
• Relational OLAP (ROLAP) Server:
– These are intermediate servers that stand between a relational
back-end server and client front-end tools.
– They use a relational or extended-relational DBMS to store and manage
warehouse data, and OLAP middleware to support missing pieces.
– ROLAP servers include optimization for each DBMS back end,
implementation of aggregation navigation logic, and additional tools and
services.
– ROLAP technology tends to have greater scalability than MOLAP
technology.
Contd..
• Multidimensional OLAP (MOLAP) Server:
– These servers support multidimensional views of data through array-based
multidimensional storage engines.
– They map multidimensional views directly to data cube array structures.
– The advantage of using a data cube is that it allows fast indexing to
pre-computed summarized data.
– In multidimensional data stores, the storage utilization may be low if the
data set is sparse.
Contd..
• Hybrid OLAP (HOLAP) Servers:
– The hybrid OLAP approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and the faster computation
of MOLAP.
Contd..
• MOLAP vs. ROLAP:
MOLAP | ROLAP
Information retrieval is fast. | Information retrieval is comparatively slow.
Uses sparse array to store data-sets. | Uses relational table.
MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users.
Maintains a separate database for data cubes. | It may not require space other than available in the Data warehouse.
DBMS facility is weak. | DBMS facility is strong.
Data Cube Operations (OLAP Operations)
• A number of operations may be applied to data cubes for
OLAP.
• The common ones are:
– Slice
– Dice
– Roll-up (Drill-up)
– Drill-down (Roll-down)
– Pivot (Rotate)
Contd..
• The cube contains the dimensions: location, time,
and item, where location is aggregated with
respect to city values, time is aggregated with
respect to quarters, and item is aggregated with
respect to item types.
– The measure displayed is dollars sold (in
thousands).
– The data examined are for the cities Chicago,
New York, Toronto, and Vancouver.
Figure: A data cube for a sales store, used to illustrate the data cube operations.
Contd..
• Slice:
– The slice operation performs a
selection on one dimension of the
given cube, thus creating a
sub-cube.
– In the example, the sales
data are selected from the central
cube for the dimension time using
the criterion time = “Q1”.
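A slice on time = "Q1" can be sketched as below. The sketch assumes the cube is held as a dictionary keyed by (location, time, item); that representation, the `slice_cube` helper, and the cell values are assumptions made for illustration.

```python
# Cube cells keyed by (location, time, item); values are made up.
cube = {
    ("Toronto", "Q1", "Mobile"): 100,
    ("Toronto", "Q2", "Mobile"): 150,
    ("Vancouver", "Q1", "Modem"): 40,
    ("Chicago", "Q1", "Phone"): 60,
}


def slice_cube(cube, time):
    """Fix the time dimension to one value. The result keeps only the
    two remaining dimensions (location, item)."""
    return {
        (location, item): value
        for (location, t, item), value in cube.items()
        if t == time
    }


q1 = slice_cube(cube, "Q1")
```

The result has one dimension fewer than the input, because the sliced dimension is pinned to a single value.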
Contd..
• Dice:
– The dice operation performs a
selection on two or more
dimensions of a given cube and
creates a sub-cube.
– The example dice is based on the
following selection criteria:
(location = “Toronto” or “Vancouver”) and
(time = “Q1” or “Q2”) and (item =
“Mobile” or “Modem”).
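The dice criteria above translate directly into a filter over all three dimensions. As before, the dictionary representation of the cube, the `dice` helper, and the cell values are assumptions for illustration.

```python
# Cube cells keyed by (location, time, item); values are made up.
cube = {
    ("Toronto", "Q1", "Mobile"): 100,
    ("Toronto", "Q3", "Mobile"): 90,
    ("Vancouver", "Q2", "Modem"): 40,
    ("Vancouver", "Q1", "Phone"): 30,
    ("Chicago", "Q1", "Mobile"): 60,
}


def dice(cube, locations, times, items):
    """Keep only cells whose coordinates fall inside the allowed set
    on every dimension; all three dimensions are retained."""
    return {
        key: value
        for key, value in cube.items()
        if key[0] in locations and key[1] in times and key[2] in items
    }


sub_cube = dice(
    cube,
    locations={"Toronto", "Vancouver"},
    times={"Q1", "Q2"},
    items={"Mobile", "Modem"},
)
```

Unlike a slice, the sub-cube keeps every dimension; each one is merely restricted to a subset of its values.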
Contd..
• Roll-up (Drill-up):
– The roll-up operation
performs aggregation on a
data cube, either:
• by climbing up a concept
hierarchy for a dimension,
or
• by dimension reduction.
– For example, rolling up location
from city to country combines the values for Toronto and Vancouver into a single value for Canada.
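Both forms of roll-up can be sketched in code. The city-to-country mapping is an assumed concept hierarchy for the location dimension, and the two-dimensional cube, helper names, and values are illustrative only.

```python
from collections import defaultdict

# Assumed concept hierarchy for location: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

# Cells keyed by (city, quarter); values are made up.
cube = {
    ("Chicago", "Q1"): 60,
    ("New York", "Q1"): 70,
    ("Toronto", "Q1"): 100,
    ("Vancouver", "Q1"): 40,
    ("Toronto", "Q2"): 150,
}


def roll_up_location(cube):
    """Climb the location hierarchy from city to country, summing cells."""
    totals = defaultdict(int)
    for (city, quarter), value in cube.items():
        totals[(CITY_TO_COUNTRY[city], quarter)] += value
    return dict(totals)


def drop_time(cube):
    """Dimension reduction: aggregate the time dimension away entirely."""
    totals = defaultdict(int)
    for (city, _quarter), value in cube.items():
        totals[city] += value
    return dict(totals)


by_country = roll_up_location(cube)
by_city = drop_time(cube)
```

In both cases the grand total is unchanged; only the level of detail at which it is reported becomes coarser.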
Contd..
• Drill-down (Roll-down):
– Drill-down is the reverse of
roll-up. It is performed in either of the
following ways:
• by stepping down a concept hierarchy for a
dimension, or
• by introducing a new dimension.
– It allows users to navigate among
different levels of data, from the most
summarized (up) to the most detailed (down).
– For example, drilling down on time from
quarter to month splits each quarterly value into three monthly values.
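Drill-down needs the more detailed data to be available underneath the summary. The sketch below assumes month-level data and a month-to-quarter hierarchy; the mapping, the helper names, and the values are hypothetical.

```python
from collections import defaultdict

# Assumed concept hierarchy for time: month -> quarter (partial, for the example).
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Detailed cells keyed by (city, month); values are made up.
monthly = {
    ("Toronto", "Jan"): 30,
    ("Toronto", "Feb"): 30,
    ("Toronto", "Mar"): 40,
    ("Toronto", "Apr"): 50,
}


def by_quarter(monthly):
    """The summarized (rolled-up) view: one cell per (city, quarter)."""
    totals = defaultdict(int)
    for (city, month), value in monthly.items():
        totals[(city, MONTH_TO_QUARTER[month])] += value
    return dict(totals)


def drill_down(monthly, city, quarter):
    """Expand one quarterly cell into the monthly cells beneath it."""
    return {
        month: value
        for (c, month), value in monthly.items()
        if c == city and MONTH_TO_QUARTER[month] == quarter
    }


summary = by_quarter(monthly)
detail = drill_down(monthly, "Toronto", "Q1")
```

The monthly values returned by `drill_down` add up to the quarterly cell they came from, which is what makes drill-down the reverse of roll-up.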
Contd..
• Pivot:
– Pivot, also known as
rotation, changes the
dimensional orientation of the
cube, i.e. it rotates the axes to
view the data from
different perspectives. For example,
a 2-D view can be pivoted by
swapping its row and column dimensions.
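For a 2-D view, pivoting amounts to swapping the row and column axes. The sketch below assumes the view is a nested dictionary with location as rows and item as columns; the `pivot` helper and the values are illustrative.

```python
# 2-D view: rows are locations, columns are items; values are made up.
by_location = {
    "Toronto": {"Mobile": 100, "Modem": 20},
    "Vancouver": {"Mobile": 80, "Modem": 40},
}


def pivot(table):
    """Swap the row and column axes of a 2-D view. No value changes;
    only the orientation of the data does."""
    rotated = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


by_item = pivot(by_location)
```

Pivoting twice returns the original view, which reflects that the operation rearranges the presentation without aggregating or filtering anything.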
Guidelines for OLAP implementation
• Guidelines for the successful implementation of
OLAP include the following:
– Vision
– Senior management support
– Selecting an OLAP tool
– Corporate strategy
– Focus on the users
– Joint management
– Review and adapt
Contd..
• Vision:
– The OLAP team must, in consultation with the users, develop a clear
vision for the OLAP system. This vision, including the business
objectives, should be clearly defined, understood, and shared by the
stakeholders.
• Senior management support:
– The OLAP project should be fully supported by senior managers.
Since a data warehouse may already have been developed, this should
not be difficult.
Contd..
• Selecting an OLAP tool:
– The OLAP team should familiarize themselves with the ROLAP and
MOLAP tools available in the market. Since the tools are quite different,
careful planning may be required to select a tool that is appropriate
for the enterprise. In some situations, a combination of ROLAP and
MOLAP may be most effective.
• Corporate strategy:
– The OLAP strategy should fit with the enterprise strategy and business
objectives. A good fit will result in the OLAP tools being used more
widely.
Contd..
• Focus on users:
– The OLAP project should be focused on users. Users should, in
consultation with the technical professionals, decide which tasks will be
done first and which will be done later. Attempts should be made to
provide each user with a tool suitable for that person's skill level and
information needs. A good graphical user interface should be provided to
non-technical users. The project can only be successful with the full
support of the users.
Contd..
• Joint Management:
– The OLAP project must be managed by both IT and business
professionals. Many other people should be involved in supplying ideas.
An appropriate committee structure may be necessary to channel these
ideas.
• Review and adapt:
– Organizations evolve, and so must the OLAP system. Regular reviews of
the project may be required to ensure that the project is meeting the
current needs of the enterprise.
Home Work
• What are dimensions, members, measures, and fact tables?
• List the major differences between OLTP systems and OLAP systems.
• What is OLAP, and what is its purpose? List the characteristics of OLAP systems.
• What is a data cube, and what is its purpose? Use an example to illustrate
the use of a data cube.
• What are ROLAP and MOLAP? Describe the two approaches and list their
advantages and disadvantages.
• Describe the OLAP (cube) operations roll-up, drill-down,
slice, and dice.
• List the guidelines for implementing OLAP.
Thank You !
BIM Data Mining Unit2 by Tekendra Nath Yogi

  • 1. Unit 2 : Data Preprocessing Presented By : Tekendra Nath Yogi Tekendranath@gmail.com College Of Applied Business And Technology
  • 2. Contd… • Outline: – Data types and attribute types – Data preprocessing – OLAP – Characteristics of OLAP Systems – Multidimensional views and data cubes – Data cube implementations – Data cube operations – Guidelines for OLAP Implementation. 6/6/2019 By: Tekendra Nath Yogi
  • 3. Introduction • A variety of data and a variety of data mining tools exist. • We need to know about the data in order to select the proper data mining tool. • So, take a closer look at attributes and data values.
  • 4. Contd…. • What is Data? • Collection of data objects and their attributes • An attribute is a property or characteristic of an object – Examples: eye color of a person, temperature, etc. – An attribute is also known as a variable, field, characteristic, dimension, or feature • A collection of attributes describes an object – An object is also known as a record, point, case, sample, entity, or instance • Example (columns are attributes/dimensions, rows are objects):
    Tid | Refund | Marital Status | Taxable Income | Cheat
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | No | Single | 90K | Yes
  • 5. Contd…. • Attribute Values: – Attribute values are numbers or symbols assigned to an attribute for a particular object. – Distinction between attributes and attribute values: • The same attribute can be mapped to different attribute values – Example: height can be measured in feet or meters • Different attributes can be mapped to the same set of values – Example: attribute values for ID and age are integers – But the properties of the attribute values can be different
  • 6. Contd…. • Types of Attributes : – The type of attribute is determined by the set of possible values the attribute can have. – There are different types of attributes: • Nominal attributes • Binary attributes • Ordinal attributes • Numeric attributes – Interval-scaled attributes – Ratio-scaled attributes
  • 7. Contd.. • Nominal Attributes: – Nominal means “relating to names.” – The values of a nominal attribute are symbols or names of things. – Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. – The values do not have any meaningful order. – E.g.: Hair_color = {black, brown, grey, red, white, etc.}, Marital_status = {single, married, divorced}
  • 8. Contd.. • Binary Attributes: – A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. – Symmetric binary: both outcomes are equally important • e.g., gender = {male, female} – Asymmetric binary: the outcomes are not equally important • e.g., medical test (positive vs. negative) • Convention: assign 1 to the most important outcome, which is usually the rarest one (e.g., HIV positive), and 0 to the other (e.g., HIV negative)
  • 9. Contd.. • Ordinal Attribute: – An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known. – E.g.: suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
  • 10. Contd.. • Numeric Attributes: – A numeric attribute is a measurable quantity, represented in integer or real values. – Numeric attributes can be interval-scaled or ratio-scaled. – Interval-scaled attributes: • Interval-scaled attributes are measured on a scale of equal-size units. • E.g.: calendar dates. For instance, the years 2002 and 2010 are eight years apart. – Ratio-scaled attributes: • Ratio-scaled attributes have an inherent zero point, so we can speak of a value as being a multiple (or ratio) of another value. • Examples of ratio-scaled attributes include count attributes such as years of experience (e.g., when the objects are employees).
  • 11. Discrete vs. Continuous Attributes • There are many ways to organize attribute types. The types are not mutually exclusive. • Discrete Attribute – Has only a finite or countably infinite set of values • E.g., roll number, hair_colour, or the set of words in a collection of documents – Sometimes represented as integer variables – Note: binary attributes are a special case of discrete attributes • Continuous Attribute – Has real numbers as attribute values • E.g., temperature, speed, etc. – Practically, real values can only be measured and represented using a finite number of digits – Continuous attributes are typically represented as floating-point variables
  • 12. Types of data sets • Record – Data Matrix – Document Data – Transaction Data • Graph – World Wide Web – Social or information networks • Ordered – Video data – Temporal Data – Sequential Data
  • 13. Important Characteristics of Data – Dimensionality (number of attributes) • High dimensional data brings a number of challenges – Sparsity • Only presence counts – Resolution • Patterns depend on the scale – Size • Type of analysis may depend on size of data
  • 14. Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes
    Tid | Refund | Marital Status | Taxable Income | Cheat
    1 | Yes | Single | 125K | No
    2 | No | Married | 100K | No
    3 | No | Single | 70K | No
    4 | Yes | Married | 120K | No
    5 | No | Divorced | 95K | Yes
    6 | No | Married | 60K | No
    7 | Yes | Divorced | 220K | No
    8 | No | Single | 85K | Yes
    9 | No | Married | 75K | No
    10 | No | Single | 90K | Yes
  • 15. Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
    Projection of x Load | Projection of y Load | Distance | Load | Thickness
    10.23 | 5.27 | 15.22 | 2.7 | 1.2
    12.65 | 6.25 | 16.22 | 2.2 | 1.1
  • 16. Document Data • Each document becomes a “term” vector – Each term is a component (attribute) of the vector – The value of each component is the number of times the corresponding term occurs in the document.
    Document | team | coach | play | ball | score | game | win | lost | timeout | season
    Document 1 | 3 | 0 | 5 | 0 | 2 | 6 | 0 | 2 | 0 | 2
    Document 2 | 0 | 7 | 0 | 2 | 1 | 0 | 0 | 3 | 0 | 0
    Document 3 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 3 | 0
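A minimal sketch of how a term vector could be built by counting term occurrences. The function name `term_vector`, the shortened vocabulary, and the sample sentence are illustrative assumptions, not part of the slides; real systems would also handle punctuation, stemming, and stop words.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, used only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
doc = "the coach said the team will play the game and win the game"
vector = term_vector(doc, vocabulary)
print(vector)
```

Each position of `vector` corresponds to one attribute (term), so a collection of documents becomes a data matrix with one row per document.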
  • 17. Transaction Data • A special type of record data, where – Each record (transaction) involves a set of items. – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
    TID | Items
    1 | Bread, Coke, Milk
    2 | Beer, Bread
    3 | Beer, Coke, Diaper, Milk
    4 | Beer, Bread, Diaper, Milk
    5 | Coke, Diaper, Milk
  • 18. Graph Data • Examples: – Generic graph – World-wide web – Social or information networks
  • 19. Ordered Data • Video data: sequence of images • Temporal data: time-series • Sequential Data: transaction sequences
  • 20. Data Preprocessing • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“ ” – noisy: containing errors or outliers e.g., Salary=“-10” – inconsistent: containing discrepancies in codes or names e.g., Age=“62” Birthday=“03/07/1997” • No quality data, no quality mining results! • So, data should be preprocessed to make it ready for quality mining
  • 21. Contd…. • Major Tasks in Data Preprocessing: – Data cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies – Data integration • Integration of multiple databases, data cubes, or files – Data transformation • Normalization and aggregation – Data reduction • Obtains reduced representation in volume but produces the same or similar analytical results – Data discretization • where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
  • 23. 1. Data Cleaning • If data is dirty (incomplete, noisy, inconsistent), then: – It can cause confusion for the data mining procedure, resulting in unreliable output. – Users cannot trust any results of data mining. • So, data cleaning is required. • To clean data, the following data cleaning tasks are performed: – Fill in missing values – Identify outliers and smooth out noisy data – Correct inconsistent data
  • 24. Contd… • Missing Data: – Data may not be always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data – Missing data may be due to: • equipment malfunction • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not be considered important at the time of entry – Missing data may need to be inferred.
  • 25. Contd…. • How to Handle Missing Data? – Ignore the tuple: • usually done when the class label is missing – Fill in the missing value manually: • tedious and often infeasible – Use a global constant to fill in the missing value: • e.g., “unknown” – Use the attribute mean to fill in the missing value – Use the most probable value to fill in the missing value
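Two of the strategies above (global constant and attribute mean) can be sketched as follows. This is a minimal illustration that assumes missing entries are represented as `None`; the function names and the sample columns are hypothetical.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace missing entries (None) with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)  # assumes at least one observed value
    return [attribute_mean if v is None else v for v in values]

# Hypothetical income column (in thousands) with two missing entries.
incomes = [125, None, 70, 120, None, 60]
filled_incomes = fill_with_mean(incomes)

# Hypothetical occupation column with one missing entry.
filled_jobs = fill_with_constant(["clerk", None, "teacher"])
```

Filling with the mean keeps the attribute's average unchanged but shrinks its variance, which is one reason the most probable value (e.g., predicted by a model) is often preferred when it is available.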
  • 26. Contd… • Noise: – Random error or variance in a measured variable – Noisy (incorrect) attribute values may be due to • faulty data collection instruments • data entry problems • data transmission problems • technology limitations • inconsistency in naming conventions
  • 27. Contd…. • How to Handle Noisy Data? – Binning method: • first sort the data and partition it into bins • then smooth by bin means, by bin medians, by bin boundaries, etc. – Clustering • detect and remove outliers – Combined computer and human inspection • detect doubtful values automatically and have a human check them – Regression • smooth by fitting the data to regression functions
  • 28. Binning • Three-step process: – Sort the data – Make the bins by partitioning – Smooth the data in each bin
  • 29. Contd… • Partitioning techniques to make bins: – Equal-width (distance) partitioning – Equal-depth (frequency) partitioning
  • 30. Contd… – Equal-width (distance) partitioning: • It divides the range into N intervals of equal size. • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N. – Equal-depth (frequency) partitioning: • It divides the range into N intervals, each containing approximately the same number of samples.
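A minimal sketch of equal-width partitioning using W = (B-A)/N. The function name `equal_width_bins` is an assumption for illustration, and the sketch assumes the attribute has at least two distinct values so that W is non-zero.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would land just past the last interval, so clamp it.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
for b in bins:
    print(b)
```

With these prices the intervals are 10 units wide, and the bins end up with different numbers of values, which is the main practical difference from equal-depth partitioning.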
  • 31. Example: Binning Methods for Data Smoothing
    * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
    * Partition into (equi-depth) bins:
      - Bin 1: 4, 8, 9, 15
      - Bin 2: 21, 21, 24, 25
      - Bin 3: 26, 28, 29, 34
    * Smoothing by bin means:
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29
    * Smoothing by bin boundaries:
      - Bin 1: 4, 4, 4, 15
      - Bin 2: 21, 21, 25, 25
      - Bin 3: 26, 26, 26, 34
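The worked example can be reproduced with a short sketch. The function names are illustrative assumptions; the sketch assumes the number of values is divisible by the number of bins, rounds bin means to the nearest integer as the example does, and resolves ties in boundary smoothing toward the lower boundary.

```python
def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    ordered = sorted(values)
    depth = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the rounded bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```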
  • 32. 2. Data Integration • Data integration combines data from multiple sources into a coherent store (e.g., a data warehouse). • Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. • This can help improve the accuracy and speed of the subsequent data mining process.
  • 33. Contd.. • Entity identification problem: • For the same real-world entity, attribute names and values from different sources may differ. • So how do we identify that two attributes refer to the same real-world entity? • For example: how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute? • And how do we transform one attribute into the other during integration? • Solution: metadata can be used to guide the matching and transformation of data.
  • 34. 3. Data Transformation • In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand. • Data Transformation Strategies: – Aggregation: For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. – Normalization: where the attribute data are scaled so as to fall within a smaller range, such as: -1.0 to 1.0, or 0.0 to 1.0.
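One common way to scale an attribute into such a range is min-max normalization. The slides do not name a specific method, so this choice, the function name `min_max_normalize`, and its parameters are assumptions for illustration; the sketch also assumes the attribute has at least two distinct values.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumes span > 0
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

# A few taxable incomes (in thousands) taken from the record-data table.
incomes = [60, 70, 100, 220]
scaled_01 = min_max_normalize(incomes)              # range 0.0 to 1.0
scaled_pm1 = min_max_normalize(incomes, -1.0, 1.0)  # range -1.0 to 1.0
```

Min-max normalization preserves the relative spacing of the values, but a single outlier (such as 220 here) compresses the remaining values into a small part of the range.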
  • 35. 4. Data Reduction • A warehouse may store terabytes of data: – Complex data mining may take a very long time to run on the complete data set. • Data reduction – Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
  • 36. Data Reduction Strategies • Data reduction strategies – Data cube aggregation – Dimensionality reduction – Histograms – Clustering – Sampling
  • 37. Contd.. • Data Cube Aggregation: – Suppose the data consist of sales per quarter for the years 2008 to 2010. – The analyst is interested in the annual sales (total per year), rather than the total per quarter. – Thus, the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter. – The resulting data set is smaller in volume, without loss of the information necessary for the analysis task. – This aggregation is illustrated in the figure below.
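The quarter-to-year aggregation can be sketched as follows. The sales amounts are made up for illustration (the original figure is not reproduced in this transcript), and the function name `roll_up_to_year` is an assumption.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter) -> amount. Values are illustrative only.
quarterly_sales = {
    (2008, "Q1"): 100, (2008, "Q2"): 200, (2008, "Q3"): 300, (2008, "Q4"): 400,
    (2009, "Q1"): 150, (2009, "Q2"): 250, (2009, "Q3"): 350, (2009, "Q4"): 450,
    (2010, "Q1"): 200, (2010, "Q2"): 300, (2010, "Q3"): 400, (2010, "Q4"): 500,
}

def roll_up_to_year(sales):
    """Sum the quarterly amounts of each year into one annual total."""
    annual = defaultdict(int)
    for (year, _quarter), amount in sales.items():
        annual[year] += amount
    return dict(annual)

annual_sales = roll_up_to_year(quarterly_sales)
print(annual_sales)
```

Twelve quarterly cells are reduced to three annual cells, which is enough for any analysis that only needs yearly totals.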
  • 38. Contd.. – Dimensionality Reduction: • To reduce the dimensionality, perform feature selection (i.e., attribute subset selection). • I.e., create a data set containing only the attributes relevant to the current analysis. • This reduces the number of patterns in the result of data mining, so that the result becomes easier to understand.
  • 39. Contd.. For example: Initial attribute set: {A1, A2, A3, A4, A5, A6} [Figure: decision tree that tests A4, A1, and A6, with Class 1 and Class 2 leaves] Reduced attribute set: {A1, A4, A6}
  • 40. Histograms • A popular data reduction technique • Divide data into buckets and store the average (or sum) for each bucket [Figure: example histogram]
  • 41. Clustering • Partition data set into clusters, and one can store cluster representation only.
  • 42. Sampling • Sampling is the main technique employed for data reduction. – It is often used for both the preliminary investigation of the data and the final data analysis. • Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming. • Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.
  • 43. Contd… • The key principle for effective sampling is the following: – Using a sample will work almost as well as using the entire data set, if the sample is representative – A sample is representative if it has approximately the same properties (of interest) as the original set of data
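A minimal sketch of simple random sampling without replacement, with the mean used as the property of interest. The function name `simple_random_sample` and the `seed` parameter are assumptions for illustration; the slides do not prescribe a particular sampling scheme.

```python
import random

def simple_random_sample(records, n, seed=None):
    """Draw n distinct records uniformly at random (without replacement)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(records, n)

# Taxable incomes (in thousands) from the record-data table earlier in the unit.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
sample = simple_random_sample(incomes, 5, seed=42)

# Compare a property of interest (the mean) on the sample and on the full data.
population_mean = sum(incomes) / len(incomes)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```

The closer `sample_mean` is to `population_mean`, the more representative the sample is with respect to this property; with only ten records, a single large value such as 220 can easily make a small sample unrepresentative.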
  • 44. Data Warehouse: A Three-Tiered Architecture [Figure: data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed into the data storage tier (data warehouse, data marts, metadata, monitor & integrator); an OLAP server (OLAP engine) serves the front-end tools for analysis, query, reports, and data mining.]
  • 45. OLAP • OLAP is a software technology concerned with fast analysis of enterprise information. • Often OLAP systems are data warehouse front-end software tools that make aggregate data available efficiently to an enterprise’s decision makers (analysts, managers, and executives). • Major OLAP applications are trend analysis over a number of time periods; slicing, dicing, drill-down, and roll-up to look at the data at different levels of detail; and pivoting or rotating to obtain a new multidimensional view.
  • 46. Characteristics of OLAP Systems • Users: – OLAP systems are designed for decision makers. Therefore, an OLAP system is likely to be accessed only by a selected group of managers and may be used by only dozens of users. • Functions: – OLAP systems are management-critical: they support an enterprise's decision support functions through analytical investigations.
  • 47. Contd…. • Nature: – Nature of usage of OLTP system is repetitive – Nature of usage of OLAP system is mostly ad hoc • Design: – OLAP systems are designed to be subject-oriented. – OLAP systems view enterprise information as multidimensional.
  • 48. Contd…. • Data: – OLAP systems require historical data over several years, since trends are often important in decision making. • Kinds of use: – OLAP systems normally do not update the data but refresh the data.
  • 49. FASMI Characteristics of OLAP systems • The FASMI characteristics of OLAP systems, the name derived from the first letters of the characteristics, are: – Fast – Analytic – Shared – Multidimensional – Information
  • 50. Contd.. • Fast: – OLAP queries should be answered very quickly, perhaps within seconds. – To achieve such performance: • the data structure must be efficient and the hardware must be powerful. • Aggregates can be fully pre-computed or, when that is too costly, at least the most commonly queried aggregates can be pre-computed.
  • 51. Contd.. • Analytic: – An OLAP system must provide rich analytic functionality and it is expected that most OLAP queries can be answered without any programming. – The system should be able to cope with any relevant queries for the application and the user.
  • 52. Contd.. • Shared: – An OLAP system is a shared resource, although it is unlikely to be shared by hundreds of users. – An OLAP system is likely to be accessed only by a selected group of managers and may be used by mere dozens of users. – Being a shared system, an OLAP system should provide adequate security for confidentiality as well as integrity.
  • 53. Contd.. • Multidimensional: – This is the basic requirement. – Whatever OLAP software is being used, it must provide a multidimensional conceptual view of the data.
  • 54. Contd.. • Information: – OLAP systems usually obtain information from a data warehouse. – The system should be able to handle a large amount of input data. – The capacity of an OLAP system to handle information and its integration with the data warehouse may be critical.
  • 55. Codd’s OLAP characteristics • The most important characteristics of OLAP systems proposed by Codd are as follows: – Multidimensional conceptual view – Accessibility (OLAP as a mediator) – Batch extraction vs. interpretive – Multi-user support – Storing OLAP results – Extraction of missing values – Treatment of missing values – Uniform reporting performance – Generic dimensionality – Unlimited dimensions and aggregation levels
  • 56. Contd.. • Multidimensional conceptual view: – By requiring a multidimensional view, it is possible to carry out operations like slice and dice. • Accessibility (OLAP as a mediator): – The OLAP software should sit between the data sources (e.g., a data warehouse) and an OLAP front end.
  • 57. Contd.. • Batch extraction versus interpretive: – An OLAP system should provide multidimensional data staging plus partial pre-calculation of aggregates in large multidimensional databases. • Multi-user support: – Since the OLAP system is shared, the OLAP software should provide many normal database operations including retrieval, update, concurrency control, integrity, and security.
  • 58. Contd.. • Storing OLAP results: – OLAP results data should be kept separate from source data. • Extraction of missing values: – The OLAP system should distinguish missing values from zero values. – A large data cube may have a large number of zeros as well as some missing values. – If a distinction is not made between zero values and missing values, the aggregates are likely to be computed incorrectly.
  • 59. Contd.. • Treatment of missing values: – An OLAP system should ignore all missing values regardless of their source. – Correct aggregate values will be computed once the missing values are ignored. • Uniform reporting performance: – Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system. – This is a good objective, although it may be difficult to achieve in practice.
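The effect of confusing missing values with zeros can be seen in a small sketch. It assumes missing cells are represented as `None`; the cell values and the function name `average_ignoring_missing` are hypothetical.

```python
def average_ignoring_missing(cells):
    """Average over observed cells only; missing cells (None) are skipped."""
    observed = [c for c in cells if c is not None]
    return sum(observed) / len(observed) if observed else None

# Hypothetical cube cells: one true zero (no sales) and one missing value.
cells = [10, 0, None, 20]

correct_average = average_ignoring_missing(cells)
# Treating the missing cell as zero adds a spurious observation.
wrong_average = sum(0 if c is None else c for c in cells) / len(cells)
print(correct_average, wrong_average)
```

The true zero still counts as an observation, while the missing cell does not, which is why the two must be stored distinguishably.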
  • 60. Contd.. • Generic dimensionality: – An OLAP system should treat each dimension as equivalent in both its structure and operational capabilities. • Unlimited dimensions and aggregation levels: – An OLAP system should allow unlimited dimensions and aggregation levels. – In practice, however, this is undesirable.
  • 61. Multidimensional data model • Data warehouses and OLAP tools are based on a multidimensional data model. • This model views data in the form of a data cube (which models n-dimensional data). • What is a data cube? – The data cube is a metaphor for multidimensional data storage. • A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. – Geometric cubes are 3-D structures, but in data warehousing the data cube is n-dimensional and does not confine data to 3-D.
  • 62. Contd… • Dimensions: – Dimensions are the perspectives or entities with respect to which an organization wants to keep records. – For example: in a sales data warehouse for a store, the dimensions can be time, item, branch, and location. – These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold. • Dimension table: – Each dimension may have a table associated with it, called a dimension table, which further describes the dimension. – E.g., item (item_name, brand, type), or time (day, week, month, quarter, year)
  • 63. Contd… • Fact and Fact table: – A multidimensional data model is usually organized around a central theme (e.g., sales). – Numeric measures on this theme are called facts, and they are used to analyze the relationships between the dimensions. – The fact table contains the names of the facts, or measures (such as dollars_sold) , as well as keys to each of the related dimension tables.
  • 64. Contd.. • A simple 2-D data cube: a table or spreadsheet.
  • 65. Contd.. • 3-D data cube: a set of similarly structured 2-D tables stacked on top of one another.
  • 66. Contd.. • The 3-D data in the table are represented as a series of 2-D tables, called a 3-D data cube, as in the figure below. • Fig: A 3-D data cube representation of the data in the table on the previous slide, according to time, item, and location.
  • 67. Contd.. • 4-D cubes: a 4-D cube is a series of 3-D cubes, as shown in the figure below. • In this way, we may display any n-dimensional data as a series of (n-1)-dimensional “cubes.”
  • 68. Data Cube implementation • Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries be answered in the order of seconds. It is crucial for data warehouse systems to support highly efficient cube computation techniques, access methods, and query processing techniques. • Efficient data cube computation: – No Materialization – Full Materialization – Partial Materialization • Access methods: how OLAP data can be indexed (bitmap and join indices) • Query processing techniques • OLAP server types – ROLAP – MOLAP – HOLAP
  • 69. Contd….. • Cube: A Lattice of Cuboids – Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given dimensions. – The result would form a lattice of cuboids, each showing the data at a different level of summarization(or group-by/aggregation). – In data warehousing literature, the most detailed part of the cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
  • 70. Contd… • Example: – Suppose that you would like to create a data cube for a sales store that contains city, item, and year as the dimensions and sales_in_dollars as the measure. – The possible group-bys are the following: • {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ( )} – These group-bys form a lattice of cuboids for the data cube, as shown in the figure below.
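The list of group-bys is simply every subset of the dimension set, so it can be generated mechanically. The following is a minimal sketch; the function name `cuboids` is an assumption for illustration.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every group-by (subset of dimensions), from base cuboid to apex."""
    lattice = []
    for size in range(len(dimensions), -1, -1):
        lattice.extend(combinations(dimensions, size))
    return lattice

lattice = cuboids(["city", "item", "year"])
for group_by in lattice:
    print(group_by)
```

The first entry, (city, item, year), is the base cuboid, and the last entry, the empty group-by ( ), is the apex cuboid; n dimensions without concept hierarchies yield 2^n cuboids.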
  • 71. Contd..
  • 72. Contd… • Two special types of cuboids: – Base cuboid: • The base cuboid contains all the data for every combination of the given n dimensions. • So, the most detailed part of the cube is called the base cuboid. • The base cuboid is the least generalized (most specific) of the cuboids. – Apex cuboid: • The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. • The apex cuboid is the most generalized (least specific) of the cuboids and is often denoted as all.
  • 73. Contd… • Curse of Dimensionality: • OLAP may need to access different cuboids for different queries. • A good idea: compute all or at least some of the cuboids in a data cube in advance. • Pre-computation leads to fast response time and avoids some redundant computation. • But the required storage space may explode (due to pre-computation of all cuboids, a large number of dimensions, and many concept hierarchies associated with the dimensions). • This problem is referred to as the curse of dimensionality.
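The growth in the number of cuboids can be made concrete with the counting formula commonly given in the data warehousing literature: if dimension i has L_i hierarchy levels (not counting the virtual top level all), the total number of cuboids is the product of (L_i + 1) over all dimensions. The formula is not stated in the slides, and the function name below is an assumption.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids = product of (L_i + 1) over all dimensions.

    L_i is the number of hierarchy levels of dimension i, excluding the
    virtual top level "all".
    """
    return prod(levels + 1 for levels in levels_per_dimension)

# Three dimensions without hierarchies (one level each): 2^3 cuboids.
no_hierarchies = total_cuboids([1, 1, 1])

# Ten dimensions with four levels each (five including "all"): 5^10 cuboids.
with_hierarchies = total_cuboids([4] * 10)
print(no_hierarchies, with_hierarchies)
```

Even a moderately sized schema therefore yields millions of cuboids, which is why materializing all of them is rarely practical.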
  • 74. Data cube Materialization • There are three choices for data cube materialization (computation of cuboids) given a base cuboid: 1. No Materialization 2. Full Materialization 3. Partial Materialization
  • 75. Contd.. • No Materialization: Do not pre-compute any of the “non-base” cuboids. This leads to computing expensive multidimensional aggregates on the fly, which can be extremely slow.
  • 76. Contd.. • Full Materialization: • Pre-compute all of the cuboids. • The resulting lattice of computed cuboids is referred to as the full cube. • This choice typically requires huge amounts of memory space in order to store all of the pre-computed cuboids.
  • 77. Contd.. • Partial Materialization: Selectively compute a proper subset of the whole set of possible cuboids. It represents an interesting trade-off between storage space and response time. The partial materialization of cuboids or sub-cubes should consider three factors: – Identify the subset of cuboids or sub-cubes to materialize – Exploit the materialized cuboids or sub-cubes during query processing – Efficiently update the materialized cuboids or sub-cubes during load and refresh
  • 78. Indexing OLAP Data • Indexing facilitates efficient data access and further speeds up query processing. • The two most commonly used methods are: • the bitmap indexing method and • the join indexing method
  • 79. Contd.. • Bitmap Indexing: • In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the domain of the attribute. • If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0. • Bitmap indexing reduces join, aggregation, and comparison operations to bit arithmetic.
  • 80. Contd.. The figure below shows a base (data) table containing the dimensions item and city, and its mapping to bitmap index tables for each of the dimensions.
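Since the figure is not reproduced in this transcript, the following minimal sketch builds bitmap indices for a small, hypothetical base table. The row identifiers, item and city values, and function names are assumptions for illustration; bit vectors are shown as Python lists rather than packed bits.

```python
def build_bitmap_index(rows, attribute):
    """Map each distinct value v of the attribute to its bit vector Bv."""
    values = sorted({row[attribute] for row in rows})
    return {v: [1 if row[attribute] == v else 0 for row in rows] for v in values}

def bitwise_and(bv_a, bv_b):
    """Combine two selections with bit arithmetic instead of comparisons."""
    return [a & b for a, b in zip(bv_a, bv_b)]

# Hypothetical base table with the dimensions item and city.
base_table = [
    {"rid": "R1", "item": "TV", "city": "Toronto"},
    {"rid": "R2", "item": "phone", "city": "Vancouver"},
    {"rid": "R3", "item": "TV", "city": "Vancouver"},
    {"rid": "R4", "item": "computer", "city": "Toronto"},
]

item_index = build_bitmap_index(base_table, "item")
city_index = build_bitmap_index(base_table, "city")

# Rows where item = "TV" AND city = "Vancouver".
tv_in_vancouver = bitwise_and(item_index["TV"], city_index["Vancouver"])
matching_rids = [row["rid"] for row, bit in zip(base_table, tv_in_vancouver) if bit]
print(matching_rids)
```

Bitmap indices work best for attributes with few distinct values, because one bit vector is kept per value.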
  • 81. Contd.. • Join indexing: • The join indexing method gained popularity from its use in relational database query processing. • Join indexing registers the joinable rows of two relations from a relational database. • Hence, the join index records can identify joinable tuples without performing costly join operations.
  • 82. Contd.. Example: the join index relationship between the sales fact table and the location and item dimension tables is shown in the figure below. Here, the “Main Street” value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of the sales fact table.
  • 83. Contd.. The corresponding join index tables are shown in Figure below.
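A minimal sketch of how such join index tables could be derived. The tuple identifiers T57, T238, T459, and T884 and the values “Main Street” and “Sony-TV” come from the example above; every other value (marked hypothetical) and the function name `build_join_index` are assumptions, since the figure is not reproduced here.

```python
from collections import defaultdict

# Sales fact tuples. Values marked "hypothetical" are made up for illustration.
sales = [
    {"tid": "T57", "location": "Main Street", "item": "Sony-TV"},
    {"tid": "T238", "location": "Main Street", "item": "hypothetical-item"},
    {"tid": "T459", "location": "hypothetical-location", "item": "Sony-TV"},
    {"tid": "T884", "location": "Main Street", "item": "hypothetical-item"},
]

def build_join_index(fact_rows, dimension):
    """Record, for each dimension value, the fact tuples it joins with."""
    index = defaultdict(list)
    for row in fact_rows:
        index[row[dimension]].append(row["tid"])
    return dict(index)

location_index = build_join_index(sales, "location")
item_index = build_join_index(sales, "item")
print(location_index["Main Street"])
print(item_index["Sony-TV"])
```

Once the index exists, finding the sales tuples for a dimension value is a lookup rather than a join over the fact table.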
  • 84. Efficient Processing of OLAP Queries • The purpose of materializing cuboids and constructing OLAP index structures is to speed up query processing in data cubes. Given materialized views, query processing should proceed as follows: 1. Determine which operations should be performed on the available cuboids. 2. Determine to which materialized cuboid(s) the relevant operations should be applied.
  • 85. Types of OLAP Servers • OLAP servers present business users with multidimensional data from data warehouses, without concerns regarding how or where the data are stored. • However, the physical architecture and implementation of OLAP servers must consider data storage issues. • Implementations of a warehouse server for OLAP processing include the following: – Relational OLAP (ROLAP) – Multidimensional OLAP (MOLAP) – Hybrid OLAP (HOLAP)
  • 86. Contd.. • Relational OLAP (ROLAP) Server: – These are the intermediate servers that stand in between a relational back- end server and client front-end tools. – They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. – ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. – ROLAP technology tends to have greater scalability than MOLAP technology.
  • 88. Contd.. • Multidimensional OLAP (MOLAP) Server: – These servers support multidimensional views of data through array-based multidimensional storage engines. – They map multidimensional views directly to data cube array structures. – The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. – In multidimensional data stores, the storage utilization may be low if the data set is sparse.
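The point about low storage utilization can be made concrete with a small sketch of array-based storage. The city and quarter names follow the sales example used later in this unit; the three stored values and the helper `store` are hypothetical.

```python
locations = ["Chicago", "New York", "Toronto", "Vancouver"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
items = ["Mobile", "Modem"]

# Dense 3-D array: one slot per (location, quarter, item) combination.
dense = [[[None for _ in items] for _ in quarters] for _ in locations]


def store(location, quarter, item, value):
    """Array-based storage: member names map directly to array positions."""
    i = locations.index(location)
    j = quarters.index(quarter)
    k = items.index(item)
    dense[i][j][k] = value


# Hypothetical sales figures (dollars sold, in thousands).
store("Toronto", "Q1", "Mobile", 10)
store("Vancouver", "Q1", "Modem", 7)
store("Chicago", "Q2", "Mobile", 5)

total_cells = len(locations) * len(quarters) * len(items)
used_cells = sum(cell is not None for plane in dense for row in plane for cell in row)
utilization = used_cells / total_cells

print(used_cells, total_cells, utilization)  # 3 32 0.09375
```

Every cell lookup is a direct array access, which is why indexing is fast, but here 29 of the 32 allocated cells hold nothing.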
  • 90. Contd.. • Hybrid OLAP (HOLAP) Servers: • The hybrid OLAP approach combines ROLAP and MOLAP technology. • It benefits from the greater scalability of ROLAP and the faster computation of MOLAP.
  • 91. Contd.. • MOLAP vs. ROLAP:
    MOLAP | ROLAP
    Information retrieval is fast. | Information retrieval is comparatively slow.
    Uses sparse array to store data-sets. | Uses relational table.
    MOLAP is best suited for inexperienced users, since it is very easy to use. | ROLAP is best suited for experienced users.
    Maintains a separate database for data cubes. | It may not require space other than available in the Data warehouse.
    DBMS facility is weak. | DBMS facility is strong.
  • 92. Data Cube Operations (OLAP Operations) • A number of operations may be applied to data cubes for OLAP. • The common ones are: – Slice – Dice – Roll-up (Drill-up) – Drill-down (Roll-down) – Pivot (Rotate)
  • 93. Contd.. A sales data cube used to illustrate the data cube operations: • The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types. – The measure displayed is dollars sold (in thousands). – The data examined are for the cities Chicago, New York, Toronto, and Vancouver.
  • 94. Contd.. • Slice: – The slice operation performs a selection on one dimension of the given cube, creating a sub-cube. – The example below shows how the slice operation works: the sales data are selected from the central cube for the dimension time using the criterion time = “Q1”.
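As a rough sketch, a slice can be written as a filter that fixes one dimension and drops it from the result. The cube below is a plain dictionary keyed by (location, time, item); all sales figures, the item types “Phone” and “Security”, and the function `slice_cube` are hypothetical.

```python
# Hypothetical cells: (city, quarter, item type) -> dollars sold (in thousands).
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q2", "Mobile"): 12,
    ("Vancouver", "Q1", "Modem"): 7,
    ("Chicago", "Q1", "Phone"): 5,
    ("New York", "Q3", "Security"): 9,
}


def slice_cube(cells, time):
    """Fix the time dimension to one value; the result is a 2-D (location, item) sub-cube."""
    return {(loc, item): v for (loc, t, item), v in cells.items() if t == time}


q1 = slice_cube(cube, "Q1")
print(q1)
```

The result has one dimension fewer than the input, since time is now fixed at “Q1”.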
  • 95. Contd.. • Dice: – The dice operation performs a selection on two or more dimensions of a given cube and creates a sub-cube. • The example below shows how the dice operation works, based on the following selection criteria: (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”) and (item = “Mobile” or “Modem”).
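The selection criteria above translate directly into a filter over several dimensions at once. In this sketch the criteria are taken from the slide, while the sales figures, the item type “Phone”, and the function `dice` are hypothetical.

```python
# Hypothetical cells: (city, quarter, item type) -> dollars sold (in thousands).
cube = {
    ("Toronto", "Q1", "Mobile"): 10,
    ("Toronto", "Q3", "Mobile"): 8,      # fails the time criterion
    ("Vancouver", "Q2", "Modem"): 7,
    ("Vancouver", "Q1", "Phone"): 3,     # fails the item criterion
    ("Chicago", "Q1", "Mobile"): 5,      # fails the location criterion
}


def dice(cells, locations, times, items):
    """Keep only cells whose coordinates satisfy the criterion on every dimension."""
    return {
        (loc, t, item): v
        for (loc, t, item), v in cells.items()
        if loc in locations and t in times and item in items
    }


sub_cube = dice(
    cube,
    locations={"Toronto", "Vancouver"},
    times={"Q1", "Q2"},
    items={"Mobile", "Modem"},
)
print(sub_cube)
```

Unlike a slice, the result keeps all three dimensions; each one is only restricted to a subset of its values.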
  • 96. Contd.. • Roll-up (Drill-up): – The roll-up operation performs aggregation on a data cube, either: • by climbing up a concept hierarchy for a dimension, or • by dimension reduction. – The example below shows how the roll-up operation works.
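Both forms of roll-up can be sketched as aggregations. The example assumes a location hierarchy in which cities roll up to countries; the sales figures and the function names `roll_up_location` and `drop_time` are hypothetical.

```python
from collections import defaultdict

# Concept hierarchy for location: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}

# Hypothetical cells: (city, quarter, item type) -> dollars sold (in thousands).
cube = {
    ("Chicago", "Q1", "Mobile"): 5,
    ("New York", "Q1", "Mobile"): 9,
    ("Toronto", "Q1", "Mobile"): 10,
    ("Vancouver", "Q1", "Mobile"): 7,
    ("Toronto", "Q2", "Mobile"): 12,
}


def roll_up_location(cells):
    """Climb the location hierarchy: sum city-level cells per country."""
    totals = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        totals[(CITY_TO_COUNTRY[city], quarter, item)] += value
    return dict(totals)


def drop_time(cells):
    """Dimension reduction: remove time by summing over all quarters."""
    totals = defaultdict(int)
    for (city, _quarter, item), value in cells.items():
        totals[(city, item)] += value
    return dict(totals)


by_country = roll_up_location(cube)
without_time = drop_time(cube)
print(by_country[("Canada", "Q1", "Mobile")])  # 17
print(without_time[("Toronto", "Mobile")])     # 22
```

In both cases the result has fewer, more summarized cells than the input.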
  • 97. Contd.. • Drill-down (Roll-down): – Drill-down is the reverse of roll-up. It is performed in either of the following ways: • By stepping down a concept hierarchy for a dimension • By introducing a new dimension. – It allows users to navigate among different levels of data, from the most summarized (up) to the most detailed (down). – The example below shows how the drill-down operation works.
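Drill-down needs the more detailed data to be available somewhere, so this sketch assumes the cube is stored at month level and the quarterly view is derived from it. The monthly figures and the function names `quarterly_view` and `drill_down` are hypothetical.

```python
from collections import defaultdict

# Concept hierarchy for time: month -> quarter.
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Hypothetical detailed cells: (city, month, item type) -> dollars sold (in thousands).
monthly = {
    ("Toronto", "Jan", "Mobile"): 3,
    ("Toronto", "Feb", "Mobile"): 4,
    ("Toronto", "Mar", "Mobile"): 3,
    ("Toronto", "Apr", "Mobile"): 12,
}


def quarterly_view(cells):
    """Summarized view: months aggregated to quarters (a roll-up)."""
    totals = defaultdict(int)
    for (city, month, item), value in cells.items():
        totals[(city, MONTH_TO_QUARTER[month], item)] += value
    return dict(totals)


def drill_down(cells, quarter):
    """Step down the time hierarchy: show the month-level cells behind one quarter."""
    return {k: v for k, v in cells.items() if MONTH_TO_QUARTER[k[1]] == quarter}


summary = quarterly_view(monthly)
detail = drill_down(monthly, "Q1")
print(summary[("Toronto", "Q1", "Mobile")])  # 10
print(detail)
```

The three monthly cells returned for “Q1” add up to the quarterly total, which shows that drill-down and roll-up are inverse views of the same data.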
  • 98. Contd.. • Pivot: – Pivot, also known as rotate, changes the dimensional orientation of the cube, i.e. it rotates the axes to view the data from different perspectives. The cubes below show a 2D representation of pivot.
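For a 2-D view, pivot amounts to swapping the row and column axes. A minimal sketch follows, with hypothetical sales figures and the hypothetical helpers `pivot` and `as_rows`.

```python
# Hypothetical 2-D view: (city, item type) -> dollars sold (in thousands).
sales_2d = {
    ("Toronto", "Mobile"): 10,
    ("Toronto", "Modem"): 4,
    ("Vancouver", "Mobile"): 7,
    ("Vancouver", "Modem"): 6,
}


def pivot(table):
    """Swap the two axes: (row, column) -> (column, row). Values are unchanged."""
    return {(col, row): value for (row, col), value in table.items()}


def as_rows(table):
    """Group a 2-D table by its first axis, for display."""
    rows = {}
    for (row, col), value in table.items():
        rows.setdefault(row, {})[col] = value
    return rows


print(as_rows(sales_2d))         # one row per city
print(as_rows(pivot(sales_2d)))  # one row per item type
```

No value is added, removed, or aggregated; only the presentation changes.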
  • 99. Guidelines for OLAP implementation • A number of guidelines for the successful implementation of OLAP are as follows: – Vision – Senior management support – Selecting an OLAP tool – Corporate strategy – Focus on the users – Joint management – Review and adapt
  • 100. Contd.. • Vision: – The OLAP team must, in consultation with the users, develop a clear vision for the OLAP system. This vision, including the business objectives, should be clearly defined, understood, and shared by the stakeholders. • Senior management support: – The OLAP project should be fully supported by the senior managers. Since a data warehouse may already have been developed, this should not be difficult.
  • 101. Contd.. • Selecting an OLAP tool: – The OLAP team should familiarize themselves with the ROLAP and MOLAP tools available in the market. Since the tools are quite different, careful planning may be required to select a tool that is appropriate for the enterprise. In some situations, a combination of ROLAP and MOLAP may be most effective. • Corporate strategy: – The OLAP strategy should fit with the enterprise strategy and business objectives. A good fit will result in the OLAP tools being used more widely.
  • 102. Contd.. • Focus on the users: – The OLAP project should be focused on users. Users should, in consultation with the technical professionals, decide which tasks will be done first and which will be done later. Attempts should be made to provide each user with a tool suitable for that person's skill level and information needs. A good GUI should be provided to non-technical users. The project can only be successful with the full support of the users.
  • 103. Contd.. • Joint management: – The OLAP project must be managed by both IT and business professionals. Many other people should be involved in supplying ideas. An appropriate committee structure may be necessary to channel these ideas. • Review and adapt: – Organizations evolve, and so must the OLAP system. Regular reviews of the project may be required to ensure that it is meeting the current needs of the enterprise.
  • 104. Home Work • What are dimensions, members, measures, and fact tables? • List the major differences between OLTP systems and OLAP systems. • What is OLAP and what is its purpose? List the characteristics of OLAP systems. • What is a data cube and what is its purpose? Use an example to illustrate the use of a data cube. • What are ROLAP and MOLAP? Describe the two approaches and list their advantages and disadvantages. • Describe the OLAP (cube) operations roll-up, drill-down, slice, and dice. • List the guidelines for implementing OLAP.
  • 105. Thank You!