SlideShare a Scribd company logo
1 of 81
Download to read offline
DAT630

Exploring Data
Darío Garigliotti | University of Stavanger
18/09/2017
Introduction to Data Mining, Chapter 3
Data Exploration
- Preliminary investigation of the data in order to
better understand its specific characteristics

- Can aid in selecting the appropriate
preprocessing and data analysis techniques

- Can even address some of the questions
typically answered by data mining

- Finding patterns by visually inspecting the data
Three major topics
- Summary statistics

- Visualization

- On-Line Analytical Processing (OLAP)
The Iris Data Set
The Iris data set
- Introduced in 1936 by Ronald Fisher

- 50 samples from each of three species of Iris

- Iris setosa, Iris virginica, and Iris versicolor
- Four features from each sample

- The length and the width of the sepals and petals
Four Features
Summary Statistics
Summary Statistics
- Quantities that capture various
characteristics of a potentially large set of
values with a single number (or a small set of
numbers)

- Examples

- Average household income
- Fraction of students who complete a BSc in 3 years
Frequency
- The frequency of an attribute value is the
percentage with which the value occurs in the 

data set 

- For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time
- x is a categorical attribute that can take values 

{v1,…,vk} and there are m objects in total
frequency(vi) =
number of objects with attribute value vi
m
Mode
- The mode of an attribute is the most frequent
attribute value

- The notion of a mode is only interesting if attribute
values have different frequencies
- The notions of frequency and mode are
typically used with categorical data
Example
- What are the frequencies?

- What is the mode?
Age Count Frequency
0-9 3
10-19 4
20-29 15
30-39 12
40-49 8
50-59 2
Total 44
Example
- What are the frequencies?

- What is the mode?
Age Count Frequency
0-9 3 0.068
10-19 4 0.090
20-29 15 0.340
30-39 12 0.272
40-49 8 0.181
50-59 2 0.045
Total 44
Mode
Percentiles
- For continuous data, the notion of a percentile
is more useful 

- Given an ordinal or continuous attribute x and
a number p between 0 and 100, the pth
percentile is a value xp of x such that p% of the
observed values of x are less than xp

- For instance, the 50th percentile is the value x50%
such that 50% of all values of x are less than x50%
- min(x)= x0% max(x)= x100%
Example
- What is the 80th percentile of this data?

- 8, 6, 3, 7, 3, 4, 1, 6, 8, 5
Example
- Sort the data 

- 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
80% of the values are smaller than 8
Mean and Median
- Most widely used statistics for continuous data

- Let x be an attribute and {x1,…,xm} the values
of the attribute for a set of m objects

- Let {x(1),…,x(m)} the set of values after sorting

- I.e., x(1)=min(x) and x(m)=max(x)
mean(x) = ¯x =
1
m
mX
i=1
xi
Mean and Median (2)
- The middle value if there is an odd number of
values, and the average of the two middle
values if the number of values is even
⇢
x(r+1), if m is odd (i.e., m = 2r + 1)
1
2 (x(r) + x(r+1)), if m is even (i.e., m = 2r)
median(x) =
Mean vs. Median
- Both indicate the "middle" of the values

- If the distribution of values is skewed, then the
median is a better indicator of the middle

- The mean is sensitive to the presence of
outliers; the median provides a more robust
estimate of the middle
Trimmed Mean
- To overcome problems with the traditional
definition of a mean, the notion of a trimmed
mean is sometimes used

- A percentage p between 0 and 100 is
specified; the top and bottom (p/2)% of the
data is thrown out; then mean is calculated the
normal way

- Median is a trimmed mean with p=100%, the
standard mean corresponds to p=0%
Example
- Consider the set of values {1, 2, 3, 4, 5, 90}

- What is the mean?

- What is the median?

- What is the trimmed mean with p=40%?
Example
- Consider the set of values {1, 2, 3, 4, 5, 90}

- What is the mean? 17.5

- What is the median? (3+4)/2 = 3.5

- What is the trimmed mean with p=40%? 3.5

- Trimmed values (with top-20% and bottom-20% of
the values thrown out): {2,3,4,5}
Range and Variance
- To measure the dispersion/spread of a set of
values (for continuous data)

- Range
- Variance*

- Standard deviation is the square root of variance
range(x) = max(x) min(x) = x(m) x(1)
*This variant is known as the "bias-corrected sample variance"
variance(x) = s2
x =
1
m 1
mX
i=1
(xi ¯x)2
Range vs. Variance
- Range can be misleading if the values are
concentrated in a narrow area, but there are
also a relatively small number of extreme
values

- Hence, the variance is preferred as a measure
of spread
Example
- What is the range and variance of the following
data?

- 3 24 30 47 43 7 47 13 44 39
Example
- What is the range and variance of the following
data?

- 3 24 30 47 43 7 47 13 44 39
- Range: 47-3 = 44
- Variance: 289.57
- mean: 29.7
More Robust Estimates of
Spread
- Variance is particularly sensitive to outliers

- The mean can be distorted by outliers; variance
uses the squared difference between the mean and
other values
- Absolute Average Deviation (AAD)
AAD(x) =
1
m
mX
i=1
|xi ¯x|
More Robust Estimates of
Spread (2)
- Median Absolute Deviation (MAD)

- Interquartile Range (IQR)
MAD(x) = median |x1 ¯x|, . . . , |xm ¯x|
IQR(x) = x75% x25%
More Robust Estimates of
Spread (3)
Example
Exercises
Visualization
Goals and Motivation
- Data visualization is the display of information
in a graphic or tabular format

- The motivation for using visualization is that
people can quickly absorb large amounts of
visual information and find patterns in it

- Visualization is a powerful and appealing
technique for data exploration

- Humans can easily detect general patterns and
trends as well as outliers and unusual patterns
Example
- Sea Surface Temperature (SST) for July 1982

- Tens of thousands of data points are summarized in
a single figure
Outline for this part
- General concepts

- Representation
- Arrangement
- Selection
- Visualization techniques

- Histograms
- Box plots
- Scatter plots
- Contour plots
- …
Representation
- Mapping of information to a visual format

- Data objects, their attributes, and the
relationships among data objects are
translated into graphical elements such as
points, lines, shapes, and colors

- Objects are often represented as points
- Attribute values can be represented as the
position of the points or using color, size, shape, etc.
- Position can express the relationships among
points
Arrangement
- Placement of visual elements within a display

- Can make a large difference in how easy it is to
understand the data
vs.
Selection
- Elimination or the de-emphasis of certain
objects and attributes

- May involve the choosing a subset of attributes 

- Dimensionality reduction is often used to reduce the
number of dimensions to two or three
- Alternatively, pairs of attributes can be considered
- May also involve choosing a subset of objects

- Visualizing all objects can result in a display that is
too crowded
Outline for this part
- General concepts

- Visualization techniques

- Yet techniques are often very specialized in the data
being analyzed, it is possible to group them by some
general properties
- For example, visualization techniques for:
- Small number of attributes
- Spatio-temporal data
- High-dimensional data
Outline for this part
- General concepts

- Visualization techniques

- Histograms
- Box plots
- Scatter plots
- Contour plots
- Matrix plots
- Parallel coordinates
- Star plots
- Chernoff faces
Histograms
- Usually shows the distribution of values of a
single variable

- Divide the values into bins and show a bar plot
of the number of objects in each bin. 

- The height of each bar indicates the number of
objects

- Shape of histogram depends on the number of
bins
Example
- Petal width
10 bins 20 bins
2D Histograms
- Show the joint distribution of the values of two
attributes 

- Each attribute is divided into intervals and the two
sets of intervals define two-dimensional rectangles of
values
- It can show patterns not present in 1D ones

- Visually more complicated, e.g., some columns may
be hidden by others
Example
- Petal width and petal length
Box Plots
- Way of displaying the distribution of data
outlier
10th percentile
25th percentile
75th percentile
50th percentile
10th percentile
outlier
10th perce
25th perce
75th perce
50th perce
10th perce
Example
- Comparing attributes
Pie Charts
- Similar to histograms, but typically used with
categorical attributes that have a relatively
small number of values

- Common in popular articles, but used less
frequently in technical publications

- The size of relative areas can be hard to judge
- Histograms are preferred for technical work!
http://www.businessinsider.com/pie-charts-are-the-worst-2013-6
http://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Empirical Cumulative
Distribution Function
Scatter Plots
- Attributes values determine the position

- Two-dimensional scatter plots most common,
but can have three-dimensional scatter plots

- Often additional attributes can be displayed by
using the size, shape, and color of the markers
that represent the objects
Example
Example
- Arrays of scatter plots to summarize the
relationships of several pairs of attributes
Contour Plots
- Useful when a continuous attribute is
measured on a spatial grid

- They partition the plane into regions of similar
values

- The contour lines that form the boundaries of
these regions connect points with equal values

- The most common example is contour maps of
elevation
Celsius
Example
Celsius
Parallel Coordinates
- Plot the attribute values of high-dimensional
data

- Instead of using perpendicular axes, use a set
of parallel axes 

- The attribute values of each object are plotted
as a point on each corresponding coordinate
axis and the points are connected by a line,
i.e., each object is represented as a line 

- The ordering of attributes is important
Example
- Different ordering of attributes
Star Plots
- Similar approach to parallel coordinates, but
axes radiate from a central point

- The line connecting the values of an object is a
polygon
Example
Setosa
Versicolour
Virginica
Example
- Other useful information, such as average
values or thresholds, can also be encoded
Chernoff Faces
- Approach created by Herman Chernoff

- Each attribute is associated with a
characteristic of a face

- Size of the face, shape of jaw, shape of forhead, etc.
- The value of the attribute determines the appearance
of the corresponding facial characteristic	
- Each object becomes a separate face

- Relies on human’s ability to distinguish faces
Example
Setosa
Versicolour
Virginica
Principles for Visualization
Quality
- Apprehension to perceive relations among variables

- Clarity to distinguish the most important elements

- Consistency with previous, related graphs

- Efficiency to show complex information in simple ways

- Necessity of the graph, vs alternatives

- Truthfulness when using magnitudes, relative to scales
Infographics
OLAP and Multidimensional
Data Analysis
OLAP
- Relational databases put data into tables, while
OLAP uses a multidimensional array
representation

- Such representations of data previously existed in
statistics and other fields
- There are a number of data analysis and data
exploration operations that are easier with
such a data representation
Converting Tabular Data
- Two key steps in converting tabular data into a
multidimensional array

1.Identify which attributes are to be the
dimensions and which attribute is to be the
target attribute

- The attributes used as dimensions must have
discrete values
- The target value is typically a count or continuous
value, e.g., the cost of an item
- Can have no target variable at all except the count of
objects that have the same set of attribute values
Converting Tabular Data (2)
2.Find the value of each entry in the
multidimensional array by summing the values
(of the target attribute) or count of all objects
that have the attribute values corresponding to
that entry
Example
- Petal width and length are discretized to have
categorical values: low, medium, and high
Example
- Each unique tuple of petal width, petal length,
and species type identifies one element of the
array
Example
- Cross-tabulations can be used to show slices
of the multidimensional array
Example
- Cross-tabulations can be used to show slices
of the multidimensional array
Data Cube
- The key operation of a OLAP is the formation
of a data cube

- A data cube is a multidimensional
representation of data, together with all
possible aggregates

- Aggregates that result by selecting a proper subset
of the dimensions and summing over all remaining
dimensions
Example
- Consider a data set that records the sales of
products at a number of company stores at
various dates
- This data can be represented 

as a 3 dimensional array
- There are 3 two-dimensional

aggregates, 3 one-dimensional
aggregates, and 1 zero-dimensional
aggregate (the overall total)
Example
- This table shows one of the two dimensional
aggregates, along with two of the one-
dimensional aggregates, and the overall total
OLAP Operations
- Slicing

- Dicing

- Roll-up

- Drill-down
Slicing and Dicing
- Slicing is selecting a group of cells from the
entire multidimensional array by specifying a
specific value for one or more dimensions

- Dicing involves selecting a subset of cells by
specifying a range of attribute values

- This is equivalent to defining a subarray from the
complete array
- In practice, both operations can also be
accompanied by aggregation over some
dimensions
Example
Roll-up and Drill-down
- Attribute values often have a hierarchical
structure

- Each date is associated with a year, month, and
week
- A location is associated with a continent, country,
state (province, etc.), and city
- Products can be divided into various categories,
such as clothing, electronics, and furniture
- These categories often nest and form a tree

- A year contains months which contains day
- A country contains a state which contains a city
Example

More Related Content

What's hot

Source of DATA
Source of DATASource of DATA
Source of DATANahid Amin
 
Statistics Based On Ncert X Class
Statistics Based On Ncert X ClassStatistics Based On Ncert X Class
Statistics Based On Ncert X ClassRanveer Kumar
 
Day2 session i&ii - spss
Day2 session i&ii - spssDay2 session i&ii - spss
Day2 session i&ii - spssabir hossain
 
Statistics Math project class 10th
Statistics Math project class 10thStatistics Math project class 10th
Statistics Math project class 10thRiya Singh
 
Maths statistcs class 10
Maths statistcs class 10   Maths statistcs class 10
Maths statistcs class 10 Rc Os
 
Business Statistics
Business StatisticsBusiness Statistics
Business Statisticsshorab
 
Bcs 040 Descriptive Statistics
Bcs 040 Descriptive StatisticsBcs 040 Descriptive Statistics
Bcs 040 Descriptive StatisticsNarayan Thapa
 
Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)HennaAnsari
 
03.data presentation(2015) 2
03.data presentation(2015) 203.data presentation(2015) 2
03.data presentation(2015) 2Mmedsc Hahm
 
15. descriptive statistics
15. descriptive statistics15. descriptive statistics
15. descriptive statisticsAshok Kulkarni
 
Statistics in research by dr. sudhir sahu
Statistics in research by dr. sudhir sahuStatistics in research by dr. sudhir sahu
Statistics in research by dr. sudhir sahuSudhir INDIA
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with RDr Nisha Arora
 

What's hot (19)

Source of DATA
Source of DATASource of DATA
Source of DATA
 
Statistics Based On Ncert X Class
Statistics Based On Ncert X ClassStatistics Based On Ncert X Class
Statistics Based On Ncert X Class
 
Day2 session i&ii - spss
Day2 session i&ii - spssDay2 session i&ii - spss
Day2 session i&ii - spss
 
Statistics Math project class 10th
Statistics Math project class 10thStatistics Math project class 10th
Statistics Math project class 10th
 
Maths statistcs class 10
Maths statistcs class 10   Maths statistcs class 10
Maths statistcs class 10
 
Qt notes
Qt notesQt notes
Qt notes
 
Descriptive statistics i
Descriptive statistics iDescriptive statistics i
Descriptive statistics i
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Business Statistics
Business StatisticsBusiness Statistics
Business Statistics
 
Bcs 040 Descriptive Statistics
Bcs 040 Descriptive StatisticsBcs 040 Descriptive Statistics
Bcs 040 Descriptive Statistics
 
Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)Basics of Educational Statistics (Descriptive statistics)
Basics of Educational Statistics (Descriptive statistics)
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
 
R training4
R training4R training4
R training4
 
03.data presentation(2015) 2
03.data presentation(2015) 203.data presentation(2015) 2
03.data presentation(2015) 2
 
15. descriptive statistics
15. descriptive statistics15. descriptive statistics
15. descriptive statistics
 
STATISTICS
STATISTICSSTATISTICS
STATISTICS
 
Statistics in research by dr. sudhir sahu
Statistics in research by dr. sudhir sahuStatistics in research by dr. sudhir sahu
Statistics in research by dr. sudhir sahu
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with R
 
Descriptive statistics -review(2)
Descriptive statistics -review(2)Descriptive statistics -review(2)
Descriptive statistics -review(2)
 

Similar to Data Mining - Exploring Data

1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdfthaersyam
 
Class1.ppt
Class1.pptClass1.ppt
Class1.pptGautam G
 
Introduction to Statistics - Basics of Data - Class 1
Introduction to Statistics - Basics of Data - Class 1Introduction to Statistics - Basics of Data - Class 1
Introduction to Statistics - Basics of Data - Class 1RajnishSingh367990
 
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICS
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICSSTATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICS
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICSnagamani651296
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics Bahzad5
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxMuhammadNafees42
 
4. six sigma descriptive statistics
4. six sigma descriptive statistics4. six sigma descriptive statistics
4. six sigma descriptive statisticsHakeem-Ur- Rehman
 
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg Girls High
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyPrithwis Mukerjee
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyPrithwis Mukerjee
 

Similar to Data Mining - Exploring Data (20)

17329274.ppt
17329274.ppt17329274.ppt
17329274.ppt
 
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf
 
SP and R.pptx
SP and R.pptxSP and R.pptx
SP and R.pptx
 
Class1.ppt
Class1.pptClass1.ppt
Class1.ppt
 
Class1.ppt
Class1.pptClass1.ppt
Class1.ppt
 
Class1.ppt
Class1.pptClass1.ppt
Class1.ppt
 
Introduction to Statistics - Basics of Data - Class 1
Introduction to Statistics - Basics of Data - Class 1Introduction to Statistics - Basics of Data - Class 1
Introduction to Statistics - Basics of Data - Class 1
 
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICS
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICSSTATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICS
STATISTICS BASICS INCLUDING DESCRIPTIVE STATISTICS
 
Class1.ppt
Class1.pptClass1.ppt
Class1.ppt
 
Medical statistics
Medical statisticsMedical statistics
Medical statistics
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
 
4. six sigma descriptive statistics
4. six sigma descriptive statistics4. six sigma descriptive statistics
4. six sigma descriptive statistics
 
Wynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statisticsWynberg girls high-Jade Gibson-maths-data analysis statistics
Wynberg girls high-Jade Gibson-maths-data analysis statistics
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central Tendency
 
QT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central TendencyQT1 - 03 - Measures of Central Tendency
QT1 - 03 - Measures of Central Tendency
 
Exploring Data
Exploring DataExploring Data
Exploring Data
 
Exploring Data
Exploring DataExploring Data
Exploring Data
 
Statistics for 6 Sigma.pptx
Statistics for 6 Sigma.pptxStatistics for 6 Sigma.pptx
Statistics for 6 Sigma.pptx
 

More from Darío Garigliotti

Task-Based Support in Search Engines
Task-Based Support in Search EnginesTask-Based Support in Search Engines
Task-Based Support in Search EnginesDarío Garigliotti
 
About "Towards Better Text Understanding and Retrieval through Kernel Entity ...
About "Towards Better Text Understanding and Retrieval through Kernel Entity ...About "Towards Better Text Understanding and Retrieval through Kernel Entity ...
About "Towards Better Text Understanding and Retrieval through Kernel Entity ...Darío Garigliotti
 
A Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion EnginesA Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion EnginesDarío Garigliotti
 
A Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion EnginesA Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion EnginesDarío Garigliotti
 
A Knowledge Base of Entity-Oriented Search Intents
A Knowledge Base of Entity-Oriented Search IntentsA Knowledge Base of Entity-Oriented Search Intents
A Knowledge Base of Entity-Oriented Search IntentsDarío Garigliotti
 
On Type-Aware Entity Retrieval
On Type-Aware Entity RetrievalOn Type-Aware Entity Retrieval
On Type-Aware Entity RetrievalDarío Garigliotti
 
Learning-to-Rank Target Types for Entity-Bearing Queries
Learning-to-Rank Target Types for Entity-Bearing QueriesLearning-to-Rank Target Types for Entity-Bearing Queries
Learning-to-Rank Target Types for Entity-Bearing QueriesDarío Garigliotti
 
Task-Based Information Retrieval
Task-Based Information RetrievalTask-Based Information Retrieval
Task-Based Information RetrievalDarío Garigliotti
 
Type Information in Entity Retrieval
Type Information in Entity RetrievalType Information in Entity Retrieval
Type Information in Entity RetrievalDarío Garigliotti
 
If this is the answer, what was the question?
If this is the answer, what was the question?If this is the answer, what was the question?
If this is the answer, what was the question?Darío Garigliotti
 
Semi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationSemi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationDarío Garigliotti
 
Semi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationSemi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationDarío Garigliotti
 
Semi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationSemi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationDarío Garigliotti
 

More from Darío Garigliotti (20)

Task-Based Support in Search Engines
Task-Based Support in Search EnginesTask-Based Support in Search Engines
Task-Based Support in Search Engines
 
Task Recommendation
Task RecommendationTask Recommendation
Task Recommendation
 
About "Towards Better Text Understanding and Retrieval through Kernel Entity ...
About "Towards Better Text Understanding and Retrieval through Kernel Entity ...About "Towards Better Text Understanding and Retrieval through Kernel Entity ...
About "Towards Better Text Understanding and Retrieval through Kernel Entity ...
 
A Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion EnginesA Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion Engines
 
A Summary of ECIR'18
A Summary of ECIR'18A Summary of ECIR'18
A Summary of ECIR'18
 
A Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion EnginesA Semantic Search Approach to Task-Completion Engines
A Semantic Search Approach to Task-Completion Engines
 
A Knowledge Base of Entity-Oriented Search Intents
A Knowledge Base of Entity-Oriented Search IntentsA Knowledge Base of Entity-Oriented Search Intents
A Knowledge Base of Entity-Oriented Search Intents
 
On Type-Aware Entity Retrieval
On Type-Aware Entity RetrievalOn Type-Aware Entity Retrieval
On Type-Aware Entity Retrieval
 
Learning-to-Rank Target Types for Entity-Bearing Queries
Learning-to-Rank Target Types for Entity-Bearing QueriesLearning-to-Rank Target Types for Entity-Bearing Queries
Learning-to-Rank Target Types for Entity-Bearing Queries
 
Task-Based Information Retrieval
Task-Based Information RetrievalTask-Based Information Retrieval
Task-Based Information Retrieval
 
Type Information in Entity Retrieval
Type Information in Entity RetrievalType Information in Entity Retrieval
Type Information in Entity Retrieval
 
Type-Aware Entity Retrieval
Type-Aware Entity RetrievalType-Aware Entity Retrieval
Type-Aware Entity Retrieval
 
Type-Aware Entity Retrieval
Type-Aware Entity RetrievalType-Aware Entity Retrieval
Type-Aware Entity Retrieval
 
Dive into Deep Learning
Dive into Deep LearningDive into Deep Learning
Dive into Deep Learning
 
Type-Aware Entity Retrieval
Type-Aware Entity RetrievalType-Aware Entity Retrieval
Type-Aware Entity Retrieval
 
If this is the answer, what was the question?
If this is the answer, what was the question?If this is the answer, what was the question?
If this is the answer, what was the question?
 
Semi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationSemi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense Disambiguation
 
Semi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationSemi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense Disambiguation
 
Type-Aware Entity Retrieval
Type-Aware Entity RetrievalType-Aware Entity Retrieval
Type-Aware Entity Retrieval
 
Semi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense DisambiguationSemi-supervised Learning for Word Sense Disambiguation
Semi-supervised Learning for Word Sense Disambiguation
 

Recently uploaded

Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfSumit Kumar yadav
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 

Recently uploaded (20)

Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 

Data Mining - Exploring Data

  • 1. DAT630
 Exploring Data Darío Garigliotti | University of Stavanger 18/09/2017 Introduction to Data Mining, Chapter 3
  • 2. Data Exploration - Preliminary investigation of the data in order to better understand its specific characteristics - Can aid in selecting the appropriate preprocessing and data analysis techniques - Can even address some of the questions typically answered by data mining - Finding patterns by visually inspecting the data
  • 3. Three major topics - Summary statistics - Visualization - On-Line Analytical Processing (OLAP)
  • 5. The Iris data set - Introduced in 1936 by Ronald Fisher - 50 samples from each of three species of Iris - Iris setosa, Iris virginica, and Iris versicolor - Four features from each sample - The length and the width of the sepals and petals
  • 8. Summary Statistics - Quantities that capture various characteristics of a potentially large set of values with a single number (or a small set of numbers) - Examples - Average household income - Fraction of students who complete a BSc in 3 years
  • 9. Frequency - The frequency of an attribute value is the percentage with which the value occurs in the 
 data set - For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time - x is a categorical attribute that can take values 
 {v1,…,vk} and there are m objects in total frequency(vi) = number of objects with attribute value vi m
  • 10. Mode - The mode of an attribute is the most frequent attribute value - The notion of a mode is only interesting if attribute values have different frequencies - The notions of frequency and mode are typically used with categorical data
  • 11. Example - What are the frequencies? - What is the mode? Age Count Frequency 0-9 3 10-19 4 20-29 15 30-39 12 40-49 8 50-59 2 Total 44
  • 12. Example - What are the frequencies? - What is the mode? Age Count Frequency 0-9 3 0.068 10-19 4 0.090 20-29 15 0.340 30-39 12 0.272 40-49 8 0.181 50-59 2 0.045 Total 44 Mode
  • 13. Percentiles - For continuous data, the notion of a percentile is more useful - Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value xp of x such that p% of the observed values of x are less than xp - For instance, the 50th percentile is the value x50% such that 50% of all values of x are less than x50% - min(x)= x0% max(x)= x100%
  • 14. Example - What is the 80th percentile of this data? - 8, 6, 3, 7, 3, 4, 1, 6, 8, 5
  • 15. Example - Sort the data - 1, 3, 3, 4, 5, 6, 6, 7, 8, 8 80% of the values are smaller than 8
  • 16. Mean and Median - Most widely used statistics for continuous data - Let x be an attribute and {x1,…,xm} the values of the attribute for a set of m objects - Let {x(1),…,x(m)} the set of values after sorting - I.e., x(1)=min(x) and x(m)=max(x) mean(x) = ¯x = 1 m mX i=1 xi
  • 17. Mean and Median (2) - The middle value if there is an odd number of values, and the average of the two middle values if the number of values is even ⇢ x(r+1), if m is odd (i.e., m = 2r + 1) 1 2 (x(r) + x(r+1)), if m is even (i.e., m = 2r) median(x) =
  • 18. Mean vs. Median - Both indicate the "middle" of the values - If the distribution of values is skewed, then the median is a better indicator of the middle - The mean is sensitive to the presence of outliers; the median provides a more robust estimate of the middle
  • 19. Trimmed Mean - To overcome problems with the traditional definition of a mean, the notion of a trimmed mean is sometimes used - A percentage p between 0 and 100 is specified; the top and bottom (p/2)% of the data is thrown out; then mean is calculated the normal way - Median is a trimmed mean with p=100%, the standard mean corresponds to p=0%
  • 20. Example - Consider the set of values {1, 2, 3, 4, 5, 90} - What is the mean? - What is the median? - What is the trimmed mean with p=40%?
  • 21. Example - Consider the set of values {1, 2, 3, 4, 5, 90} - What is the mean? 17.5 - What is the median? (3+4)/2 = 3.5 - What is the trimmed mean with p=40%? 3.5 - Trimmed values (with top-20% and bottom-20% of the values thrown out): {2,3,4,5}
  • 22. Range and Variance - To measure the dispersion/spread of a set of values (for continuous data) - Range - Variance* - Standard deviation is the square root of variance range(x) = max(x) min(x) = x(m) x(1) *This variant is known as the "bias-corrected sample variance" variance(x) = s2 x = 1 m 1 mX i=1 (xi ¯x)2
  • 23. Range vs. Variance - Range can be misleading if the values are concentrated in a narrow area, but there are also a relatively small number of extreme values - Hence, the variance is preferred as a measure of spread
  • 24. Example - What is the range and variance of the following data? - 3 24 30 47 43 7 47 13 44 39
  • 25. Example - What is the range and variance of the following data? - 3 24 30 47 43 7 47 13 44 39 - Range: 47-3 = 44 - Variance: 289.57 - mean: 29.7
  • 26. More Robust Estimates of Spread - Variance is particularly sensitive to outliers - The mean can be distorted by outliers; variance uses the squared difference between the mean and other values - Absolute Average Deviation (AAD) AAD(x) = 1 m mX i=1 |xi ¯x|
  • 27. More Robust Estimates of Spread (2) - Median Absolute Deviation (MAD) - Interquartile Range (IQR) MAD(x) = median |x1 ¯x|, . . . , |xm ¯x| IQR(x) = x75% x25%
  • 28. More Robust Estimates of Spread (3)
  • 32. Goals and Motivation - Data visualization is the display of information in a graphic or tabular format - The motivation for using visualization is that people can quickly absorb large amounts of visual information and find patterns in it - Visualization is a powerful and appealing technique for data exploration - Humans can easily detect general patterns and trends as well as outliers and unusual patterns
  • 33. Example - Sea Surface Temperature (SST) for July 1982 - Tens of thousands of data points are summarized in a single figure
  • 34. Outline for this part - General concepts - Representation - Arrangement - Selection - Visualization techniques - Histograms - Box plots - Scatter plots - Contour plots - …
  • 35. Representation - Mapping of information to a visual format - Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors - Objects are often represented as points - Attribute values can be represented as the position of the points or using color, size, shape, etc. - Position can express the relationships among points
  • 36. Arrangement - Placement of visual elements within a display - Can make a large difference in how easy it is to understand the data vs.
  • 37. Selection - Elimination or the de-emphasis of certain objects and attributes - May involve the choosing a subset of attributes - Dimensionality reduction is often used to reduce the number of dimensions to two or three - Alternatively, pairs of attributes can be considered - May also involve choosing a subset of objects - Visualizing all objects can result in a display that is too crowded
  • 38. Outline for this part - General concepts - Visualization techniques - Yet techniques are often very specialized in the data being analyzed, it is possible to group them by some general properties - For example, visualization techniques for: - Small number of attributes - Spatio-temporal data - High-dimensional data
  • 39. Outline for this part - General concepts - Visualization techniques - Histograms - Box plots - Scatter plots - Contour plots - Matrix plots - Parallel coordinates - Star plots - Chernoff faces
  • 40. Histograms - Usually shows the distribution of values of a single variable - Divide the values into bins and show a bar plot of the number of objects in each bin. - The height of each bar indicates the number of objects - Shape of histogram depends on the number of bins
  • 41. Example - Petal width 10 bins 20 bins
  • 42. 2D Histograms - Show the joint distribution of the values of two attributes - Each attribute is divided into intervals and the two sets of intervals define two-dimensional rectangles of values - It can show patterns not present in 1D ones - Visually more complicated, e.g., some columns may be hidden by others
  • 43. Example - Petal width and petal length
  • 44. Box Plots - Way of displaying the distribution of data outlier 10th percentile 25th percentile 75th percentile 50th percentile 10th percentile outlier 10th perce 25th perce 75th perce 50th perce 10th perce
  • 46. Pie Charts - Similar to histograms, but typically used with categorical attributes that have a relatively small number of values - Common in popular articles, but used less frequently in technical publications - The size of relative areas can be hard to judge - Histograms are preferred for technical work!
  • 50. Scatter Plots - Attributes values determine the position - Two-dimensional scatter plots most common, but can have three-dimensional scatter plots - Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  • 52. Example - Arrays of scatter plots to summarize the relationships of several pairs of attributes
  • 53. Contour Plots - Useful when a continuous attribute is measured on a spatial grid - They partition the plane into regions of similar values - The contour lines that form the boundaries of these regions connect points with equal values - The most common example is contour maps of elevation Celsius
  • 55. Parallel Coordinates - Plot the attribute values of high-dimensional data - Instead of using perpendicular axes, use a set of parallel axes - The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line, i.e., each object is represented as a line - The ordering of attributes is important
  • 57. Star Plots - Similar approach to parallel coordinates, but axes radiate from a central point - The line connecting the values of an object is a polygon
  • 59. Example - Other useful information, such as average values or thresholds, can also be encoded
  • 60. Chernoff Faces - Approach created by Herman Chernoff - Each attribute is associated with a characteristic of a face - Size of the face, shape of jaw, shape of forhead, etc. - The value of the attribute determines the appearance of the corresponding facial characteristic - Each object becomes a separate face - Relies on human’s ability to distinguish faces
  • 62. Principles for Visualization Quality - Apprehension to perceive relations among variables - Clarity to distinguish the most important elements - Consistency with previous, related graphs - Efficiency to show complex information in simple ways - Necessity of the graph, vs alternatives - Truthfulness when using magnitudes, relative to scales
  • 64.
  • 65.
  • 67. OLAP - Relational databases put data into tables, while OLAP uses a multidimensional array representation - Such representations of data previously existed in statistics and other fields - There are a number of data analysis and data exploration operations that are easier with such a data representation
  • 68. Converting Tabular Data - Two key steps in converting tabular data into a multidimensional array 1.Identify which attributes are to be the dimensions and which attribute is to be the target attribute - The attributes used as dimensions must have discrete values - The target value is typically a count or continuous value, e.g., the cost of an item - Can have no target variable at all except the count of objects that have the same set of attribute values
  • 69. Converting Tabular Data (2) 2.Find the value of each entry in the multidimensional array by summing the values (of the target attribute) or count of all objects that have the attribute values corresponding to that entry
  • 70. Example - Petal width and length are discretized to have categorical values: low, medium, and high
  • 71. Example - Each unique tuple of petal width, petal length, and species type identifies one element of the array
  • 72. Example - Cross-tabulations can be used to show slices of the multidimensional array
  • 73. Example - Cross-tabulations can be used to show slices of the multidimensional array
  • 74. Data Cube - The key operation of a OLAP is the formation of a data cube - A data cube is a multidimensional representation of data, together with all possible aggregates - Aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions
  • 75. Example - Consider a data set that records the sales of products at a number of company stores at various dates - This data can be represented 
 as a 3 dimensional array - There are 3 two-dimensional
 aggregates, 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total)
  • 76. Example - This table shows one of the two dimensional aggregates, along with two of the one- dimensional aggregates, and the overall total
  • 77. OLAP Operations - Slicing - Dicing - Roll-up - Drill-down
  • 78. Slicing and Dicing - Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions - Dicing involves selecting a subset of cells by specifying a range of attribute values - This is equivalent to defining a subarray from the complete array - In practice, both operations can also be accompanied by aggregation over some dimensions
  • 80. Roll-up and Drill-down - Attribute values often have a hierarchical structure - Each date is associated with a year, month, and week - A location is associated with a continent, country, state (province, etc.), and city - Products can be divided into various categories, such as clothing, electronics, and furniture - These categories often nest and form a tree - A year contains months which contains day - A country contains a state which contains a city