Issues in data mining Patterns Online Analytical Processing
1. Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S.A.Shivarkar
Assistant Professor
Contact No.8275032712
Email- shivarkarsandipcomp@sanjivani.org.in
Subject- Data Mining and Warehousing (CO314)
Unit –I: Introduction to Data Mining
2. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 2
Content
Kinds of pattern and technologies
Issues in mining
OLAP, knowledge representation, Information and Knowledge
3. Kinds of pattern and technologies
We have observed various types of data and information repositories on which data mining can
be performed.
Let us now examine the kinds of patterns that can be mined.
There are a number of data mining functionalities.
These include characterization and discrimination; the mining of frequent patterns, associations,
and correlations; classification and regression; clustering analysis; and outlier analysis.
Data mining functionalities are used to specify the kinds of patterns to be found in data mining
tasks.
4. Kinds of pattern and technologies
Pattern mining concentrates
on identifying rules that
describe specific patterns
within the data.
Market-basket analysis,
which identifies items that
typically occur together in
purchase transactions, was
one of the first applications
of data mining.
5. Kinds of pattern and technologies
In general, such tasks can be classified into two categories:
Descriptive:
Descriptive mining tasks characterize properties of the data in a target data
set.
Predictive:
Predictive mining tasks perform induction on the current data in order to
make predictions.
6. Class/Concept Description
Data entries can be associated with classes or concepts.
e.g. in the AllElectronics store, classes of items for sale include computers and printers, and
concepts of customers include bigSpenders and budgetSpenders.
It can be useful to describe individual classes and concepts in summarized, concise, and yet
precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived using:
(1) data characterization, by summarizing the data of the class under study (often called the target
class) in general terms, or
(2) data discrimination, by comparison of the target class with one or a set of comparative classes
(often called the contrasting classes), or
(3) both data characterization and discrimination.
7. Data Characterization
As noted earlier, data entries can be associated with classes or concepts.
Data characterization is a summarization of the general characteristics or features of a target class of data.
The data corresponding to the user-specified class are typically collected by a query.
e.g. to study the characteristics of software products with sales that increased by 10% in the previous year,
the data related to such products can be collected by executing an SQL query on the sales database.
The data cube-based OLAP roll-up operation can be used to perform user-controlled data summarization
along a specified dimension.
The output of data characterization can be presented in various forms. Examples include pie charts, bar
charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs.
The resulting descriptions can also be presented as generalized relations or in rule form (called
characteristic rules).
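The characterization steps above (collect the target class with a query, then summarize it) can be sketched in a few lines of Python. This is only an illustrative sketch: the `products` records, their field names, and the figures are hypothetical and do not come from the AllElectronics example.

```python
from collections import Counter

# Hypothetical product records; names, fields, and numbers are made up.
products = [
    {"name": "A", "category": "games", "price": 20, "growth": 0.12},
    {"name": "B", "category": "games", "price": 30, "growth": 0.15},
    {"name": "C", "category": "office", "price": 100, "growth": 0.10},
    {"name": "D", "category": "office", "price": 90, "growth": -0.05},
    {"name": "E", "category": "utilities", "price": 10, "growth": 0.02},
]

# Step 1: collect the target class (sales grew by at least 10%),
# playing the role of the SQL query mentioned in the slide.
target = [p for p in products if p["growth"] >= 0.10]

# Step 2: summarize the general features of the target class.
summary = {
    "count": len(target),
    "avg_price": sum(p["price"] for p in target) / len(target),
    "by_category": dict(Counter(p["category"] for p in target)),
}

print(summary)
```

The `summary` dictionary plays the role of a generalized relation; the same numbers could equally be shown as a bar chart or a crosstab.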
8. Data Discrimination
Data discrimination is a comparison of the general features of the target class data objects
against the general features of objects from one or multiple contrasting classes.
The target and contrasting classes can be specified by a user, and the corresponding data objects
can be retrieved through database queries.
e.g. a user may want to compare the general features of software products with sales that
increased by 10% last year against those with sales that decreased by at least 30% during the same
period.
The methods used for data discrimination are similar to those used for data characterization.
The forms of output presentation are similar to those for characteristic descriptions, although
discrimination descriptions should include comparative measures that help to distinguish between
the target and contrasting classes.
Discrimination descriptions expressed in the form of rules are referred to as discriminant rules.
9. Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data.
There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences
(also known as sequential patterns), and frequent substructures.
A frequent itemset typically refers to a set of items that often appear together in a transactional
data set, e.g. milk and bread, which are frequently bought together in grocery stores by many
customers.
A frequently occurring subsequence, such as the pattern that customers tend to purchase first a
laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be
combined with itemsets or subsequences.
If a substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and correlations within
data.
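As a rough illustration of frequent itemset mining, the sketch below counts how often each pair of items appears together and keeps the pairs that meet a minimum support. The transactions, the function name `frequent_pairs`, and the 40% threshold are assumptions made for this example. It is a brute-force count over pairs only, not a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; not taken from the slides.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs with support >= min_support."""
    counts = Counter()
    for basket in transactions:
        # Sort so that every pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


result = frequent_pairs(transactions, min_support=0.4)
for pair, sup in sorted(result.items()):
    print(pair, sup)
```

Here the pair (bread, milk) appears in 3 of the 5 baskets, so it is the most frequent pair in this toy data set.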
10. Support and Confidence
As we know, data mining refers to extracting or mining knowledge from large amounts of data.
In other words, data mining is the science, art, and technology of exploring large and
complex bodies of data in order to discover useful patterns.
Support
In data mining, support refers to the relative frequency of an itemset in a dataset.
e.g. if an itemset occurs in 5% of the transactions in a dataset, it has a support of 5%.
Support is often used as a threshold for identifying frequent itemsets in a dataset,
which can be used to generate association rules.
e.g. if we set the support threshold to 5%, then any itemset that occurs in at least
5% of the transactions in the dataset will be considered a frequent itemset.
11. Support and Confidence
Support
The support of an itemset is the number of transactions in which the itemset
appears, divided by the total number of transactions.
e.g. suppose we have a dataset of 1000 transactions, and the itemset {milk, bread}
appears in 100 of those transactions. The support of the itemset {milk, bread} would
be calculated as follows:
Support({milk, bread})
= Number of transactions containing {milk, bread} / Total number of transactions
= 100 / 1000 = 10%
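The support calculation above can be checked with a short Python sketch. The `support` helper and the synthetic transaction list are assumptions for illustration. The list is built only to match the counts in the example (1000 transactions, 100 of which contain both milk and bread).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Synthetic data matching the slide's counts; the "eggs" baskets are filler.
transactions = [{"milk", "bread"}] * 100 + [{"eggs"}] * 900

milk_bread_support = support({"milk", "bread"}, transactions)
print(f"{milk_bread_support:.0%}")
```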
12. Confidence
In data mining, confidence is a measure of the reliability of a given
association rule. It is defined as the proportion of the transactions containing
the items in the antecedent (the “if” part of the rule) that also contain the items
in the consequent (the “then” part of the rule).
In other words, confidence estimates the likelihood that one itemset will appear
in a transaction given that another itemset appears.
13. Confidence
E.g. suppose {milk, bread} appears in 100 transactions and {milk} appears in 200 transactions.
Confidence("If a customer buys milk, they will also buy bread")
= Number of transactions containing {milk, bread} / Number of transactions containing {milk}
= 100 / 200 = 50%
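The confidence of the rule milk → bread can be computed the same way. The sketch below is illustrative only: the `confidence` helper and the synthetic transactions are assumptions, built so that 200 transactions contain milk and 100 of those also contain bread.

```python
def confidence(antecedent, consequent, transactions):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        raise ValueError("antecedent never occurs; confidence is undefined")
    with_both = sum(1 for t in with_antecedent if consequent <= set(t))
    return with_both / len(with_antecedent)


# Synthetic data: 100 baskets with milk and bread, 100 with milk only,
# and 800 filler baskets without milk.
transactions = [{"milk", "bread"}] * 100 + [{"milk"}] * 100 + [{"eggs"}] * 800

milk_to_bread = confidence({"milk"}, {"bread"}, transactions)
print(f"{milk_to_bread:.0%}")
```

Note that confidence is not symmetric: in this synthetic data, every basket that contains bread also contains milk, so the reverse rule bread → milk has a confidence of 100%.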
14. Introduction to Data
We frequently hear the words Data, Information and Knowledge used as if
they are the same thing.
Data are the facts of the world.
For example, take yourself. You may be 5 ft tall, have brown hair and blue
eyes. All of this is “data”. You have brown hair whether this is written
down somewhere or not.
15. Data
In many ways, data can be thought of as a description of the World.
We can perceive this data with our senses, and then the brain can
process this.
16. Information
Information allows us to expand our knowledge beyond the range of our senses. We
can capture data in information, then move it about so that other people can access it
at different times.
If I take a picture of you, the photograph is information. But what you look like is data.
17. Knowledge
Knowledge is what we know. Think of this
as the map of the World we build inside our
brains.
Like a physical map, it helps us know
where things are – but it contains more
than that.
It also contains our beliefs and
expectations. “If I do this, I will probably
get that.”
Crucially, the brain links all these things
together into a giant network of ideas,
memories, predictions, beliefs, etc.
19. Online Analytical Processing (OLAP)
OLAP, or online analytical processing, is technology for performing high-speed complex
queries or multidimensional analysis on large volumes of data in a data
warehouse, data lake or other data repository.
OLAP is used in business intelligence (BI), decision support, and a variety of business
forecasting and reporting applications.
The core of most OLAP systems, the OLAP cube, is an array-based multidimensional
database that makes it possible to process and analyze multiple data dimensions much
more quickly and efficiently than a traditional relational database.
In theory, a cube can contain an infinite number of layers. (An OLAP cube representing
more than three dimensions is sometimes called a hypercube.) And smaller cubes can
exist within layers—for example, each store layer could contain cubes arranging sales
by salesperson and product. In practice, data analysts will create OLAP cubes
containing just the layers they need, for optimal analysis and performance.
20. Online Analytical Processing (OLAP) cont…
Drill-down
The drill-down operation converts less-detailed data into more-detailed data
through one of two methods—moving down in the concept hierarchy or adding
a new dimension to the cube. For example, if you view sales data for an
organization’s calendar or fiscal quarter, you can drill-down to see sales for each
month, moving down in the concept hierarchy of the “time” dimension.
Roll up
Roll up is the opposite of the drill-down function—it aggregates data on an OLAP
cube by moving up in the concept hierarchy or by reducing the number of
dimensions. For example, you could move up in the concept hierarchy of the
“location” dimension by viewing each country's data, rather than each city.
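Roll-up and drill-down can both be pictured as re-aggregating the same facts at a coarser or finer level of a concept hierarchy. The sketch below uses plain Python dictionaries rather than a real OLAP engine. The `sales` rows, the `aggregate` helper, and all figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows with two hierarchies:
# location (city -> country) and time (month -> quarter).
sales = [
    {"country": "USA", "city": "Boston", "quarter": "Q1", "month": "Jan", "amount": 100},
    {"country": "USA", "city": "Boston", "quarter": "Q1", "month": "Feb", "amount": 150},
    {"country": "USA", "city": "Chicago", "quarter": "Q1", "month": "Jan", "amount": 200},
    {"country": "Canada", "city": "Toronto", "quarter": "Q1", "month": "Mar", "amount": 50},
    {"country": "Canada", "city": "Toronto", "quarter": "Q2", "month": "Apr", "amount": 70},
]


def aggregate(rows, dims):
    """Sum `amount` for each combination of values of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


# Starting view: sales per city and quarter.
by_city_quarter = aggregate(sales, ["city", "quarter"])

# Roll up: move up the location hierarchy from city to country.
by_country_quarter = aggregate(sales, ["country", "quarter"])

# Drill down: move down the time hierarchy from quarter to month.
by_city_month = aggregate(sales, ["city", "month"])

print(by_country_quarter)
print(by_city_month)
```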
21. Online Analytical Processing (OLAP) cont…
Slice and dice
The slice operation creates a sub-cube by fixing a single value on one dimension of the
main OLAP cube. For example, you can perform a slice by selecting all data for
the organization's first fiscal or calendar quarter (time dimension).
The dice operation isolates a sub-cube by selecting values on several dimensions within
the main OLAP cube. For example, you could perform a dice operation by
selecting all data for particular calendar or fiscal quarters (time
dimension) and within the U.S. and Canada (location dimension).
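Slice and dice amount to filtering the cube's cells. A minimal sketch, assuming the cube is held as a list of rows: the helper names `slice_cube` and `dice_cube` and the data are hypothetical, not part of any OLAP product's API.

```python
# Hypothetical cube cells: one row per (country, quarter) combination.
cube = [
    {"country": "USA", "quarter": "Q1", "amount": 100},
    {"country": "USA", "quarter": "Q2", "amount": 120},
    {"country": "Canada", "quarter": "Q1", "amount": 80},
    {"country": "Canada", "quarter": "Q3", "amount": 60},
    {"country": "Mexico", "quarter": "Q1", "amount": 40},
    {"country": "Mexico", "quarter": "Q2", "amount": 30},
]


def slice_cube(rows, dim, value):
    """Slice: keep the cells with one fixed value on a single dimension."""
    return [r for r in rows if r[dim] == value]


def dice_cube(rows, criteria):
    """Dice: keep the cells whose value on each listed dimension is allowed."""
    return [r for r in rows
            if all(r[dim] in allowed for dim, allowed in criteria.items())]


# Slice on the time dimension: first quarter only.
q1 = slice_cube(cube, "quarter", "Q1")

# Dice on time and location: Q1 or Q2, within the U.S. and Canada.
us_canada_h1 = dice_cube(cube, {"quarter": {"Q1", "Q2"},
                                "country": {"USA", "Canada"}})

print(len(q1), sum(r["amount"] for r in us_canada_h1))
```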
22. Online Analytical Processing (OLAP) cont…
Pivot
The pivot function rotates the current cube view to display a new representation
of the data—enabling dynamic multidimensional views of data.
The OLAP pivot function is comparable to the pivot table feature in spreadsheet
software, such as Microsoft Excel, but while pivot tables in Excel can be
challenging, OLAP pivots are relatively easier to use (less expertise is required)
and have a faster response time and query performance.
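A pivot can be pictured as swapping which dimension labels the rows and which labels the columns, without changing the underlying values. The sketch below is illustrative only: the `pivot` helper and the data are assumptions, and a real OLAP tool would do this inside its own engine.

```python
# Hypothetical cube cells.
cube = [
    {"country": "USA", "quarter": "Q1", "amount": 100},
    {"country": "USA", "quarter": "Q2", "amount": 120},
    {"country": "Canada", "quarter": "Q1", "amount": 80},
    {"country": "Canada", "quarter": "Q2", "amount": 60},
]


def pivot(rows, row_dim, col_dim, measure="amount"):
    """Build a nested dict table[row_value][col_value] = summed measure."""
    table = {}
    for r in rows:
        cells = table.setdefault(r[row_dim], {})
        cells[r[col_dim]] = cells.get(r[col_dim], 0) + r[measure]
    return table


# Countries as rows, quarters as columns.
by_country = pivot(cube, "country", "quarter")

# Rotate the view: quarters as rows, countries as columns.
by_quarter = pivot(cube, "quarter", "country")

print(by_country)
print(by_quarter)
```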