1. Data Warehousing and Data Mining
Dr. Sunil Bhutada
Professor and Head
IT Department
2. About Me
Dr. Sunil Bhutada
B.E. (CSE), M.Tech (S/W Engg), Ph.D. (CSE)
Positions Held
1) 1994-1998 Worked with Indian Rayon & Industries, Baroda
2) 1998-2005 Worked as an Asst Prof with Jatipita Engg College, Adilabad
3) 2005-2017 Worked as Associate Professor with Sreenidhi Institute of Science and Technology,
Hyderabad
4) 2017 onwards Working as Professor with Sreenidhi Institute of Science and Technology,
Hyderabad
5) 2021 onwards Working as Professor & Head with Sreenidhi Institute of Science and Technology,
Hyderabad
3. Syllabus – Unit 1
Introduction:
Fundamentals of data mining, KDD process,
Data Mining Functionalities, Classification of Data Mining systems,
Data Mining Task primitives,
Integration of a Data mining System with a Database or a Data warehouse systems,
Major issues in Data Mining.
Data Preprocessing:
Needs for Preprocessing the Data, Data Cleaning,
Data Integration and Transformation, Data Reduction,
Discretization and Concept Hierarchy Generation,
Data Mining Primitives, Data Mining Query Languages,
Architectures of Data Mining Systems.
4. TEXT BOOKS
1. Data mining: Concepts and Techniques, Jiawei Han
and Micheline Kamber, 2nd Edition, Elsevier, 2006.
2. Data Mining Techniques, Arun K. Pujari, University Press.
6. Information Hierarchy (Basic Concepts)
• Data :
The raw material of information
• Information :
Data organized and presented in a particular manner
• Knowledge :
“Justified true belief”. Information that can be acted upon
• Wisdom :
Distilled and integrated knowledge Demonstrative of high-level “understanding”
7. Information Hierarchy (A facetious Example)
• Data
98.6° F, 99.5° F, 100.3° F, 101° F, …
• Information
Hourly body temperature: 98.6° F, 99.5° F, 100.3° F, 101° F, …
• Knowledge
If you have a temperature above 100° F, you most likely have a fever
• Wisdom
If you don’t feel well, go see a doctor
8. Evolution of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
11. • Corporate decision makers require access to all the
organization’s data, wherever it is located.
• Comprehensive analysis of the organization, its business,
its requirements, and any trends requires access not only
to the current data in the database but also to historical
data.
• This course will involve an in-depth study of various
concepts needed to design and develop a data
warehouse.
• It also provides an introduction to data mining and end
user access tools for a data warehouse.
Why study this subject?
12. • Business Intelligence (BI) is a process of gathering,
analyzing, and transforming raw data into accurate,
efficient, and meaningful information which can be
used to make wise business decisions and refine
business strategy.
• BI gives organizations a sense of clairvoyance.
• Business Intelligence testing initiatives help
companies gain deeper and better insights so they
can manage or make decisions based on hard facts
or data.
Business Intelligence
13. • DBMSs widely used to maintain transactional data
• Attempts to use these data for analysis, exploration,
identification of trends, etc. have led to Decision Support
Systems.
• Trend towards Data Warehousing
• Data Warehousing – consolidation of data from several
databases which are in turn maintained by individual
business units along with historical and summary
information
From DBMS to Decision Support
14. • Data Warehousing (DW) is the process of
collecting and managing data from varied sources
to provide meaningful business insights.
• A Data warehouse is typically used to connect
and analyze business data from heterogeneous
sources.
• The data warehouse is the core of the BI system
which is built for data analysis and reporting.
What is data warehouse?
15. Note:
A data warehouse does not require
transaction processing, recovery, and
concurrency controls, because it is
physically stored separately from the
operational database
Features of Data Warehouse
16. Data warehouse system is also known by the
following name:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse
Various versions of Data Warehouse
17. Growth of Data Warehouse
1960- Dartmouth and General Mills
1970- AC Nielsen (DM)
1983- Teradata Corporation (DSS)
18. What Is Data Mining?
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
• Data mining: a misnomer?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
• Simple search and query processing
• (Deductive) expert systems
19. What is (not) Data Mining?
What is Data Mining?
– Certain names are more
prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in
Boston area)
– Group together similar
documents returned by search
engine according to their context
(e.g. Amazon rainforest,
Amazon.com,)
What is not Data Mining?
– Look up phone number in
phone directory
– Query a Web search engine
for information about “Amazon”
20. Why Mine Data? Commercial Viewpoint
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• purchases at department/grocery stores
• Bank/Credit Card
transactions
• Twice as much information was created in 2002 as in 1999 (~30%
growth rate)
• Other growth rate estimates even higher
21. Largest databases in 2007
• Largest database in the world: World Data Centre for Climate (WDCC) operated by the
Max Planck Institute and German Climate Computing Centre
• 220 terabytes of data on climate research and climatic trends,
• 110 terabytes worth of climate simulation data.
• 6 petabytes worth of additional information stored on tapes.
• AT&T
• 323 terabytes of information
• 1.9 trillion phone call records
• Google
• 91 million searches per day,
• After a year worth of searches, this figure amounts to more than 33 trillion database
entries.
22. Why Mine Data? Scientific Viewpoint
• Data is collected and stored at enormous speeds (GB/hour).
E.g.
– remote sensors on a satellite
– telescopes scanning the skies
– scientific simulations
generating terabytes of data
• Very little data will ever be looked at by a human
• Knowledge Discovery is NEEDED to make sense and use of
data.
23. Data Mining
• Data mining is the process of automatically discovering useful information in large data
repositories.
• Human analysts may take weeks to discover useful information.
• Much of the data is never analyzed at all.
(Chart: “The Data Gap”: total new disk capacity (TB) since 1995 grows far faster than the number of analysts, 1995-1999)
24. Why Data Mining?—Potential Applications
• Data analysis and decision support
• Market analysis and management
• Target marketing, customer relationship management (CRM), market basket analysis,
cross selling, market segmentation
• Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control, competitive
analysis
• Fraud detection and detection of unusual patterns (outliers)
• Other Applications
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis
25. Knowledge Discovery (KDD) Process
• Data mining—core of knowledge discovery process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
26. Knowledge Discovery (KDD) Process – Several Key Steps
1. Data Cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
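The seven KDD steps above can be sketched as a chain of small functions. This is a minimal illustration only; all function names and the toy records below are invented for this sketch, not part of any real library.

```python
# A minimal sketch of the KDD pipeline as composable steps.
# All names and data here are illustrative, not a real API.
from collections import Counter

def clean(records):
    # Step 1 (data cleaning): one simple policy, drop records with missing values
    return [r for r in records if None not in r.values()]

def select(records, attributes):
    # Step 3 (data selection): keep only task-relevant attributes
    return [{a: r[a] for a in attributes} for r in records]

def mine(records):
    # Step 5 (data mining): a trivial "pattern", value frequencies per attribute
    return {a: Counter(r[a] for r in records) for a in records[0]}

raw = [
    {"age": 25, "buys": "PC"},
    {"age": 34, "buys": "PC"},
    {"age": None, "buys": "TV"},  # incomplete record, removed by cleaning
]
data = select(clean(raw), ["buys"])
patterns = mine(data)
print(patterns["buys"])  # frequency of each purchased item
```

Pattern evaluation and presentation (steps 6 and 7) would then inspect and visualize `patterns`.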
27. The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
Developing an understanding of
the application domain
the relevant prior knowledge
the goals of the end-user
Creating a target data set: selecting a data set, or focusing on a subset of variables, or data
samples, on which discovery is to be performed.
Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
28. The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
Data reduction and projection.
Finding useful features to represent the data depending on the goal of the task.
Using dimensionality reduction or transformation methods to reduce the effective number
of variables under consideration or to find invariant representations for the data.
Choosing the data mining task.
Deciding whether the goal of the KDD process is classification, regression, clustering,
etc.
29. The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
Choosing the data mining algorithm(s).
Selecting method(s) to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall criteria of the KDD process.
Data mining.
Searching for patterns of interest in a particular representational form or a set of such
representations as classification rules or trees, regression, clustering, and so forth.
Interpreting mined patterns.
Consolidating discovered knowledge.
30. Data Mining and Business Intelligence
Increasing potential
to support
business decisions End User
Business
Analyst
Data
Analyst
DBA
Decision
Making
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
31. Data Mining: Confluence of Multiple Disciplines
Data Mining
Database
Technology Statistics
Machine
Learning
Pattern
Recognition
Algorithm
Other
Disciplines
Visualization
32. Origins of Data Mining
Machine Learning/
Pattern
Recognition
Statistics/
AI
Data Mining
Database
systems
• Draws ideas from machine learning/AI, pattern recognition, statistics,
and database systems
• Traditional Techniques
may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous,
distributed nature of data
33. Data Mining Functionalities (1)
• Concept description: Characterization and discrimination
• Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Data characterization is summarization of the general characteristics or features of a target class of data.
• For example, to study the characteristics of software products whose sales increased by 10% in the last year, the
data related to such products can be collected by an SQL query
• Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. The target and contrasting classes can be
specified by the user.
• For example: the user may like to compare the general features of software products whose sales increased by
10% in the last year with those whose sales decreased by at least 30% during the same period.
34. Data Mining Functionalities (1)
• Association Analysis (correlation and causality)
• Association Analysis is the discovery of association rules showing attribute value
conditions that occur frequently together in a given set of data. Association is widely
used for market basket or transaction data analysis.
• Multi-dimensional vs. single-dimensional association
• age(X, “20..29”) ^ income(X, “20..29K”) => buys (X, “PC”)
• [support = 2%, confidence = 60%]
• The number of times this item set appears in the database is called its "support"
• Confidence of rule "B given A" is a measure of how much more likely it is that B occurs when A
has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has
occurred
• contains(T, “computer”) => contains(T, “software”) [1%, 75%]
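The support and confidence measures above can be computed directly from a transaction list. A hedged sketch with toy transactions (the item names are illustrative): support of A => B is the fraction of transactions containing both A and B, and confidence is support(A and B) divided by support(A).

```python
# Sketch: support and confidence of an association rule A => B
# over a toy transaction database (each transaction is a set of items).

def support(transactions, itemset):
    # fraction of transactions that contain every item in itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    # how often B occurs among transactions that contain A
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]
print(support(transactions, {"computer", "software"}))       # 0.5
print(confidence(transactions, {"computer"}, {"software"}))  # 2/3
```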
35. Data Mining Functionalities (2)
• Classification and Prediction
• Finding models (functions) that describe and distinguish classes or concepts for future
prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
• Presentation: decision-tree, classification rule, neural network
• Prediction: Predict some unknown or missing numerical values
• Cluster analysis
• Class label is unknown: Group data to form new classes, e.g., cluster houses to find
distribution patterns
• Clustering based on the principle: maximizing the intra-class similarity and minimizing the
interclass similarity
36. Data Mining Functionalities (3)
• Outlier analysis
• Outlier: a data object that does not comply with the general behavior of
the data
• It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
• Trend and evolution analysis
• Trend and deviation: regression analysis
• Sequential pattern mining, periodicity analysis
• Similarity-based analysis
• Other pattern-directed or statistical analyses
37. Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns, not all of them are
interesting.
• Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures: A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
• Objective: based on statistics and structures of patterns, e.g., support, confidence,
etc.
• Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty,
actionability, etc.
38. Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
• Can a data mining system find all the interesting patterns?
• Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization
• Can a data mining system find only the interesting patterns?
• Approaches
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns—mining query optimization
39. Integration of a Data Mining System with a DB or DWH System
• Critical question in the design of a data mining (DM) system is how to integrate or couple
the DM system with a database (DB) system and/or a data warehouse (DW) system.
Four ways we can integrate
i. No Coupling,
ii. Loose Coupling
iii. Semi Tight Coupling
iv. Tight Coupling
40. No Coupling:
• Means that a DM system will not utilize any function of a DB or DW system.
• It may fetch data from a particular source (such as a file system), process data using some
data mining algorithms, and then store the mining results in another file.
• Such a system, though simple, suffers from several drawbacks.
• First, a DB system provides a great deal of flexibility and efficiency at storing, organizing,
accessing, and processing data. Without using a DB/DW system, a DM system may spend
a substantial amount of time finding, collecting, cleaning, and transforming data.
• In DB and/or DW systems, data tend to be well organized, indexed, cleaned, integrated, or
consolidated, so that finding the task-relevant, high-quality data becomes an easy task.
41. No Coupling:
• Second,
• There are many tested, scalable algorithms and data structures implemented in DB and DW
systems. It is feasible to realize efficient, scalable implementations using such systems.
• Moreover, most data have been or will be stored in DB/DW systems.
• Without any coupling of such systems, a DM system will need to use other tools to extract
data, making it difficult to integrate such a system into an information processing
environment. Thus, no coupling represents a poor design.
42. Loose Coupling:
• Loose coupling means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository managed by these systems, performing data mining,
and then storing the mining results either in a file or in a designated place in a database or
data warehouse.
• Loose coupling is better than no coupling because it can fetch any portion of data stored
in databases or data warehouses by using query processing, indexing, and other system
facilities.
• It gains some of the advantages of the flexibility, efficiency, and other features provided
by such systems.
• However, many loosely coupled mining systems are main memory-based. Because mining
does not explore data structures and query optimization methods provided by DB or DW
systems, it is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
43. Semi Tight Coupling:
• Semitight coupling means that besides linking a DM system to a DB/DW system,
efficient implementations of a few essential data mining primitives (frequently
encountered data mining functions) can be provided in the DB/DW system.
• These primitives can include sorting, indexing, aggregation, histogram analysis,
multiway join, and precomputation of some essential statistical measures, such as
sum, count, max, min, standard deviation, and so on.
44. Tight Coupling:
• Tight coupling: Tight coupling means that a DM system is smoothly integrated into
the DB/DW system.
• The data mining subsystem is treated as one functional component of an information
system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB
or DW system.
45. Major Issues in Data Mining
• Mining methodology
• Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
• Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
• Data mining query languages and ad-hoc mining
• Expression and visualization of data mining results
• Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
• Domain-specific data mining & invisible data mining
• Protection of data security, integrity, and privacy
46. Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
47. Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for numerical data
49. Data Cleaning
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
50. Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred.
51. How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is
classification); not effective when the percentage of missing values per attribute
varies considerably.
• Fill in the missing value manually: tedious + infeasible
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the
missing value: smarter
• Use the most probable value to fill in the missing value: inference-based such as
Bayesian formula or decision tree
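Two of the fill strategies listed above (attribute mean, and attribute mean per class) can be sketched in a few lines. The data below is a toy example invented for illustration; `None` stands for a missing value.

```python
# Sketch of two missing-value fill strategies from the slide.
from collections import defaultdict

def fill_with_mean(values):
    # replace each None with the mean of the known values
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(rows):
    # rows: (class_label, value); replace None with that class's mean
    sums = defaultdict(lambda: [0.0, 0])
    for label, v in rows:
        if v is not None:
            sums[label][0] += v
            sums[label][1] += 1
    means = {c: s / n for c, (s, n) in sums.items()}
    return [(c, means[c] if v is None else v) for c, v in rows]

print(fill_with_mean([10, None, 30]))  # [10, 20.0, 30]
rows = [("A", 10), ("A", None), ("B", 100), ("B", 300)]
print(fill_with_class_mean(rows))      # the missing "A" value becomes 10.0
```

The class-mean variant is the "smarter" option from the slide: it uses only samples from the same class, so a missing value is not pulled toward other classes' ranges.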
52. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
53. How to Handle Noisy Data?
• Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human
• Regression
• smooth by fitting the data into regression functions
54. Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will
be: W = (B-A)/N.
• The most straightforward
• But outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same number
of samples
• Good data scaling
• Managing categorical attributes can be tricky.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
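The equal-width rule W = (B - A) / N can be sketched directly; the price data below is taken from the worked example on the next slide. The maximum value is clipped into the last bin, one common convention this sketch assumes.

```python
# Equal-width partitioning: W = (B - A) / N, as on the slide.
# Returns the bin index (0..N-1) for each value.

def equal_width_bins(values, n):
    a, b = min(values), max(values)
    w = (b - a) / n
    # clip so the maximum value B falls in the last bin rather than bin n
    return [min(int((v - a) / w), n - 1) for v in values]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]  (width W = (34-4)/3 = 10)
```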
55. Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
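The worked example above can be reproduced in a few lines. This sketch assumes equi-depth bins of fixed size 4 and rounds bin means to whole dollars, matching the numbers on the slide.

```python
# Reproducing the slide's example: equi-depth bins, then smoothing
# by bin means (rounded) and by bin boundaries (nearer of min/max).

def smooth(data, depth):
    bins = [data[i:i + depth] for i in range(0, len(data), depth)]
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]
    return by_means, by_bounds

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
means, bounds = smooth(data, 4)
print(means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```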
56. Data Integration
• Data integration:
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources
• Entity identification problem: identify real world entities from multiple data
sources.
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different sources are
different
• possible reasons: different representations, different scales, e.g., metric vs.
British units
57. Handling Redundant Data in Data Integration
• Redundant data often occur when integrating multiple databases
• The same attribute may have different names in different databases
• One attribute may be a “derived” attribute in another table, e.g., annual
revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
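Correlation analysis for redundancy detection can be sketched with the Pearson coefficient; a value near +1 or -1 suggests one attribute is largely redundant given the other. The "annual revenue derived from monthly revenue" data below is an invented example matching the slide's derived-attribute case.

```python
# Sketch: Pearson correlation between two numeric attributes.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

monthly = [10, 20, 30, 40]
annual = [120, 240, 360, 480]  # a "derived" attribute: 12 * monthly
print(pearson(monthly, annual))  # 1.0, perfectly correlated, redundant
```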
58. Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
Attribute/feature construction
• New attributes constructed from the given ones
59. Data Transformation: Normalization
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• min-max normalization:
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
• z-score normalization:
v' = (v - mean_A) / stand_dev_A
• normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
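The three normalization formulas can be written as small functions. The numeric inputs below (an income-like attribute) are illustrative only.

```python
# The three normalization methods from the slide, as a minimal sketch.

def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    # map v from [mn, mx] onto [new_mn, new_mx]
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean, stdev):
    # distance from the mean in standard deviations
    return (v - mean) / stdev

def decimal_scaling(values):
    # divide by 10^j, the smallest power putting all |v'| below 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max(73600, 12000, 98000))   # ~0.716 on [0, 1]
print(z_score(73600, 54000, 16000))   # 1.225
print(decimal_scaling([-986, 917]))   # [-0.986, 0.917], here j = 3
```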
60. Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining may take a
very long time to run on the complete data set
Data reduction
Obtains a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
61. Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
Concept hierarchies
Reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as young,
middle-aged, or senior).
62. Discretization https://www.intellspot.com/data-types/
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
63. Discretization and concept hierarchy generation for numeric data
• Binning
• Histogram analysis
• Clustering analysis
• Entropy-based discretization
• Segmentation by natural partitioning
64. Concept hierarchy generation for categorical data
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes
65. Specification of a set of attributes
Concept hierarchy can be automatically generated based on the number of distinct
values per attribute in the given attribute set.
The attribute with the most distinct values is placed at the lowest level of the
hierarchy.
country: 15 distinct values
province_or_state: 65 distinct values
city: 3567 distinct values
street: 674,339 distinct values
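The heuristic above (most distinct values at the lowest level) can be sketched as a sort over the distinct-value counts; the counts are the ones from the slide.

```python
# Sketch: order attributes into a concept hierarchy by distinct-value
# count, with the fewest distinct values at the top (slide's heuristic).

def hierarchy_by_distinct(counts):
    # returns attributes from the top of the hierarchy down
    return [attr for attr, _ in sorted(counts.items(), key=lambda kv: kv[1])]

counts = {"street": 674339, "city": 3567, "province_or_state": 65, "country": 15}
print(hierarchy_by_distinct(counts))
# ['country', 'province_or_state', 'city', 'street']
```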
66. Data Mining Primitives
Misconception: Data mining systems can autonomously dig out all of the
valuable knowledge from a given large database, without human intervention.
If there were no user intervention, the system would uncover a large set of
patterns that may even surpass the size of the database. Hence, user
interaction is required.
This user communication with the system is provided by using a set of data
mining primitives.
67. Why Data Mining Primitives and Languages?
Task-relevant data :
What is the data set I want to mine?
Type of knowledge to be mined :
What kind of knowledge do I want to mine ?
Background knowledge :
What background knowledge could be useful here ?
Pattern interestingness measurements :
What measures can be useful to estimate pattern interestingness ?
Visualization of discovered patterns :
How do I want the discovered patterns to be presented ?
68. Data Mining Primitives :
What Defines a Data Mining Task ?
• A popular misconception about data mining is to expect that data mining systems can
autonomously dig out all of the valuable knowledge and patterns that are embedded in
large databases, without human intervention or guidance.
• Finding all the patterns autonomously in a database? — unrealistic because the
patterns could be too many but uninteresting
• Data mining should be an interactive process
• User directs what to be mined
• Users must be provided with a set of primitives to be used to communicate with the
data mining system
• Incorporating these primitives in a data mining query language
• More flexible user interaction
• Foundation for design of graphical user interface
• Standardization of data mining industry and practice
70. Task-Relevant Data (Minable View)
• The first primitive is the specification of the data on which mining is to be performed.
• Typically, a user is interested in only a subset of the database. It is impractical to mine the
entire database, particularly since the number of patterns generated could be exponential
w.r.t the database size.
• Furthermore, many of the patterns found would be irrelevant to the interests of the user.
• In a relational database, the set of task relevant data can be collected via a relational
query involving operations like selection, projection, join and aggregation.
• This retrieval of data can be thought of as a “subtask” of the data mining task. The data
collection process results in a new data relation called the initial data relation.
71. Task-Relevant Data (Minable View)
• The initial data relation can be ordered or grouped according to the conditions specified in
the query.
• The data may be cleaned or transformed (e.g. aggregated on certain attributes) prior to
applying data mining analysis.
• This initial relation may or may not correspond to physical relation in the database.
• Since virtual relations are called Views in the field of databases, the set of task-relevant
data for data mining is called a minable view
• If the data mining task is to study associations between items frequently purchased at
AllElectronics by customers in Canada, the task relevant data can be specified by
providing the following information
73. Task-Relevant Data (Minable View)
• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria
• Data portion to be investigated.
• Attributes of interest (relevant attributes) can be specified.
• Initial data relation
• Minable view
74. Task-Relevant Data (Minable View)
If a data mining task is to study associations between items frequently purchased at All Electronics by
customers in Canada, the task relevant data can be specified by providing the following information:
Name of the database or data warehouse to be used (e.g., AllElectronics_db)
Names of the tables or data cubes containing relevant data (e.g., item, customer, purchases and
items_sold)
Conditions for selecting the relevant data (e.g., retrieve data pertaining to
purchases made in Canada for the current year)
The relevant attributes or dimensions (e.g., name and price from the item table and income and
age from the customer table)
75. The kind of knowledge to be mined
It is important to specify the kind of knowledge to be mined, as this determines the data mining
functions to be performed.
The kinds of knowledge include concept description (characterization and discrimination), association,
classification, predication, clustering, and evolution analysis.
In addition to specifying the kind of knowledge to be mined for a given data mining task, the user can
be more specific and provide pattern templates that all discovered patterns must match
76. The kind of knowledge to be mined
These templates, or metapatterns (also called metarules or metaqueries), can be used to guide the
discovery process. The use of metapatterns is illustrated in the following example.
A user studying the buying habits of AllElectronics customers may choose to mine association rules of
the form:
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Here X is a key of the customer relation, P and Q are predicate variables, and W, Y, and Z are object
variables.
77. The kind of knowledge to be mined
The search for association rules is confined to those matching the given metarule, such as
age(X, "30…39") ^ income(X, "40K…49K") => buys(X, "VCR") [2.2%, 60%] and
occupation(X, "student") ^ age(X, "20…29") => buys(X, "computer") [1.4%, 70%]
The former rule states that customers in their thirties, with an annual income of between 40K and 49K,
are likely (with 60% confidence) to purchase a VCR, and such cases represent about 2.2% of the
total number of transactions.
The latter rule states that customers who are students and in their twenties are likely (with 70%
confidence) to purchase a computer, and such cases represent about 1.4% of the total number of
transactions.
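Figures like [2.2%, 60%] can be reproduced mechanically. A minimal sketch, over invented toy transactions, of how support and confidence are counted:

```python
def support_confidence(transactions, antecedent, consequent):
    """support   = fraction of all transactions containing antecedent AND consequent;
    confidence = among transactions containing the antecedent, the fraction
    that also contain the consequent."""
    ante = [t for t in transactions if antecedent <= t]   # <= is subset test
    both = [t for t in ante if consequent <= t]
    support = len(both) / len(transactions)
    confidence = len(both) / len(ante) if ante else 0.0
    return support, confidence

# Invented toy transactions; each is a set of attribute values and items.
txns = [
    {"age:30-39", "income:40K-49K", "buys:VCR"},
    {"age:30-39", "income:40K-49K", "buys:TV"},
    {"age:20-29", "occupation:student", "buys:computer"},
    {"age:50-59", "buys:camera"},
]
s, c = support_confidence(txns, {"age:30-39", "income:40K-49K"}, {"buys:VCR"})
print(s, c)  # 1 of 4 transactions; 1 of the 2 matching antecedents
```

With these four transactions the rule holds in 1 of 4 transactions (25% support) and in 1 of the 2 transactions matching the antecedent (50% confidence).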
78. Types of knowledge to be mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks
79. Summary
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
• Data cleaning and data integration
• Data reduction and feature selection
• Discretization
• Many methods have been developed, but data preparation remains an active area of
research
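Of the preparation steps listed, discretization lends itself to a short illustration. A minimal equal-width binning sketch; the function name and the sample ages are invented:

```python
def equal_width_bins(values, k):
    """Discretize numeric values into k equal-width intervals,
    returning a bin index (0..k-1) for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1  # avoid zero width when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [23, 27, 34, 38, 45, 52, 58, 61]
print(equal_width_bins(ages, 3))  # → [0, 0, 0, 1, 1, 2, 2, 2]
```

Each numeric age is replaced by the interval it falls into, which is the basic move behind discretization and concept hierarchy generation.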
81. What is Data Warehousing?
A process of transforming data into
information and making it available to
users in a timely enough manner to
make a difference
[Forrester Research, April 1996]
82. Very Large Data Bases
• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
• Petabytes -- 10^15 bytes: Geographic Information Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos
83. What is a Data Warehouse?
A single, complete and consistent store of data
obtained from a variety of different sources, made
available to end users in a way they can
understand and use in a business context.
[Barry Devlin]
84. Data Warehousing -- It is a process
• A technique for assembling and managing data from
various sources for the purpose of answering business
questions, thus making decisions that were not
previously possible
• A decision support database maintained separately
from the organization’s operational database
85. What is a Data Warehouse?
• Defined in many different ways, but not rigorously.
• A decision support database that is maintained separately from the organization’s
operational database
• Supports information processing by providing a solid platform of consolidated, historical data
for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.”—W. H. Inmon
• Data warehousing:
• The process of constructing and using data warehouses
86. Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on
daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
87. Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding structures, attribute measures,
etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
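The hotel-price example can be sketched as a conversion step applied when data is moved into the warehouse. The exchange rates and the flat tax rule below are invented assumptions for illustration:

```python
# Assumed exchange rates to USD (invented values for illustration).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}

def normalize_price(record):
    """Convert a source hotel-price record to one uniform representation:
    USD, tax included, breakfast flag made explicit."""
    price = record["price"] * RATES_TO_USD[record["currency"]]
    if not record.get("tax_included", False):
        price *= 1.10  # assumed flat 10% tax (invented rule)
    return {"price_usd": round(price, 2),
            "breakfast": record.get("breakfast", False)}

src = {"price": 100.0, "currency": "EUR", "tax_included": False, "breakfast": True}
print(normalize_price(src))  # → {'price_usd': 121.0, 'breakfast': True}
```

Whatever the sources' local conventions, every record arrives in the warehouse in one consistent encoding.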
88. Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer than that of
operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a historical perspective (e.g., past 5-
10 years)
• Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain “time element”.
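The contrast can be sketched in a few lines: an operational store keyed on product alone overwrites values, while a warehouse keyed on (date, product) keeps history. The table layout is an invented illustration:

```python
import datetime

operational = {}  # key: product         -> current value only
warehouse = {}    # key: (date, product) -> snapshot; history is kept

def update_price(product, price, day):
    operational[product] = price       # overwrite: current value data
    warehouse[(day, product)] = price  # append: explicit time element in the key

update_price("VCR", 120.0, datetime.date(2024, 1, 6))
update_price("VCR", 110.0, datetime.date(2024, 1, 7))
print(operational)      # only the latest value survives
print(len(warehouse))   # both dated snapshots are retained
```

The time element in the warehouse key is exactly what makes historical analysis possible later.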
89. Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the operational
environment.
• Operational update of data does not occur in the data warehouse
environment.
• Does not require transaction processing, recovery, and concurrency control
mechanisms
• Requires only two operations in data accessing:
• initial loading of data and access of data.
90. Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration:
• Build wrappers/mediators on top of heterogeneous databases
• Query driven approach
• When a query is posed to a client site, a meta-dictionary is used to translate the query into
queries appropriate for individual heterogeneous sites involved, and the results are integrated
into a global answer set
• Involves complex information filtering, and queries compete with local processing for resources
• Data warehouse: update-driven, high performance
• Information from heterogeneous sources is integrated in advance and stored in warehouses for direct
query and analysis
91. Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration,
accounting, etc.
• OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
92. OLTP vs. OLAP

                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical; summarized,
                    detailed, flat relational;    multidimensional; integrated,
                    isolated                      consolidated
usage               repetitive                    ad-hoc
access              read/write; index/hash        lots of scans
                    on primary key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response
93. Why Separate Data Warehouse?
• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
• missing data: Decision support requires historical data which operational DBs do not
typically maintain
• data consolidation: DS requires consolidation (aggregation, summarization) of data
from heterogeneous sources
• data quality: different sources typically use inconsistent data representations, codes
and formats which have to be reconciled
94. Typical Process Flow Within a Data Warehouse
Figure: Process flow within a data warehouse -- data is extracted and loaded from the Source,
transformed and moved into the Warehouse (with archive data kept alongside), and served to Users
through queries.
95. Typical Process Flow Within a Data Warehouse (cont.)
1. Extract and load the data
2. Clean and transform data into a form that can cope with large data
volumes and provide good query performance.
3. Back up and archive data
4. Manage queries and direct them to the appropriate data sources
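The four steps above can be sketched as a toy pipeline; the function boundaries and data shapes are invented for illustration:

```python
def extract_and_load(sources):
    """Step 1: pull raw rows from each source into a temporary store."""
    return [row for src in sources for row in src]

def clean_and_transform(rows):
    """Step 2: drop incomplete rows and reshape into a query-friendly form."""
    return [{"product": r["product"].strip().lower(), "amount": r["amount"]}
            for r in rows if r.get("amount") is not None]

def backup(rows):
    """Step 3: archive a copy of the loaded data (here, just a list copy)."""
    return list(rows)

def query(rows, product):
    """Step 4: direct a query at the warehouse data."""
    return sum(r["amount"] for r in rows if r["product"] == product)

sources = [
    [{"product": " VCR ", "amount": 2}, {"product": "TV", "amount": None}],
    [{"product": "vcr", "amount": 3}],
]
warehouse = clean_and_transform(extract_and_load(sources))
archive = backup(warehouse)
print(query(warehouse, "vcr"))  # → 5
```

Note how cleaning (trimming and lower-casing product names) is what lets the two source spellings of "VCR" aggregate into one answer.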
96. Extract and Load Process
1. Controlling the Process
- Determine when to start extracting the data
2. When to initiate the extract
- Data should be in a consistent state
- Start extracting data from data sources when it represents the same snapshot of time as
all the other data sources
3. Loading the data
- Do not execute consistency checks until all the data sources have been loaded
into the temporary data store
4. Copy Management Tools and Data cleanup
97. Clean and Transform Data
1. Clean and Transform the data
Data needs to be cleaned and checked in the following ways:
- Make sure data is consistent within itself
- Make sure that data is consistent with other data within the same source
- Make sure data is consistent with data in the other source systems.
- Make sure data is consistent with the information already in the warehouse
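These checks can be sketched as predicates run over the temporary data store before loading; the specific check rules and sample rows are invented:

```python
def consistent_within_itself(rows):
    """e.g., no row may have a missing key or a negative amount (invented rule)."""
    return all(r.get("id") is not None and r["amount"] >= 0 for r in rows)

def consistent_across_sources(source_a, source_b):
    """e.g., the same customer id must map to the same name in both sources."""
    names_a = {r["id"]: r["name"] for r in source_a}
    return all(names_a.get(r["id"], r["name"]) == r["name"] for r in source_b)

a = [{"id": 1, "name": "Ann", "amount": 10}]
b = [{"id": 1, "name": "Ann", "amount": 4}]
print(consistent_within_itself(a), consistent_across_sources(a, b))  # → True True
```

Real warehouses run many such predicates; a failed check flags the offending rows for cleanup before they ever reach the warehouse structure.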
98. Clean and Transform Data (cont.)
2. Transforming into Effective Structure
- Once the data has been cleaned, convert the source data in the
temporary data store into a structure that is designed to balance
query performance and operational cost
99. Backup and Archive Process
• The data within the data warehouse is backed up regularly
in order to ensure that the data warehouse can always be
recovered from data loss, software failure or hardware
failure.
100. Query Management Process
• System process that manages the queries and speeds them up by directing queries to
the most effective data source.
• Directing Queries to the suitable tables
• Maximizing System Resources
• Query Capture
- Query profiles change on a regular basis
- In order to accurately monitor and understand what the new query profiles are, it
can be very effective to capture the physical queries that are being executed.
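Query direction and query capture can be sketched together; the routing rule and the log format below are invented:

```python
query_log = []  # captured physical queries, kept for later profile analysis

def route(query):
    """Send aggregate queries to a precomputed summary table and everything
    else to the detail table, capturing each query as it passes through."""
    query_log.append(query)
    return "sales_summary" if query.get("aggregate") else "sales_detail"

print(route({"aggregate": True,  "group_by": "region"}))  # → sales_summary
print(route({"aggregate": False, "cust_id": 42}))         # → sales_detail
print(len(query_log))                                     # → 2
```

Because every executed query lands in the log, shifting query profiles can be monitored and the routing rule revised accordingly.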
101. Design of a Data Warehouse: A Business Analysis Framework
• Four views regarding the design of a data warehouse
• Top-down view
• allows selection of the relevant information necessary for the data warehouse
• Data source view
• exposes the information being captured, stored, and managed by operational systems
• Data warehouse view
• consists of fact tables and dimension tables
• Business query view
• sees the perspectives of data in the warehouse from the view of end-user
102. Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
• Top-down: Starts with overall design and planning (mature)
• Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
• Waterfall: structured and systematic analysis at each step before proceeding to the next
• Spiral: rapid generation of increasingly functional systems, short turnaround time, quick
modification
• Typical data warehouse design process
• Choose a business process to model, e.g., orders, invoices, etc.
• Choose the grain (atomic level of data) of the business process
• Choose the dimensions that will apply to each fact table record
• Choose the measure that will populate each fact table record
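The four design choices can be recorded as a small schema sketch; the business process, grain, dimension, and measure names are invented examples:

```python
# A star-schema design for an invented "orders" business process.
design = {
    "business_process": "orders",
    "grain": "one row per order line item",
    "dimensions": ["date", "customer", "product", "store"],
    "measures": ["quantity", "sales_amount"],
}

def make_fact_record(dim_keys, measures, design):
    """A fact-table record carries one foreign key per chosen dimension
    plus the chosen measures, per the design above."""
    assert set(dim_keys) == set(design["dimensions"])
    assert set(measures) == set(design["measures"])
    return {**dim_keys, **measures}

fact = make_fact_record(
    {"date": "2024-01-07", "customer": 1, "product": 10, "store": 3},
    {"quantity": 2, "sales_amount": 240.0},
    design,
)
print(fact)
```

The `assert`s enforce the design: every fact record must reference exactly the dimensions chosen for the process and carry exactly the chosen measures.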
104. Three Data Warehouse Models
• Enterprise warehouse
• collects all of the information about subjects spanning the entire organization
• Data Mart
• a subset of corporate-wide data that is of value to a specific group of users. Its
scope is confined to specific, selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
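A virtual warehouse in miniature, using sqlite3 from the Python standard library: a summary view is defined directly over an "operational" table, so no data is copied; the table and view names are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("east", 50.0), ("west", 70.0)])

# A virtual-warehouse "summary view": nothing is materialized here;
# the view is recomputed from the operational table on demand.
con.execute("""CREATE VIEW sales_by_region AS
               SELECT region, SUM(amount) AS total
               FROM sales GROUP BY region""")

rows = con.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
print(rows)  # → [('east', 150.0), ('west', 70.0)]
```

Materializing only some summary views would mean copying selected view results into their own tables, trading storage for query speed.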
105. Data Warehouse Development: A Recommended Approach
Figure: Recommended development approach -- define a high-level corporate data model; through
repeated model refinement, build data marts, then distributed data marts, and ultimately a
multi-tier enterprise data warehouse.
106. DMQL—A Data Mining Query Language
• Motivation
• A DMQL can provide the ability to support ad-hoc and interactive data mining
• By providing a standardized language like SQL, the hope is to achieve an effect similar to
the one SQL has had on relational databases
• Foundation for system development and evolution
• Facilitate information exchange, technology transfer, commercialization and wide
acceptance
108. Integration of Data Mining and Data Warehousing
• Data mining systems, DBMS, Data warehouse systems coupling
• No coupling, loose-coupling, semi-tight-coupling, tight-coupling
• On-line analytical mining (OLAM)
• integration of mining and OLAP technologies
• Interactive mining of multi-level knowledge
• Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling,
pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
• e.g., characterization followed by classification, or first clustering and then association
109. Coupling Data Mining with DB/DW Systems
• No coupling—flat file processing, not recommended
• Loose coupling
• Fetching data from DB/DW
• Semi-tight coupling—enhanced DM performance
• The DB/DW system provides efficient implementations of a few data mining primitives, e.g., sorting,
indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
• Tight coupling—A uniform information processing environment
• DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining
query analysis, data structures, indexing, and query processing methods
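Loose coupling in miniature: the mining code fetches task-relevant data from the database through ordinary queries and then does its own processing outside the DBMS. The schema and rows below are invented:

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE purchases (cust_id INTEGER, item TEXT)")
con.executemany("INSERT INTO purchases VALUES (?, ?)",
                [(1, "VCR"), (2, "computer"), (1, "computer"), (3, "computer")])

# Loose coupling: the DB system only supplies the data...
rows = con.execute("SELECT item FROM purchases").fetchall()

# ...and the mining step (here, a trivial frequency count) runs outside it.
freq = Counter(item for (item,) in rows)
print(freq.most_common(1))  # → [('computer', 3)]
```

Semi-tight coupling would instead push primitives like this aggregation down into the DB (e.g., a `GROUP BY` query), and tight coupling would let the optimizer plan the whole mining query.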
110. Architecture: Typical Data Mining System
Figure: Typical data mining system architecture, bottom to top -- data sources (database, data
warehouse, World-Wide Web, and other information repositories) feed, through data cleaning,
integration, and selection, into a database or data warehouse server; above it sit the data mining
engine and the pattern evaluation module, both supported by a knowledge base, with a graphical
user interface on top.
111. 10 Open Source ETL Tools
https://www.datasciencecentral.com/profiles/blogs/10-open-source-etl-tools