Data Warehousing and Data Mining
Dr. Sunil Bhutada
Professor and Head
IT Department
About Me
Dr. Sunil Bhutada
B.E. (CSE), M.Tech (S/W Engg), Ph.D. (CSE)
Positions Held
1) 1994- 1998 Worked with Indian Rayon & Industries, Baroda
2) 1998-2005 Worked as an Asst Prof with Jatipita Engg College, Adilabad
3) 2005-2017 Worked as Associate Professor with Sreenidhi Institute of Science and Technology,
Hyderabad
4) 2017 onwards Working as Professor with Sreenidhi Institute of Science and Technology,
Hyderabad
5) 2021 onwards Working as Professor & Head with Sreenidhi Institute of Science and Technology,
Hyderabad
Syllabus – Unit 1
Introduction:
Fundamentals of data mining, KDD process,
Data Mining Functionalities, Classification of Data Mining systems,
Data Mining Task primitives,
Integration of a Data Mining System with a Database or a Data Warehouse system,
Major issues in Data Mining.
Data Preprocessing:
Needs for Preprocessing the Data, Data Cleaning,
Data Integration and Transformation, Data Reduction,
Discretization and Concept Hierarchy Generation,
Data Mining Primitives, Data Mining Query Languages,
Architectures of Data Mining Systems.
TEXT BOOKS
1. Data mining: Concepts and Techniques, Jiawei Han
and Micheline Kamber, 2nd Edition, Elsevier, 2006.
2. Data Mining Techniques, Arun K. Pujari, University Press.
Information Hierarchy (Basic Concepts)
• Data :
The raw material of information
• Information :
Data organized and presented in a particular manner
• Knowledge :
“Justified true belief”. Information that can be acted upon
• Wisdom :
Distilled and integrated knowledge, demonstrating high-level “understanding”
Information Hierarchy (A facetious Example)
• Data
98.6º F, 99.5º F, 100.3º F, 101º F, …
• Information
Hourly body temperature: 98.6º F, 99.5º F, 100.3º F, 101º F, …
• Knowledge
If you have a temperature above 100º F, you most likely have a fever
• Wisdom
If you don’t feel well, go see a doctor
Evolution of Database Technology
1960s: Data collection, database creation, IMS and network DBMS
1970s: Relational data model, relational DBMS implementation
1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s: Data mining, data warehousing, multimedia databases, and Web databases
2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
Why to study this subject?
• Corporate decision makers require access to all the organization's data, wherever it is located.
• Comprehensive analysis of the organization, its business, its requirements, and any trends requires access not only to the current data in the database but also to historical data.
• This course involves an in-depth study of the concepts needed to design and develop a data warehouse.
• It also provides an introduction to data mining and to end-user access tools for a data warehouse.
Business Intelligence
• Business Intelligence (BI) is the process of gathering, analyzing, and transforming raw data into accurate and meaningful information that can be used to make sound business decisions and refine business strategy.
• BI gives organizations a sense of clairvoyance.
• Business Intelligence initiatives help companies gain deeper and better insights so they can make decisions based on hard facts and data.
From DBMS to Decision Support
• DBMSs are widely used to maintain transactional data.
• Attempts to use these data for analysis, exploration, identification of trends, etc. have led to Decision Support Systems.
• The trend is toward Data Warehousing.
• Data Warehousing: consolidation of data from several databases, each maintained by an individual business unit, along with historical and summary information.
What is a data warehouse?
• Data Warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business insights.
• A data warehouse is typically used to connect and analyze business data from heterogeneous sources.
• The data warehouse is the core of the BI system, which is built for data analysis and reporting.
Features of Data Warehouse
Note:
A data warehouse does not require transaction processing, recovery, or concurrency controls, because it is stored physically separate from the operational database.
Various versions of Data Warehouse
A data warehouse system is also known by the following names:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse
Growth of Data Warehouse
1960- Dartmouth and General Mills
1970- ACNielsen (DM)
1983- Teradata Corporation (DSS)
What Is Data Mining?
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
• Data mining: a misnomer?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
• Simple search and query processing
• (Deductive) expert systems
What is (not) Data Mining?
 What is Data Mining?
– Certain names are more
prevalent in certain US locations
(O’Brien, O’Rourke, O’Reilly… in
Boston area)
– Group together similar
documents returned by search
engine according to their context
(e.g. Amazon rainforest,
Amazon.com,)
 What is not Data Mining?
– Look up phone number in
phone directory
– Query a Web search engine
for information about “Amazon”
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected
and warehoused
• Web data, e-commerce
• purchases at department/grocery stores
• Bank/Credit Card
transactions
• Twice as much information was created in 2002 as in 1999 (~30%
growth rate)
• Other growth rate estimates even higher
Largest databases in 2007
• Largest database in the world: World Data Centre for Climate (WDCC) operated by the
Max Planck Institute and German Climate Computing Centre
• 220 terabytes of data on climate research and climatic trends,
• 110 terabytes worth of climate simulation data.
• 6 petabytes worth of additional information stored on tapes.
• AT&T
• 323 terabytes of information
• 1.9 trillion phone call records
• Google
• 91 million searches per day,
• After a year's worth of searches, this figure amounts to more than 33 trillion database entries.
Why Mine Data? Scientific Viewpoint
• Data is collected and stored at enormous speeds (GB/hour).
E.g.
– remote sensors on a satellite
– telescopes scanning the skies
– scientific simulations
generating terabytes of data
• Very little data will ever be looked at by a human
• Knowledge Discovery is NEEDED to make sense and use of
data.
Data Mining
• Data mining is the process of automatically discovering useful information in large data
repositories.
• Human analysts may take weeks to discover useful information.
• Much of the data is never analyzed at all.
[Figure: The Data Gap: total new disk (TB) since 1995 rises steeply toward millions of TB by 1999, while the number of analysts stays nearly flat.]
Why Data Mining?—Potential Applications
• Data analysis and decision support
• Market analysis and management
• Target marketing, customer relationship management (CRM), market basket analysis,
cross selling, market segmentation
• Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control, competitive
analysis
• Fraud detection and detection of unusual patterns (outliers)
• Other Applications
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis
Knowledge Discovery (KDD) Process
• Data mining—core of knowledge discovery process
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
Knowledge Discovery (KDD) Process – Several Key Steps
1. Data Cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
Developing an understanding of:
• the application domain
• the relevant prior knowledge
• the goals of the end-user
Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
Data cleaning and preprocessing.
Removal of noise or outliers.
Collecting necessary information to model or account for noise.
Strategies for handling missing data fields.
Accounting for time sequence information and known changes.
Data reduction and projection.
Finding useful features to represent the data depending on the goal of the task.
Using dimensionality reduction or transformation methods to reduce the effective number
of variables under consideration or to find invariant representations for the data.
Choosing the data mining task.
Deciding whether the goal of the KDD process is classification, regression, clustering,
etc.
Choosing the data mining algorithm(s).
Selecting method(s) to be used for searching for patterns in the data.
Deciding which models and parameters may be appropriate.
Matching a particular data mining method with the overall criteria of the KDD process.
Data mining.
Searching for patterns of interest in a particular representational form or a set of such
representations as classification rules or trees, regression, clustering, and so forth.
Interpreting mined patterns.
Consolidating discovered knowledge.
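To make these steps concrete, here is a minimal Python sketch that walks a toy table through selection, cleaning, transformation, mining, and presentation (the data, column names, and age cut-offs are invented for illustration):

```python
import pandas as pd

# selection + integration: a toy customer table (values are made up)
df = pd.DataFrame({
    "age":     [23, 35, 45, 31, 52, 29, None],
    "income":  [30, 48, 60, 41, 75, 38, 44],          # in K$
    "buys_pc": ["no", "yes", "yes", "yes", "no", "yes", "no"],
})

# cleaning: fill a missing value with the attribute median
df["age"] = df["age"].fillna(df["age"].median())

# transformation: discretize age into interval labels
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 40, 120],
                         labels=["<=30", "31-40", ">40"])

# data mining: a trivial "pattern", the buying rate per age group
pattern = df.groupby("age_group", observed=True)["buys_pc"].apply(
    lambda s: (s == "yes").mean())

# pattern evaluation / knowledge presentation
print(pattern)
```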
Data Mining and Business Intelligence
[Figure: pyramid of increasing potential to support business decisions, bottom to top:]
• Data Sources (paper, files, Web documents, scientific experiments, database systems): DBA
• Data Preprocessing/Integration, Data Warehouses: DBA
• Data Exploration (statistical summary, querying, and reporting): Data Analyst
• Data Mining (information discovery): Data Analyst
• Data Presentation (visualization techniques): Business Analyst
• Decision Making: End User
Data Mining: Confluence of Multiple Disciplines
Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics,
and database systems
• Traditional Techniques
may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous,
distributed nature of data
Data Mining Functionalities (1)
• Concept description: Characterization and discrimination
• Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Data characterization is summarization of the general characteristics or features of a target class of data.
• For example, to study the characteristics of software products whose sales increased by 10% in the last year, the data related to such products can be collected with an SQL query.
• Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. The target and contrasting classes can be
specified by the user.
• For example: the user may wish to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.
Data Mining Functionalities (1)
• Association Analysis (correlation and causality)
• Association Analysis is the discovery of association rules showing attribute value
conditions that occur frequently together in a given set of data. Association is widely
used for market basket or transaction data analysis.
• Multi-dimensional vs. single-dimensional association
• age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”)
• [support = 2%, confidence = 60%]
• Support is the fraction of transactions in the database in which the itemset appears.
• Confidence of a rule “A => B” measures how likely B is to occur when A has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred.
• contains(T, “computer”) => contains(T, “software”) [1%, 75%]
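As a concrete illustration (not from the slides), the following Python sketch computes support and confidence for one candidate rule over an invented transaction list:

```python
# Toy transactions; each is the set of items bought together
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

A, B = {"computer"}, {"software"}          # candidate rule: A => B
n = len(transactions)
support_ab = sum((A | B) <= t for t in transactions) / n   # P(A and B)
support_a = sum(A <= t for t in transactions) / n          # P(A)
confidence = support_ab / support_a                        # P(B | A)
print(f"support = {support_ab:.0%}, confidence = {confidence:.0%}")
# -> support = 50%, confidence = 67%
```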
Data Mining Functionalities (2)
• Classification and Prediction
• Finding models (functions) that describe and distinguish classes or concepts for future
prediction
• E.g., classify countries based on climate, or classify cars based on gas mileage
• Presentation: decision-tree, classification rule, neural network
• Prediction: Predict some unknown or missing numerical values
• Cluster analysis
• Class label is unknown: Group data to form new classes, e.g., cluster houses to find
distribution patterns
• Clustering based on the principle: maximizing the intra-class similarity and minimizing the
interclass similarity
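A minimal sketch contrasting the two functionalities, assuming scikit-learn is available (the toy points and labels are invented):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

X = np.array([[25, 30], [30, 45], [45, 60], [50, 80], [22, 28], [48, 70]])
y = np.array([0, 1, 1, 1, 0, 1])              # class labels are known

clf = DecisionTreeClassifier().fit(X, y)      # classification: supervised
print(clf.predict([[40, 55]]))                # predict the class of a new object

km = KMeans(n_clusters=2, n_init=10).fit(X)   # clustering: labels unknown
print(km.labels_)                             # groups found by similarity alone
```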
Data Mining Functionalities (3)
• Outlier analysis
• Outlier: a data object that does not comply with the general behavior of
the data
• It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
• Trend and evolution analysis
• Trend and deviation: regression analysis
• Sequential pattern mining, periodicity analysis
• Similarity-based analysis
• Other pattern-directed or statistical analyses
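One simple statistical check (an assumed convention, not from the slides): flag values lying more than two standard deviations from the mean of a toy sample:

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])   # 25.0 looks suspicious
z = (x - x.mean()) / x.std()                        # standard scores
print(x[np.abs(z) > 2])                             # -> [25.]
```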
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns, not all of them are
interesting.
• Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures: A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty, potentially useful, novel, or
validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
• Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
• Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
• Can a data mining system find all the interesting patterns?
• Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization
• Can a data mining system find only the interesting patterns?
• Approaches
• First generate all the patterns and then filter out the uninteresting ones.
• Generate only the interesting patterns—mining query optimization
Integration of a Data Mining System with a DB or DWH System
• A critical question in the design of a data mining (DM) system is how to integrate or couple the DM system with a database (DB) system and/or a data warehouse (DW) system.
Four ways we can integrate
i. No Coupling,
ii. Loose Coupling
iii. Semi Tight Coupling
iv. Tight Coupling
No Coupling:
• Means that a DM system will not utilize any function of a DB or DW system.
• It may fetch data from a particular source (such as a file system), process data using some
data mining algorithms, and then store the mining results in another file.
• Such a system, though simple, suffers from several drawbacks.
• First, a DB system provides a great deal of flexibility and efficiency in storing, organizing, accessing, and processing data. Without a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data.
• In DB and/or DW systems, data tend to be well organized, indexed, cleaned, integrated, or
consolidated, so that finding the task-relevant, high-quality data becomes an easy task.
No Coupling:
• Second,
• There are many tested, scalable algorithms and data structures implemented in DB and DW
systems. It is feasible to realize efficient, scalable implementations using such systems.
• Moreover, most data have been or will be stored in DB/DW systems.
• Without any coupling of such systems, a DM system will need to use other tools to extract
data, making it difficult to integrate such a system into an information processing
environment. Thus, no coupling represents a poor design.
Loose Coupling:
• Loose coupling means that a DM system will use some facilities of a DB or DW system,
fetching data from a data repository managed by these systems, performing data mining,
and then storing the mining results either in a file or in a designated place in a database or
data warehouse.
• Loose coupling is better than no coupling because it can fetch any portion of data stored
in databases or data warehouses by using query processing, indexing, and other system
facilities.
• It gains some of the flexibility, efficiency, and other features provided by such systems.
• However, many loosely coupled mining systems are main memory-based. Because mining
does not explore data structures and query optimization methods provided by DB or DW
systems, it is difficult for loose coupling to achieve high scalability and good
performance with large data sets.
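As a hedged sketch of loose coupling, the snippet below lets the DB system (an in-memory SQLite stand-in here) perform selection and projection via SQL, while the mining itself runs in main memory outside the DB:

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")            # stand-in for a real DB system
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [("c1", "computer"), ("c1", "software"),
                  ("c2", "computer"), ("c3", "printer")])

# loose coupling: the DB handles data retrieval ...
rows = conn.execute("SELECT item FROM purchases").fetchall()

# ... and the "mining" happens entirely in main memory
counts = Counter(item for (item,) in rows)
print(counts.most_common(1))                  # the most frequent item
```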
Semi Tight Coupling:
• Semi-tight coupling means that, besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (frequently encountered data mining functions) can be provided in the DB/DW system.
• These primitives can include sorting, indexing, aggregation, histogram analysis,
multiway join, and precomputation of some essential statistical measures, such as
sum, count, max, min, standard deviation, and so on.
Tight Coupling:
• Tight coupling: Tight coupling means that a DM system is smoothly integrated into
the DB/DW system.
• The data mining subsystem is treated as one functional component of an information
system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB
or DW system.
Major Issues in Data Mining
• Mining methodology
• Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
• Integration of the discovered knowledge with existing one: knowledge fusion
• User interaction
• Data mining query languages and ad-hoc mining
• Expression and visualization of data mining results
• Interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts
• Domain-specific data mining & invisible data mining
• Protection of data security, integrity, and privacy
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for numerical data
Forms of data preprocessing
Data Cleaning
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer
income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not have been considered important at the time of entry
• history or changes of the data were not registered
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the
missing value: smarter
• Use the most probable value to fill in the missing value: inference-based such as
Bayesian formula or decision tree
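These strategies map directly onto pandas operations; a sketch with an invented table (column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"income": [30.0, None, 60.0, None, 75.0],
                   "class":  ["low", "low", "high", None, "high"]})

df = df.dropna(subset=["class"])                         # ignore tuple: missing label
df["a"] = df["income"].astype(object).fillna("unknown")  # global constant
df["b"] = df["income"].fillna(df["income"].mean())       # attribute mean
df["c"] = df["income"].fillna(                           # mean within the same class
    df.groupby("class")["income"].transform("mean"))
print(df)
```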
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human
• Regression
• smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the width of intervals will
be: W = (B-A)/N.
• The most straightforward
• But outliers may dominate presentation
• Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
• It divides the range into N intervals, each containing approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
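The example can be reproduced in a few lines of plain Python (only the slide's own numbers are used):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]      # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth, depth 4

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```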
Data Integration
• Data integration:
• combines data from multiple sources into a coherent store
• Schema integration
• integrate metadata from different sources
• Entity identification problem: identify real world entities from multiple data
sources.
• Detecting and resolving data value conflicts
• for the same real world entity, attribute values from different sources are
different
• possible reasons: different representations, different scales, e.g., metric vs.
British units
Handling Redundant Data in Data Integration
• Redundant data often occur when integrating multiple databases
• The same attribute may have different names in different databases
• One attribute may be a “derived” attribute in another table, e.g., annual
revenue
• Redundant data may be able to be detected by correlational analysis
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality
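A small sketch of that correlational check with NumPy (the columns are invented, and annual_revenue is deliberately derived from monthly_revenue):

```python
import numpy as np

monthly_revenue = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual_revenue = monthly_revenue * 12              # a "derived" attribute

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
print(f"correlation = {r:.2f}")                    # -> 1.00: likely redundant
```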
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
Attribute/feature construction
• New attributes constructed from the given ones
Data Transformation: Normalization
• min-max normalization:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
• z-score normalization:
  v' = (v - mean_A) / stand_dev_A
• normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
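A sketch of all three with NumPy (the values are invented, chosen so that decimal scaling divides by 1000; mapping min-max onto [0, 1] is an assumption):

```python
import numpy as np

v = np.array([-986.0, 200.0, 400.0, 917.0])

minmax = (v - v.min()) / (v.max() - v.min())        # assumed new range [0, 1]
zscore = (v - v.mean()) / v.std()
j = int(np.floor(np.log10(np.abs(v).max()))) + 1    # smallest j with max|v'| < 1
decimal = v / 10 ** j                               # here j = 3
print(minmax, zscore, decimal, sep="\n")
```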
Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining may take a
very long time to run on the complete data set
Data reduction
Obtains a reduced representation of the data set that is much smaller in
volume but yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
Discretization and Concept Hierarchy
Discretization
Reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
Concept hierarchies
Reduce the data by collecting and replacing low level concepts (such as
numeric values for the attribute age) by higher level concepts (such as young,
middle-aged, or senior).
Discretization https://www.intellspot.com/data-types/
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
Discretization and concept hierarchy generation for numeric data
• Binning
• Histogram analysis
• Clustering analysis
• Entropy-based discretization
• Segmentation by natural partitioning
Concept hierarchy generation for categorical data
• Specification of a partial ordering of attributes explicitly at the
schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes
Specification of a set of attributes
Concept hierarchy can be automatically generated based on the number of distinct
values per attribute in the given attribute set.
The attribute with the most distinct values is placed at the lowest level of the
hierarchy.
country: 15 distinct values
province_or_state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
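A sketch of this heuristic with pandas: order the attributes by distinct-value count, lowest hierarchy level first (the counts are the slide's; in practice they would come from df.nunique()):

```python
import pandas as pd

counts = pd.Series({"street": 674_339, "city": 3_567,
                    "province_or_state": 65, "country": 15})

hierarchy = counts.sort_values(ascending=False).index.tolist()
print(" -> ".join(hierarchy))
# -> street -> city -> province_or_state -> country (lowest level first)
```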
Data Mining Primitives
 Misconception: Data mining systems can autonomously dig out all of the
valuable knowledge from a given large database, without human intervention.
 If there was no user intervention then the system would uncover a large set of
patterns that might even surpass the size of the database. Hence, user guidance is required.
 This user communication with the system is provided by using a set of data
mining primitives.
Why Data Mining Primitives and Languages?
Task-relevant data :
What is the data set I want to mine?
Type of knowledge to be mined :
What kind of knowledge do I want to mine ?
Background knowledge :
What background knowledge could be useful here ?
Pattern interestingness measurements :
What measures can be useful to estimate pattern interestingness ?
Visualization of discovered patterns :
How do I want the discovered patterns to be presented ?
Data Mining Primitives :
What Defines a Data Mining Task ?
• A popular misconception about data mining is to expect that data mining systems can autonomously dig out all of the valuable knowledge and patterns that are embedded in a large database, without human intervention or guidance.
• Finding all the patterns autonomously in a database? — unrealistic because the
patterns could be too many but uninteresting
• Data mining should be an interactive process
• User directs what to be mined
• Users must be provided with a set of primitives to be used to communicate with the
data mining system
• Incorporating these primitives in a data mining query language
• More flexible user interaction
• Foundation for design of graphical user interface
• Standardization of data mining industry and practice
Primitives for specifying a data mining task
Task-Relevant Data (Minable View)
• The first primitive is the specification of the data on which mining is to be performed.
• Typically, a user is interested in only a subset of the database. It is impractical to mine the
entire database, particularly since the number of patterns generated could be exponential
w.r.t the database size.
• Furthermore, many of the patterns found would be irrelevant to the interests of the user.
• In a relational database, the set of task relevant data can be collected via a relational
query involving operations like selection, projection, join and aggregation.
• This retrieval of data can be thought of as a “subtask” of the data mining task. The data collection process results in a new data relation called the initial data relation.
Task-Relevant Data (Minable View)
• The initial data relation can be ordered or grouped according to the conditions specified in
the query.
• The data may be cleaned or transformed (e.g. aggregated on certain attributes) prior to
applying data mining analysis.
• This initial relation may or may not correspond to physical relation in the database.
• Since virtual relations are called Views in the field of databases, the set of task-relevant
data for data mining is called a minable view
• If data mining task is to study associations between items frequently purchased at
AllElectronics by customers in Canada, the task relevant data can be specified by
providing the following information
Task-Relevant Data (Minable View)
• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria
Task-Relevant Data (Minable View)
• Data portion to be investigated.
• Attributes of interest (relevant attributes) can be specified.
• Initial data relation
• Minable view
Task-Relevant Data (Minable View)
If a data mining task is to study associations between items frequently purchased at All Electronics by
customers in Canada, the task relevant data can be specified by providing the following information:
 Name of the database or data warehouse to be used (e.g., AllElectronics_db)
 Names of the tables or data cubes containing relevant data (e.g., item, customer, purchases and
items_sold)
 Conditions for selecting the relevant data (e.g., retrieve data pertaining to
purchases made in Canada for the current year)
 The relevant attributes or dimensions (e.g., name and price from the item table and income and
age from the customer table)
The kind of knowledge to be mined
It is important to specify the kind of knowledge to be mined, as this determines the data mining
functions to be performed.
The kinds of knowledge include concept description (characterization and discrimination), association, classification, prediction, clustering, and evolution analysis.
In addition to specifying the kind of knowledge to be mined for a given data mining task, the user can
be more specific and provide pattern templates that all discovered patterns must match
The kind of knowledge to be mined
These templates, or metapatterns (also called metarules or metaqueries), can be used to guide the
discovery process. The use of metapatterns is illustrated in the following example.
A user studying the buying habits of Allelectronics customers may choose to mine association rules of
the form:
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
Here X is a key of the customer relation, P and Q are predicate variables, and W, Y, and Z are object variables.
The kind of knowledge to be mined
The search for association rules is confined to those matching the given metarule, such as
age(X, “30..39”) ^ income(X, “40K..49K”) => buys(X, “VCR”) [2.2%, 60%] and
occupation(X, “student”) ^ age(X, “20..29”) => buys(X, “computer”) [1.4%, 70%]
The former rule states that customers in their thirties, with an annual income of between 40K and 49K, are likely (with 60% confidence) to purchase a VCR, and such cases represent about 2.2% of the total number of transactions.
The latter rule states that customers who are students and in their twenties are likely (with 70% confidence) to purchase a computer, and such cases represent about 1.4% of the total number of transactions.
Types of knowledge to be mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks
Summary
• Data preparation is a big issue for both warehousing and mining
• Data preparation includes
• Data cleaning and data integration
• Data reduction and feature selection
• Discretization
• Many methods have been developed, but this is still an active area of research
What is Data Warehousing?
A process of transforming data into
information and making it available to
users in a timely enough manner to
make a difference
[Forrester Research, April 1996]
Very Large Data Bases
• Terabytes -- 10^12 bytes: Walmart, 24 terabytes
• Petabytes -- 10^15 bytes: geographic information systems
• Exabytes -- 10^18 bytes: national medical records
• Zettabytes -- 10^21 bytes: weather images
• Yottabytes -- 10^24 bytes: intelligence agency videos
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
Data Warehousing -- It is a process
• A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making possible decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database
What is Data Warehouse?
• Defined in many different ways, but not rigorously.
• A decision support database that is maintained separately from the organization’s
operational database
• Support information processing by providing a solid platform of consolidated, historical data
for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making
process.”—W. H. Inmon
• Data warehousing:
• The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
• Organized around major subjects, such as customer, product, sales.
• Focusing on the modeling and analysis of data for decision makers, not on
daily operations or transaction processing.
• Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data sources
• relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding structures, attribute measures,
etc. among different data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly longer than that of
operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a historical perspective (e.g., past 5-
10 years)
• Every key structure in the data warehouse
• Contains an element of time, explicitly or implicitly
• But the key of operational data may or may not contain “time element”.
Data Warehouse—Non-Volatile
• A physically separate store of data transformed from the operational
environment.
• Operational update of data does not occur in the data warehouse
environment.
• Does not require transaction processing, recovery, and concurrency control
mechanisms
• Requires only two operations in data accessing:
• initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration:
• Build wrappers/mediators on top of heterogeneous databases
• Query driven approach
• When a query is posed to a client site, a meta-dictionary is used to translate the query into
queries appropriate for individual heterogeneous sites involved, and the results are integrated
into a global answer set
• Complex information filtering, compete for resources
• Data warehouse: update-driven, high performance
• Information from heterogeneous sources is integrated in advance and stored in warehouses for direct
query and analysis
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing)
• Major task of traditional relational DBMS
• Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration,
accounting, etc.
• OLAP (on-line analytical processing)
• Major task of data warehouse system
• Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
• User and system orientation: customer vs. market
• Data contents: current, detailed vs. historical, consolidated
• Database design: ER + application vs. star + subject
OLTP vs. OLAP

Feature            | OLTP                                   | OLAP
users              | clerk, IT professional                 | knowledge worker
function           | day-to-day operations                  | decision support
DB design          | application-oriented                   | subject-oriented
data               | current, up-to-date, detailed,         | historical, summarized,
                   | flat relational, isolated              | multidimensional, integrated, consolidated
usage              | repetitive                             | ad hoc
access             | read/write, index/hash on primary key  | lots of scans
unit of work       | short, simple transaction              | complex query
# records accessed | tens                                   | millions
# users            | thousands                              | hundreds
DB size            | 100 MB to GB                           | 100 GB to TB
metric             | transaction throughput                 | query throughput, response time
Why Separate Data Warehouse?
• High performance for both systems
• DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
• Different functions and different data:
• missing data: Decision support requires historical data which operational DBs do not
typically maintain
• data consolidation: DS requires consolidation (aggregation, summarization) of data
from heterogeneous sources
• data quality: different sources typically use inconsistent data representations, codes
and formats which have to be reconciled
Typical Process Flow Within a Data Warehouse
[Figure: process flow within a data warehouse: Source → (extract and load) → Warehouse → (query) → Users, with data transformation and movement throughout, and data archived along the way.]
1. Extract and load the data
2. Clean and transform data into a form that can cope with large data
volumes and provide good query performance.
3. Back up and archive data
4. Manage queries and direct them to the appropriate data sources
Extract and Load Process
1. Controlling the Process
- Determine when to start extracting the data
2. When to initiate the extract
- Data should be in a consistent state
- Start extracting data from data sources when it represents the same snapshot of time as
all the other data sources
3. Loading the data
- Do not execute consistency checks until all the data sources have been loaded
into the temporary data store
4. Copy Management Tools and Data cleanup
Clean and Transform Data
1. Clean and Transform the data
Data needs to be cleaned and checked in the following ways:
- Make sure data is consistent within itself
- Make sure that data is consistent with other data within the same source
- Make sure data is consistent with data in the other source systems.
- Make sure data is consistent with the information already in the warehouse
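A hedged sketch of a few such checks with pandas (the table, columns, and rules are invented):

```python
import pandas as pd

df = pd.DataFrame({"order_id":   [1, 2, 2, 3],
                   "qty":        [5, -1, 3, 2],
                   "unit_price": [10, 8, 8, 0]})

assert df["order_id"].notna().all()                      # consistent within itself
dupes = df[df.duplicated("order_id")]                    # duplicate keys in a source
bad = df[(df["qty"] <= 0) | (df["unit_price"] <= 0)]     # domain-rule violations
print(dupes, bad, sep="\n")
```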
2. Transforming into Effective Structure
- Once the data has been cleaned, convert the source data in the
temporary data store into a structure that is designed to balance
query performance and operational cost
Backup and Archive Process
• The data within the data warehouse is backed up regularly
in order to ensure that the data warehouse can always be
recovered from data loss, software failure or hardware
failure.
Query Management Process
• System process that manages the queries and speeds them up by directing them to the most effective data source.
• Directing Queries to the suitable tables
• Maximizing System Resources
• Query Capture
- Query profiles change on a regular basis
- In order to accurately monitor and understand what the new query profiles are, it
can be very effective to capture the physical queries that are being executed.
Design of a Data Warehouse: A Business
Analysis Framework
• Four views regarding the design of a data warehouse
• Top-down view
• allows selection of the relevant information necessary for the data warehouse
• Data source view
• exposes the information being captured, stored, and managed by operational systems
• Data warehouse view
• consists of fact tables and dimension tables
• Business query view
• sees the perspectives of data in the warehouse from the view of end-user
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
• Top-down: Starts with overall design and planning (mature)
• Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
• Waterfall: structured and systematic analysis at each step before proceeding to the next
• Spiral: rapid generation of increasingly functional systems, with short turnaround times
• Typical data warehouse design process
• Choose a business process to model, e.g., orders, invoices, etc.
• Choose the grain (atomic level of data) of the business process
• Choose the dimensions that will apply to each fact table record
• Choose the measure that will populate each fact table record
Multi-Tiered Architecture
[Figure: multi-tiered architecture:
• Data Sources: operational DBs and other sources, feeding Extract / Transform / Load / Refresh, with a Monitor & Integrator and a Metadata repository
• Data Storage: the Data Warehouse and Data Marts
• OLAP Server: an OLAP engine serving the stored data
• Front-End Tools: analysis, query, reports, data mining]
Three Data Warehouse Models
• Enterprise warehouse
• collects all of the information about subjects spanning the entire organization
• Data Mart
• a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
• A set of views over operational databases
• Only some of the possible summary views may be materialized
Data Warehouse Development: A Recommended Approach
[Figure: define a high-level corporate data model; build data marts, refining the model as you go; combine distributed data marts into a multi-tier data warehouse; refine further toward an enterprise data warehouse.]
DMQL—A Data Mining Query Language
• Motivation
• A DMQL can provide the ability to support ad-hoc and interactive data mining
• By providing a standardized language like SQL
• Hope to achieve an effect similar to that of SQL on relational databases
• Foundation for system development and evolution
• Facilitate information exchange, technology transfer, commercialization and wide
acceptance
An Example Query in DMQL
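The query on this slide was an image and did not survive extraction. As a hedged sketch only, a DMQL association query in the style of Han and Kamber's examples might look like the following; the database, tables, joins, and thresholds here are illustrative, echoing the AllElectronics example used earlier:

```
use database AllElectronics_db
mine associations as buying_patterns
matching with metarule
    P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
in relevance to C.age, C.income, I.name, I.price
from customer C, item I, purchases P, items_sold S
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
    and P.cust_ID = C.cust_ID
with support threshold = 1%
with confidence threshold = 50%
```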
Integration of Data Mining and Data Warehousing
• Data mining systems, DBMS, Data warehouse systems coupling
• No coupling, loose-coupling, semi-tight-coupling, tight-coupling
• On-line analytical mining data
• integration of mining and OLAP technologies
• Interactive mining multi-level knowledge
• Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling,
pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
• e.g., characterization followed by classification, or first clustering and then association
Coupling Data Mining with DB/DW Systems
• No coupling—flat file processing, not recommended
• Loose coupling
• Fetching data from DB/DW
• Semi-tight coupling—enhanced DM performance
• Efficient implementations of a few data mining primitives are provided in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
• Tight coupling—A uniform information processing environment
• DM is smoothly integrated into a DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
Architecture: Typical Data Mining System
[Figure: layered architecture:
• Graphical User Interface
• Pattern Evaluation, consulting a Knowledge Base
• Data Mining Engine
• Database or Data Warehouse Server: data cleaning, integration, and selection
• Data sources: Database, Data Warehouse, World-Wide Web, and other information repositories]
10 Open Source ETL Tools
https://www.datasciencecentral.com/profiles/blogs/10-open-source-etl-tools
Thank You!
In our next session:
More Related Content

Similar to dwdm unit 1.ppt

Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
bhagathk
 

Similar to dwdm unit 1.ppt (20)

Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Introduction to dm and dw
Introduction to dm and dwIntroduction to dm and dw
Introduction to dm and dw
 
A review on data mining
A  review on data miningA  review on data mining
A review on data mining
 
Data mining
Data miningData mining
Data mining
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Data mining techniques unit 1
Data mining techniques  unit 1Data mining techniques  unit 1
Data mining techniques unit 1
 
Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
BAS 250 Lecture 1
BAS 250 Lecture 1BAS 250 Lecture 1
BAS 250 Lecture 1
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data mining
Data miningData mining
Data mining
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Big data by Mithlesh sadh
Big data by Mithlesh sadhBig data by Mithlesh sadh
Big data by Mithlesh sadh
 
2. olap warehouse
2. olap warehouse2. olap warehouse
2. olap warehouse
 

Recently uploaded

Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 

Recently uploaded (20)

WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
WSO2Con2024 - Organization Management: The Revolution in B2B CIAM
WSO2Con2024 - Organization Management: The Revolution in B2B CIAMWSO2Con2024 - Organization Management: The Revolution in B2B CIAM
WSO2Con2024 - Organization Management: The Revolution in B2B CIAM
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public AdministrationWSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
WSO2CON 2024 - How CSI Piemonte Is Apifying the Public Administration
 
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
WSO2Con2024 - GitOps in Action: Navigating Application Deployment in the Plat...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
WSO2CON 2024 - IoT Needs CIAM: The Importance of Centralized IAM in a Growing...
WSO2CON 2024 - IoT Needs CIAM: The Importance of Centralized IAM in a Growing...WSO2CON 2024 - IoT Needs CIAM: The Importance of Centralized IAM in a Growing...
WSO2CON 2024 - IoT Needs CIAM: The Importance of Centralized IAM in a Growing...
 
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
WSO2CON 2024 - Unlocking the Identity: Embracing CIAM 2.0 for a Competitive A...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
Driving Innovation: Scania's API Revolution with WSO2
Driving Innovation: Scania's API Revolution with WSO2Driving Innovation: Scania's API Revolution with WSO2
Driving Innovation: Scania's API Revolution with WSO2
 
WSO2Con2024 - Unleashing the Financial Potential of 13 Million People
WSO2Con2024 - Unleashing the Financial Potential of 13 Million PeopleWSO2Con2024 - Unleashing the Financial Potential of 13 Million People
WSO2Con2024 - Unleashing the Financial Potential of 13 Million People
 
WSO2CON 2024 - Designing Event-Driven Enterprises: Stories of Transformation
WSO2CON 2024 - Designing Event-Driven Enterprises: Stories of TransformationWSO2CON 2024 - Designing Event-Driven Enterprises: Stories of Transformation
WSO2CON 2024 - Designing Event-Driven Enterprises: Stories of Transformation
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Evolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI EraEvolving Data Governance for the Real-time Streaming and AI Era
Evolving Data Governance for the Real-time Streaming and AI Era
 
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next IntegrationWSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
WSO2CON2024 - Why Should You Consider Ballerina for Your Next Integration
 
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...
 

dwdm unit 1.ppt

From DBMS to Decision Support
• Attempts to use these data for analysis, exploration, and identification of trends have led to Decision Support Systems.
• The trend is toward data warehousing.
• Data warehousing: consolidation of data from several databases, each maintained by an individual business unit, together with historical and summary information.
What Is a Data Warehouse?
• Data warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business insights.
• A data warehouse is typically used to connect and analyze business data from heterogeneous sources.
• The data warehouse is the core of a BI system built for data analysis and reporting.
Features of a Data Warehouse
• Note: A data warehouse does not require transaction processing, recovery, or concurrency controls, because it is stored physically separate from the operational database.
Various Names for a Data Warehouse
A data warehouse system is also known by the following names:
• Decision Support System (DSS)
• Executive Information System
• Management Information System
• Business Intelligence Solution
• Analytic Application
• Data Warehouse
Growth of the Data Warehouse
• 1960 – Dartmouth and General Mills
• 1970 – ACNielsen (data marts)
• 1983 – Teradata Corporation (DSS)
What Is Data Mining?
• Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Data mining: a misnomer?
• Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything “data mining”? Simple search and query processing and (deductive) expert systems are not.
What Is (Not) Data Mining?
What is data mining?
• Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly, etc. in the Boston area)
• Grouping together similar documents returned by a search engine according to their context (e.g., Amazon rainforest vs. Amazon.com)
What is not data mining?
• Looking up a phone number in a phone directory
• Querying a Web search engine for information about “Amazon”
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused: Web data, e-commerce, purchases at department/grocery stores, bank/credit card transactions
• Twice as much information was created in 2002 as in 1999 (~30% annual growth rate); other growth estimates are even higher
Largest Databases in 2007
• Largest database in the world: World Data Centre for Climate (WDCC), operated by the Max Planck Institute and the German Climate Computing Centre — 220 terabytes of data on climate research and climatic trends, 110 terabytes of climate simulation data, and 6 petabytes of additional information stored on tape
• AT&T — 323 terabytes of information, 1.9 trillion phone call records
• Google — 91 million searches per day; a year of searches amounts to more than 33 trillion database entries
Why Mine Data? Scientific Viewpoint
• Data is collected and stored at enormous speeds (GB/hour), e.g., remote sensors on a satellite, telescopes scanning the skies, scientific simulations generating terabytes of data
• Very little of this data will ever be looked at by a human
• Knowledge discovery is needed to make sense and use of the data
Data Mining
• Data mining is the process of automatically discovering useful information in large data repositories.
• Human analysts may take weeks to discover useful information.
• Much of the data is never analyzed at all.
[Figure: “The Data Gap” — total new disk storage (TB) since 1995 grows far faster than the number of analysts.]
Why Data Mining? — Potential Applications
• Data analysis and decision support
• Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
• Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and detection of unusual patterns (outliers)
• Other applications: text mining (newsgroups, email, documents), Web mining, stream data mining, bioinformatics and bio-data analysis
Knowledge Discovery (KDD) Process
• Data mining is the core of the knowledge discovery process.
[Figure: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation]
Knowledge Discovery (KDD) Process — Key Steps
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:
1. Developing an understanding of the application domain, the relevant prior knowledge, and the goals of the end user.
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing: removal of noise or outliers; collecting necessary information to model or account for noise; strategies for handling missing data fields; accounting for time-sequence information and known changes.
4. Data reduction and projection: finding useful features to represent the data depending on the goal of the task; using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
5. Choosing the data mining task: deciding whether the goal of the KDD process is classification, regression, clustering, etc.
6. Choosing the data mining algorithm(s): selecting the method(s) to be used for searching for patterns in the data; deciding which models and parameters may be appropriate; matching a particular data mining method with the overall criteria of the KDD process.
7. Data mining: searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
8. Interpreting mined patterns and consolidating the discovered knowledge.
Data Mining and Business Intelligence
Increasing potential to support business decisions (bottom to top):
• End user — decision making
• Business analyst — data presentation, visualization techniques
• Data analyst — data mining and information discovery; data exploration (statistical summary, querying, and reporting)
• DBA — data preprocessing/integration, data warehouses
• Data sources — paper, files, Web documents, scientific experiments, database systems
Data Mining: Confluence of Multiple Disciplines
Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to the enormity of the data, its high dimensionality, and its heterogeneous, distributed nature
Data Mining Functionalities (1)
• Concept description: characterization and discrimination — generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions.
• Data characterization is a summarization of the general characteristics or features of a target class of data. For example, to study the characteristics of software products whose sales increased by 10% in the last year, the data related to such products can be collected by an SQL query.
• Data discrimination is a comparison of the general features of target-class data objects with the general features of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user. For example, the user may wish to compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by at least 30% during the same period.
Data Mining Functionalities (2)
• Association analysis (correlation and causality): the discovery of association rules showing attribute-value conditions that occur together frequently in a given set of data. Association is widely used for market basket and transaction data analysis.
• Multi-dimensional vs. single-dimensional association:
age(X, “20..29”) ^ income(X, “20K..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
contains(T, “computer”) => contains(T, “software”) [1%, 75%]
• Support: the fraction of transactions in the database in which the itemset appears.
• Confidence of the rule “B given A”: a measure of how likely B is to occur when A has occurred, expressed as a percentage, with 100% meaning B always occurs if A has occurred.
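To make the two measures concrete, here is a minimal Python sketch computed over an invented toy transaction list (the data and function names are illustrative, not from the text):

```python
# Minimal sketch: support and confidence for a rule A => B over a toy
# transaction list. The transactions below are invented for illustration.
transactions = [
    {"PC", "software", "printer"},
    {"PC", "software"},
    {"printer", "scanner"},
    {"PC", "scanner"},
    {"software"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(A u B) / support(A)."""
    return support(set(antecedent) | set(consequent)) / support(set(antecedent))

print(support({"PC", "software"}))       # 0.4  -> 40% support
print(confidence({"PC"}, {"software"}))  # 0.666... -> ~67% confidence
```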
Data Mining Functionalities (3)
• Classification and prediction: finding models (functions) that describe and distinguish classes or concepts for future prediction, e.g., classify countries based on climate, or classify cars based on gas mileage
• Presentation: decision tree, classification rules, neural network
• Prediction: predict some unknown or missing numerical values
• Cluster analysis: the class label is unknown; group data to form new classes, e.g., cluster houses to find distribution patterns. Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
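As a minimal sketch of the classification idea (classifying cars by gas mileage), assuming scikit-learn is available; the feature columns and labels below are invented for illustration:

```python
# Minimal sketch: learn a decision tree that distinguishes mileage classes.
# Features (engine litres, weight in kg) and labels are invented data.
from sklearn.tree import DecisionTreeClassifier

X = [[1.0, 900], [1.4, 1100], [2.0, 1500], [3.5, 2000]]
y = ["high", "high", "medium", "low"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[1.2, 1000]]))  # expected: ['high']
```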
Data Mining Functionalities (4)
• Outlier analysis — an outlier is a data object that does not comply with the general behavior of the data. It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis.
• Trend and evolution analysis: trend and deviation (regression analysis), sequential pattern mining, periodicity analysis, similarity-based analysis
• Other pattern-directed or statistical analyses
Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
• Suggested approach: human-centered, query-based, focused mining.
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates a hypothesis that the user seeks to confirm.
• Objective vs. subjective interestingness measures:
• Objective: based on statistics and the structure of patterns, e.g., support, confidence, etc.
• Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: completeness — can a data mining system find all the interesting patterns? (association vs. classification vs. clustering)
• Search for only the interesting patterns: optimization — can a data mining system find only the interesting patterns?
• Approaches: first generate all the patterns and then filter out the uninteresting ones, or generate only the interesting patterns (mining query optimization).
Integration of a Data Mining System with a DB or DW System
A critical question in the design of a data mining (DM) system is how to integrate or couple the DM system with a database (DB) system and/or a data warehouse (DW) system. There are four ways to integrate:
i. No coupling
ii. Loose coupling
iii. Semi-tight coupling
iv. Tight coupling
No Coupling
• A DM system does not utilize any function of a DB or DW system.
• It may fetch data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.
• Such a system, though simple, suffers from several drawbacks.
• First, a DB system provides a great deal of flexibility and efficiency in storing, organizing, accessing, and processing data. Without a DB/DW system, a DM system may spend a substantial amount of time finding, collecting, cleaning, and transforming data, whereas in DB/DW systems data tend to be well organized, indexed, cleaned, integrated, or consolidated, so that finding the task-relevant, high-quality data becomes an easy task.
No Coupling (continued)
• Second, there are many tested, scalable algorithms and data structures implemented in DB and DW systems; it is feasible to realize efficient, scalable implementations using such systems.
• Moreover, most data have been or will be stored in DB/DW systems. Without any coupling, a DM system needs other tools to extract data, making it difficult to integrate into an information processing environment.
• Thus, no coupling represents a poor design.
Loose Coupling
• A DM system uses some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
• Loose coupling is better than no coupling because it can fetch any portion of the data stored in databases or data warehouses using query processing, indexing, and other system facilities, and it gains some of the flexibility, efficiency, and other features provided by such systems.
• However, many loosely coupled mining systems are main-memory based. Because mining does not exploit the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.
Semi-Tight Coupling
• Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (frequently encountered data mining functions) are provided in the DB/DW system.
• These primitives can include sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some essential statistical measures, such as sum, count, max, min, standard deviation, and so on.
Tight Coupling
• A DM system is smoothly integrated into the DB/DW system.
• The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of the DB or DW system.
Major Issues in Data Mining
• Mining methodology: mining different kinds of knowledge from diverse data types (e.g., bio, stream, Web); performance (efficiency, effectiveness, and scalability); pattern evaluation (the interestingness problem); incorporation of background knowledge; handling noise and incomplete data; parallel, distributed, and incremental mining methods; integration of discovered knowledge with existing knowledge (knowledge fusion)
• User interaction: data mining query languages and ad hoc mining; expression and visualization of data mining results; interactive mining of knowledge at multiple levels of abstraction
• Applications and social impacts: domain-specific data mining and invisible data mining; protection of data security, integrity, and privacy
Why Data Preprocessing?
• Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results! Quality decisions must be based on quality data, and a data warehouse needs consistent integration of quality data.
Major Tasks in Data Preprocessing
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation
• Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
• Data discretization: part of data reduction, of particular importance for numerical data
Forms of Data Preprocessing
[Figure: forms of data preprocessing]
Data Cleaning
Data cleaning tasks:
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
Missing Data
• Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.
• Missing data may be due to: equipment malfunction; data inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not considered important at the time of entry; failure to register history or changes of the data.
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Use a global constant to fill in the missing value, e.g., “unknown” — but this may form a new class!
• Use the attribute mean to fill in the missing value
• Use the attribute mean of all samples belonging to the same class: smarter
• Use the most probable value: inference-based methods such as a Bayesian formula or a decision tree
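A minimal pandas sketch of three of the strategies above (global constant, attribute mean, and per-class mean); the column names and values are invented:

```python
# Minimal sketch: filling missing values three ways with pandas.
import pandas as pd

df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [30_000, None, 52_000, None, 48_000],
})

filled_const = df["income"].fillna(-1)                    # global constant
filled_mean  = df["income"].fillna(df["income"].mean())   # attribute mean
filled_class = df["income"].fillna(                       # mean per class
    df.groupby("cls")["income"].transform("mean"))
print(filled_class.tolist())  # [30000.0, 30000.0, 52000.0, 50000.0, 48000.0]
```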
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to faulty data collection instruments, data entry problems, data transmission problems, technology limitations, or inconsistency in naming conventions
• Other data problems that require data cleaning: duplicate records, incomplete data, inconsistent data
How to Handle Noisy Data?
• Binning: first sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering: detect and remove outliers
• Combined computer and human inspection: detect suspicious values and have a human check them
• Regression: smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
• Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N. This is the most straightforward method, but outliers may dominate the presentation, and skewed data is not handled well.
• Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples. Good data scaling, though managing categorical attributes can be tricky.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
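The example above can be reproduced with a short Python sketch (simplified: it assumes the values are pre-sorted and divide evenly into the bins):

```python
# Minimal sketch reproducing the equi-depth binning and smoothing above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equi_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of equal size (assumes it divides)."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

bins = equi_depth_bins(prices, 3)

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: snap each value to the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```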
Data Integration
• Data integration: combines data from multiple sources into a coherent store
• Schema integration: integrate metadata from different sources
• Entity identification problem: identify real-world entities across multiple data sources
• Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons include different representations or different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases:
• The same attribute may have different names in different databases
• One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
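A minimal numpy sketch of detecting a derived attribute by correlation analysis; the data and the 0.95 cutoff are illustrative assumptions, not a standard:

```python
# Sketch: flagging a likely derived/redundant attribute via correlation.
import numpy as np

monthly_revenue = np.array([10.0, 12.0, 9.0, 15.0, 11.0])
annual_revenue = monthly_revenue * 12  # a "derived" attribute

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
if abs(r) > 0.95:  # threshold is a judgment call
    print(f"correlation {r:.2f}: candidates for redundancy")  # prints 1.00
```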
Data Transformation
• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling)
• Attribute/feature construction: new attributes constructed from the given ones
Data Transformation: Normalization
• Min-max normalization:
v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A
• Z-score normalization:
v' = (v − mean_A) / stand_dev_A
• Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
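The three formulas translate directly into Python (numpy assumed; the sample values are invented):

```python
# Sketch of the three normalizations; numpy assumed, sample values invented.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

def min_max(v, new_min=0.0, new_max=1.0):
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    # Smallest j with max(|v / 10**j|) < 1, for values of magnitude >= 1.
    j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
    return v / 10 ** j

print(min_max(v))          # [0.    0.125 0.25  0.5   1.   ]
print(decimal_scaling(v))  # [0.02 0.03 0.04 0.06 0.1 ]
```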
Data Reduction Strategies
• A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
• Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
• Data reduction strategies: data cube aggregation, dimensionality reduction, numerosity reduction, discretization and concept hierarchy generation
Discretization and Concept Hierarchy
• Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
• Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
Discretization
Three types of attributes (see https://www.intellspot.com/data-types/):
• Nominal — values from an unordered set
• Ordinal — values from an ordered set
• Continuous — real numbers
Discretization divides the range of a continuous attribute into intervals:
• Some classification algorithms only accept categorical attributes
• Reduces data size
• Prepares the data for further analysis
Discretization and Concept Hierarchy Generation for Numeric Data
• Binning
• Histogram analysis
• Clustering analysis
• Entropy-based discretization (see the sketch below)
• Segmentation by natural partitioning
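A sketch of a single entropy-based split: choose the boundary that minimizes the weighted class entropy of the two halves. The full algorithm recurses on each half with an MDL-based stopping rule; the ages and labels below are invented:

```python
# Minimal sketch: one entropy-based split for discretization.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (weighted entropy, boundary) of the best binary split."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        boundary = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, (w, boundary))
    return best

ages = [23, 25, 30, 35, 40, 52, 60]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes"]
print(best_split(ages, buys))  # (0.0, 32.5): the split separates the classes
```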
Concept Hierarchy Generation for Categorical Data
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes
Specification of a Set of Attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy:
• country — 15 distinct values
• province_or_state — 65 distinct values
• city — 3,567 distinct values
• street — 674,339 distinct values
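This heuristic is easy to sketch with pandas on an invented location table: count the distinct values per attribute and order ascending, so the attribute with the fewest distinct values lands at the top of the hierarchy:

```python
# Sketch: order attributes by distinct-value count; the table is invented
# to mirror the country/province/city/street example above.
import pandas as pd

location = pd.DataFrame({
    "country":           ["CA", "CA", "US", "US", "US"],
    "province_or_state": ["BC", "ON", "NY", "NY", "NY"],
    "city":              ["Vancouver", "Toronto", "Albany", "NYC", "NYC"],
    "street":            ["1 Main St", "2 King St", "3 Elm St",
                          "4 5th Ave", "5 Broadway"],
})

# Fewest distinct values -> top of the hierarchy.
hierarchy = location.nunique().sort_values().index.tolist()
print(hierarchy)  # ['country', 'province_or_state', 'city', 'street']
```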
Data Mining Primitives
• Misconception: data mining systems can autonomously dig out all of the valuable knowledge in a given large database, without human intervention.
• Without user guidance, the system would uncover a set of patterns that might even surpass the size of the database; hence, user interaction is required.
• This communication between the user and the system is provided by a set of data mining primitives.
Why Data Mining Primitives and Languages?
• Task-relevant data: what is the data set I want to mine?
• Type of knowledge to be mined: what kind of knowledge do I want to mine?
• Background knowledge: what background knowledge could be useful here?
• Pattern interestingness measures: what measures can be used to estimate pattern interestingness?
• Visualization of discovered patterns: how do I want the discovered patterns to be presented?
Data Mining Primitives: What Defines a Data Mining Task?
• A popular misconception about data mining is to expect that data mining systems can autonomously dig out all of the valuable knowledge and patterns embedded in a large database, without human intervention or guidance.
• Finding all the patterns autonomously in a database is unrealistic, because the patterns could be too many and largely uninteresting.
• Data mining should be an interactive process in which the user directs what is to be mined.
• Users must be provided with a set of primitives for communicating with the data mining system.
• Incorporating these primitives in a data mining query language yields more flexible user interaction, a foundation for the design of graphical user interfaces, and standardization of data mining industry and practice.
Primitives for Specifying a Data Mining Task
Task-Relevant Data (Minable View)
• The first primitive is the specification of the data on which mining is to be performed.
• Typically, a user is interested in only a subset of the database. It is impractical to mine the entire database, particularly since the number of patterns generated can be exponential in the database size; furthermore, many of the patterns found would be irrelevant to the user’s interests.
• In a relational database, the set of task-relevant data can be collected via a relational query involving operations such as selection, projection, join, and aggregation.
• This retrieval of data can be thought of as a “subtask” of the data mining task. The data collection process results in a new data relation, called the initial data relation.
Task-Relevant Data (Minable View)
• The initial data relation can be ordered or grouped according to the conditions specified in the query.
• The data may be cleaned or transformed (e.g., aggregated on certain attributes) prior to applying data mining analysis.
• This initial relation may or may not correspond to a physical relation in the database. Since virtual relations are called views in the field of databases, the set of task-relevant data for data mining is called a minable view.
Task-Relevant Data (Minable View)
Specifying the data portion to be investigated:
• Database or data warehouse name
• Database tables or data warehouse cubes
• Conditions for data selection
• Relevant attributes or dimensions
• Data grouping criteria
The attributes of interest (relevant attributes) can be specified; the result is the initial data relation, the minable view.
Task-Relevant Data (Minable View) — Example
If a data mining task is to study associations between items frequently purchased at AllElectronics by customers in Canada, the task-relevant data can be specified by providing the following information:
• Name of the database or data warehouse to be used (e.g., AllElectronics_db)
• Names of the tables or data cubes containing the relevant data (e.g., item, customer, purchases, and items_sold)
• Conditions for selecting the relevant data (e.g., retrieve data pertaining to purchases made in Canada for the current year)
• The relevant attributes or dimensions (e.g., name and price from the item table and income and age from the customer table)
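A hedged pandas sketch of assembling such a minable view via join, selection, and projection; the table and column names echo the AllElectronics example, but the data and exact schema are assumptions:

```python
# Sketch: building the minable view with pandas (invented toy data).
import pandas as pd

item = pd.DataFrame({"item_id": [1, 2], "name": ["PC", "printer"],
                     "price": [999, 149]})
customer = pd.DataFrame({"cust_id": [10, 11], "income": [40_000, 75_000],
                         "age": [28, 45], "country": ["Canada", "US"]})
purchases = pd.DataFrame({"cust_id": [10, 10, 11], "item_id": [1, 2, 1],
                          "year": [2024, 2024, 2023]})

minable_view = (
    purchases
    .merge(customer, on="cust_id")                    # join
    .merge(item, on="item_id")                        # join
    .query("country == 'Canada' and year == 2024")    # selection condition
    [["name", "price", "income", "age"]]              # projection
)
print(minable_view)
```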
The Kind of Knowledge to Be Mined
• It is important to specify the kind of knowledge to be mined, as this determines the data mining functions to be performed.
• The kinds of knowledge include concept description (characterization and discrimination), association, classification, prediction, clustering, and evolution analysis.
• In addition to specifying the kind of knowledge to be mined for a given data mining task, the user can be more specific and provide pattern templates that all discovered patterns must match.
The Kind of Knowledge to Be Mined
• These templates, or metapatterns (also called metarules or metaqueries), can be used to guide the discovery process. The use of metapatterns is illustrated in the following example.
• A user studying the buying habits of AllElectronics customers may choose to mine association rules of the form
P(X: customer, W) ^ Q(X, Y) => buys(X, Z)
where X is a key of the customer relation, P and Q are predicate variables, and W, Y, and Z are object variables.
The Kind of Knowledge to Be Mined
The search for association rules is confined to those matching the given metarule, such as
age(X, “30..39”) ^ income(X, “40K..49K”) => buys(X, “VCR”) [2.2%, 60%]
and
occupation(X, “student”) ^ age(X, “20..29”) => buys(X, “computer”) [1.4%, 70%]
The former rule states that customers in their thirties with an annual income of between 40K and 49K are likely (with 60% confidence) to purchase a VCR, and such cases represent about 2.2% of the total number of transactions. The latter rule states that customers who are students and in their twenties are likely (with 70% confidence) to purchase a computer, and such cases represent about 1.4% of the total number of transactions.
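A small sketch of how a metarule can act as a filter: keep only candidate rules whose shape matches P(X, W) ^ Q(X, Y) => buys(X, Z). The rule encoding below is an assumption made purely for illustration:

```python
# Sketch: filtering candidate rules against the metarule
#   P(X, W) ^ Q(X, Y) => buys(X, Z).
# Each rule is encoded as ((pred, value), (pred, value), (head_pred, value));
# this encoding is an illustrative assumption, not a real API.
candidates = [
    (("age", "30..39"), ("income", "40K..49K"), ("buys", "VCR")),
    (("occupation", "student"), ("age", "20..29"), ("buys", "computer")),
    (("age", "20..29"), ("buys", "PC"), ("buys", "software")),  # wrong shape
]

def matches_metarule(rule):
    p, q, head = rule
    # Body: two non-`buys` predicates; head: a `buys` predicate.
    return p[0] != "buys" and q[0] != "buys" and head[0] == "buys"

print([r for r in candidates if matches_metarule(r)])  # keeps the first two
```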
Types of Knowledge to Be Mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks
Summary
• Data preparation is a big issue for both warehousing and mining.
• Data preparation includes data cleaning and data integration, data reduction and feature selection, and discretization.
• Many methods have been developed, but data preparation is still an active area of research.
What Is Data Warehousing?
“A process of transforming data into information and making it available to users in a timely enough manner to make a difference.” [Forrester Research, April 1996]
Data → Information
Very Large Databases
• Terabytes (10^12 bytes): Walmart — 24 terabytes
• Petabytes (10^15 bytes): geographic information systems
• Exabytes (10^18 bytes): national medical records
• Zettabytes (10^21 bytes): weather images
• Yottabytes (10^24 bytes): intelligence agency videos
What Is a Data Warehouse?
“A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.” [Barry Devlin]
Data Warehousing — It Is a Process
• A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible
• A decision support database maintained separately from the organization’s operational database
What Is a Data Warehouse?
• Defined in many different ways, but not rigorously:
• A decision support database that is maintained separately from the organization’s operational database
• Supports information processing by providing a solid platform of consolidated, historical data for analysis
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” — W. H. Inmon
• Data warehousing: the process of constructing and using data warehouses
Data Warehouse — Subject-Oriented
• Organized around major subjects, such as customer, product, and sales
• Focused on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse — Integrated
• Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources (e.g., hotel price: currency, tax, whether breakfast is covered, etc.)
• When data is moved to the warehouse, it is converted.
Data Warehouse — Time-Variant
• The time horizon of the data warehouse is significantly longer than that of operational systems: an operational database holds current-value data, while data warehouse data provide information from a historical perspective (e.g., the past 5–10 years)
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element
Data Warehouse — Non-Volatile
• A physically separate store of data transformed from the operational environment
• Operational updates of data do not occur in the data warehouse environment: it does not require transaction processing, recovery, or concurrency control mechanisms
• Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse vs. Heterogeneous DBMS
• Traditional heterogeneous DB integration builds wrappers/mediators on top of heterogeneous databases — a query-driven approach: when a query is posed at a client site, a meta-dictionary is used to translate it into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set. This involves complex information filtering and competition for resources.
• Data warehouse: update-driven, high performance — information from heterogeneous sources is integrated in advance and stored in the warehouse for direct query and analysis.
Data Warehouse vs. Operational DBMS
• OLTP (on-line transaction processing): the major task of traditional relational DBMSs — day-to-day operations such as purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing): the major task of a data warehouse system — data analysis and decision making
• Distinct features (OLTP vs. OLAP): user and system orientation (customer vs. market), data contents (current, detailed vs. historical, consolidated), database design (ER + application vs. star + subject)
OLTP vs. OLAP
• Users: clerk, IT professional vs. knowledge worker
• Function: day-to-day operations vs. decision support
• DB design: application-oriented vs. subject-oriented
• Data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated
• Usage: repetitive vs. ad hoc
• Access: read/write, index/hash on primary key vs. lots of scans
• Unit of work: short, simple transaction vs. complex query
• Records accessed: tens vs. millions
• Number of users: thousands vs. hundreds
• DB size: 100 MB–GB vs. 100 GB–TB
• Metric: transaction throughput vs. query throughput and response time
Why a Separate Data Warehouse?
• High performance for both systems: the DBMS is tuned for OLTP (access methods, indexing, concurrency control, recovery), while the warehouse is tuned for OLAP (complex OLAP queries, multidimensional views, consolidation)
• Different functions and different data:
• missing data: decision support requires historical data, which operational DBs do not typically maintain
• data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
• data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled
Typical Process Flow Within a Data Warehouse
[Figure: Process flow within a data warehouse — Source → (extract and load) → Warehouse → (query) → Users, with data transformation and movement inside the warehouse and archived data alongside.]
1. Extract and load the data.
2. Clean and transform the data into a form that can cope with large data volumes and provide good query performance.
3. Back up and archive the data.
4. Manage queries and direct them to the appropriate data sources.
Extract and Load Process
1. Controlling the process: determine when to start extracting the data.
2. When to initiate the extract: data should be in a consistent state; start extracting data from a data source only when it represents the same snapshot in time as all the other data sources.
3. Loading the data: do not execute consistency checks until all the data sources have been loaded into the temporary data store.
4. Copy management tools and data cleanup.
Clean and Transform Data
1. Clean and check the data in the following ways:
• Make sure the data is consistent within itself
• Make sure the data is consistent with other data within the same source
• Make sure the data is consistent with the data in the other source systems
• Make sure the data is consistent with the information already in the warehouse
2. Transforming into an effective structure: once the data has been cleaned, convert the source data in the temporary data store into a structure designed to balance query performance and operational cost.
Backup and Archive Process
• The data within the data warehouse is backed up regularly to ensure that the warehouse can always be recovered after data loss, software failure, or hardware failure.
Query Management Process
• The system process that manages queries and speeds them up by directing them to the most effective data source
• Directing queries to the suitable tables
• Maximizing system resources
• Query capture: query profiles change on a regular basis; to accurately monitor and understand what the new query profiles are, it can be very effective to capture the physical queries being executed.
Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse:
• Top-down view: allows selection of the relevant information necessary for the data warehouse
• Data source view: exposes the information being captured, stored, and managed by operational systems
• Data warehouse view: consists of fact tables and dimension tables
• Business query view: sees the perspectives of the data in the warehouse from the viewpoint of the end user
Data Warehouse Design Process
• Top-down, bottom-up, or a combination of both: top-down starts with overall design and planning (mature); bottom-up starts with experiments and prototypes (rapid)
• From a software engineering point of view: waterfall (structured and systematic analysis at each step before proceeding to the next) or spiral (rapid generation of increasingly functional systems with short turnaround time)
• Typical data warehouse design process:
1. Choose a business process to model, e.g., orders, invoices, etc.
2. Choose the grain (atomic level of data) of the business process
3. Choose the dimensions that will apply to each fact table record
4. Choose the measures that will populate each fact table record
Multi-Tiered Architecture
[Figure: Data sources (operational DBs, other sources) → extract/transform/load/refresh through a monitor & integrator, with metadata → data storage (data warehouse and data marts) → OLAP server/engine → front-end tools (analysis, query, reports, data mining).]
Three Data Warehouse Models
• Enterprise warehouse: collects all of the information about subjects spanning the entire organization
• Data mart: a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart. Data marts can be independent or dependent (sourced directly from the warehouse).
• Virtual warehouse: a set of views over operational databases; only some of the possible summary views may be materialized
Data Warehouse Development: A Recommended Approach
[Figure: Define a high-level corporate data model, then build data marts (distributed data marts), refining the model at each step, leading to a multi-tier data warehouse / enterprise data warehouse.]
DMQL — A Data Mining Query Language
• Motivation: a DMQL can provide the ability to support ad hoc and interactive data mining
• By providing a standardized language like SQL, we hope to achieve an effect similar to that of SQL on relational databases: a foundation for system development and evolution, and a facilitator of information exchange, technology transfer, commercialization, and wide acceptance
Integration of Data Mining and Data Warehousing
• Coupling of data mining systems with DBMS and data warehouse systems: no coupling, loose coupling, semi-tight coupling, tight coupling
• On-line analytical mining of data: integration of mining and OLAP technologies
• Interactive mining of multi-level knowledge: the necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions: e.g., characterized classification, or first clustering and then association
Coupling Data Mining with DB/DW Systems
• No coupling: flat-file processing; not recommended
• Loose coupling: fetching data from the DB/DW
• Semi-tight coupling: enhanced DM performance; provides efficient implementations of a few data mining primitives in a DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
• Tight coupling: a uniform information processing environment; DM is smoothly integrated into a DB/DW system, and mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
Architecture: Typical Data Mining System
[Figure, bottom to top: data sources (database, data warehouse, World Wide Web, other information repositories) → data cleaning, integration, and selection → database or data warehouse server → data mining engine and pattern evaluation, both consulting a knowledge base → graphical user interface.]
10 Open Source ETL Tools
https://www.datasciencecentral.com/profiles/blogs/10-open-source-etl-tools
Thank You!
In our next session: