From Data Mining to Knowledge Discovery in
Databases
Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth
AI Magazine Volume 17 Number 3 (1996) (© AAAI)
Presented by: Raj Kumar Ranabhat
M.E in Computer Engineering(I/II)
Kathmandu University
2/4/2018 1

Table of Contents:
1. Introduction
2. Why Do We Need KD?
3. Data Mining and Knowledge Discovery in the Real World
4. Basic Definitions
5. The KD Process
6. The Data-Mining Step of the KD Process
6.1 Data-Mining Methods
6.2 The Components of Data-Mining Algorithms

Contd...
7. Some Data-Mining Methods
1. Decision Trees and Rules
2. Nonlinear Regression and Classification Methods
3. Example-Based Methods
4. Probabilistic Graphic Dependency Models
8. Research and Application Challenges
9. Conclusion

1. Introduction
• Across a wide variety of fields, data are being collected and accumulated at a
dramatic pace
• There is an urgent need to extract useful information (knowledge) from the
rapidly growing volumes of digital data
• The knowledge discovery (KD) field is concerned with the development of
methods and techniques for making sense of data
• The KD process maps low-level data into other forms that might be more
compact, more abstract, or more useful

2. Why Do We Need KD?
• The traditional method of turning data into knowledge relies on manual analysis
and interpretation
• E.g., in the health-care industry:
• Specialists periodically analyze current trends and changes in health-care data
• The specialists then provide a report detailing the analysis to the health-care
organization
• This report becomes the basis for future decision making and planning for
health-care management
• For these (and many other) applications, this form of manual probing of a data set
is slow, expensive, and highly subjective

Contd...
• As data volumes grow dramatically, this type of manual data analysis is becoming
completely impractical in many domains
• Computational techniques are needed to unearth meaningful patterns and structures
from the massive volumes of data
• KD is an attempt to address a problem that the digital information era made a fact of
life for all of us: data overload
• Businesses use KD to gain competitive advantage, increase efficiency, and provide
more valuable services to customers

3. Data Mining and KD in the Real World
• KD applications have been deployed on large-scale real-world problems in
science and in business
• Eg. SKICAT, a system used by astronomers to perform image analysis, cataloging
and classification of sky objects from sky-survey images
• Used to process 3 terabytes (3 × 10^12 bytes) of image data
• It is estimated that on the order of 10^9 sky objects are detectable
• SKICAT can outperform humans and traditional computational techniques in
classifying faint sky objects

Contd...
• KD application areas :
1. Marketing :
• Analyze customer databases to identify different customer groups and forecast
their behavior
• E.g., if a customer bought X, he or she is also likely to buy Y and Z
2. Investment :
• Numerous companies use data mining for investment
• Eg. LBS Capital Management
• Its system uses expert systems, neural nets, and genetic algorithms to manage
portfolios totaling $600 million

Contd...
3. Fraud detection :
• HNC Falcon and Nestor PRISM systems are used for monitoring credit card
fraud, watching over millions of accounts
• The FAIS system is used to identify financial transactions that might indicate
money laundering activity
4. Manufacturing :
• The CASSIOPEE troubleshooting system is used to diagnose and predict
problems for the Boeing 737
• To derive families of faults, clustering methods are used
• CASSIOPEE received the European first prize for innovative applications

Contd...
5. Telecommunications :
• The telecommunications alarm-sequence analyzer (TASA) locates frequently
occurring alarm episodes in the alarm stream and presents them as rules
6. Data cleaning :
• The MERGE-PURGE system was applied to the identification of duplicate
welfare claims
7. Other areas :
• IBM’s ADVANCED SCOUT helps National Basketball Association (NBA)
coaches organize and interpret data from NBA games

4. Basic Definitions
• KD is the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data
• Data are a set of facts
• A pattern is an expression in some language describing a subset of the data or a
model applicable to the subset
• Process implies steps such as data preparation, search for patterns, knowledge
evaluation, and refinement
• Data mining is a step in the KD process that consists of applying data analysis
and discovery algorithms that produce patterns (or models) over the data

5. The KD Process
• The KD process is interactive and iterative, involving numerous steps
1. Identifying the goal
• Understanding of the application domain
• Relevant prior knowledge
2. Creating a target data set
• Selecting a data set or data samples, on which discovery is to be performed
3. Data cleaning and preprocessing
• Removing noise if appropriate
• Deciding on strategies for handling missing data fields

Contd...
4. Data reduction and projection
• Finding useful features to represent the data depending on the goal of the task
• With dimensionality reduction methods, the effective number of variables
under consideration can be reduced
5. Exploratory analysis and model and hypothesis selection
• Choosing the data-mining algorithm(s) and selecting method(s) to be used for
searching for data patterns
6. Data Mining
• Searching for patterns of interest in a particular representational form

Contd...
7. Interpreting mined patterns
• Visualization of the extracted patterns
8. Implementation
• Using the knowledge directly
• Incorporating the knowledge into another system for further action
• Simply documenting it
• Reporting it to interested parties
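The steps above can be sketched end to end on a toy data set. This is only an illustration under stated assumptions, not a method from the paper: the records, the field names, the mean imputation, and the single-threshold rule are all made up to keep the example short.

```python
# Hypothetical raw records; None marks a missing field.
raw = [
    {"id": 1, "age": 25, "income": 30, "bought": 0},
    {"id": 2, "age": 47, "income": None, "bought": 1},
    {"id": 3, "age": 52, "income": 80, "bought": 1},
    {"id": 4, "age": 23, "income": 28, "bought": 0},
    {"id": 5, "age": 44, "income": 75, "bought": 1},
]

def select(records, fields):
    """Steps 2 and 4: keep only the listed fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records, field):
    """Step 3: fill missing values with the mean of the observed ones."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = sum(observed) / len(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def mine(records, feature, target):
    """Step 6: find the threshold on `feature` that best predicts `target`."""
    best = None
    for t in sorted({r[feature] for r in records}):
        hits = sum((r[feature] > t) == bool(r[target]) for r in records)
        acc = hits / len(records)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

def interpret(feature, target, threshold, accuracy):
    """Step 7: present the mined pattern in a human-readable form."""
    return f"IF {feature} > {threshold} THEN {target} (training accuracy {accuracy:.0%})"

target_set = select(raw, ["age", "income", "bought"])   # step 2: target data set
cleaned = clean(target_set, "income")                   # step 3: cleaning
reduced = select(cleaned, ["income", "bought"])         # step 4: reduction
threshold, accuracy = mine(reduced, "income", "bought") # step 6: data mining
rule = interpret("income", "bought", threshold, accuracy)
print(rule)
```

In practice the process is iterative: a poor rule at step 7 would send the analyst back to earlier steps, for example to choose other features or another cleaning strategy.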

Contd...
Figure 1: An Overview of the Steps That Compose the KD Process

6. The Data-Mining Step of the KD Process
• KD Goals :
1. Verification : The system is limited to verifying the user’s hypothesis
2. Discovery : The system autonomously finds new patterns
• Prediction : The system finds patterns for predicting the future
behavior of some entities
• Description : The system finds patterns for presentation to a user in a human-
understandable form
• Data mining involves fitting models to, or determining patterns from, observed data

6.1 Data-Mining Methods
• Primary Goals of Data Mining
1. Prediction : Uses some variables or fields in the database to predict unknown
or future values of other variables of interest
2. Description : Finds human-interpretable patterns describing the data
• Data-mining methods:
• Classification
• Regression
• Clustering
• Summarization
• Dependency Modeling
• Change and deviation detection

Contd...
1. Classification :
• It is learning a function that maps (classifies) a data item into one of several
predefined classes
• Fraud detection and credit risk applications are particularly well suited to this
type of analysis
• Types of classification models
1. Classification by decision tree induction
2. Bayesian Classification
3. Neural Networks
4. Support Vector Machines (SVM)

Contd...
Figure 2: A Simple Linear Classification Boundary for the Loan Data Set. The
shaded region denotes class no loan
• x’s represent persons who have
defaulted on their loans
• o’s represent persons whose
loans are in good status with the
bank
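A linear boundary like the one in Figure 2 can be learned in several ways. The sketch below uses a perceptron, which is one simple option and not necessarily the method behind the figure; the (income, debt) values and the labels are invented for illustration.

```python
# Hypothetical loan data: ((income, debt), label), +1 = good status, -1 = defaulted.
DATA = [
    ((8, 2), 1), ((9, 3), 1), ((7, 1), 1), ((10, 4), 1),
    ((2, 8), -1), ((3, 9), -1), ((1, 7), -1), ((4, 10), -1),
]

def train_perceptron(data, epochs=100):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in data:
            score = w[0] * x1 + w[1] * x2 + b
            if y * score <= 0:          # misclassified (or on the boundary)
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:               # every training point is on the right side
            break
    return w, b

def predict(w, b, x):
    """Return +1 (good status) or -1 (default) for a point x = (income, debt)."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

w, b = train_perceptron(DATA)
print(w, b, predict(w, b, (6, 1)), predict(w, b, (2, 6)))
```

The learned line w[0]*income + w[1]*debt + b = 0 plays the role of the boundary in the figure: points on one side are assigned to one class and points on the other side to the other class.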

Contd...
2. Regression :
• It is learning a function that maps a data item to a real-valued prediction variable
• Linear regression, the simplest case, establishes a relationship between a dependent
variable (Y) and one or more independent variables (X) using a best-fit straight line
• It is represented by the equation Y = a + b*X + e
• a is the intercept, b is the slope of the line, and e is the error term
• This equation can be used to predict the value of target variable based on
given predictor variable(s)
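The intercept a and slope b of the best-fit line can be computed in closed form by ordinary least squares. The sketch below assumes a single predictor; the height and weight values are hypothetical and chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*X; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

heights_cm = [150, 160, 170, 180]   # predictor X (hypothetical)
weights_kg = [50, 58, 66, 74]       # target Y (hypothetical)

a, b = fit_line(heights_cm, weights_kg)
predicted = a + b * 175             # predicted weight for a height of 175 cm
print(a, b, predicted)
```

With real, noisy data the points would not fall exactly on the line, and the error term e would capture the residual for each observation.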

Contd...
Figure 3: A Simple Linear Regression for the Weight and Height Data Set
https://www.analyticsvidhya.com/wp-content/uploads/2015/08/Linear_Regression1.png

Contd...
• E.g.
1. Estimating the probability that a patient will survive given the results of a set
of diagnostic tests
2. Predicting the amount of biomass present in a forest given remotely sensed
microwave measurements
• Types of regression methods
1. Linear Regression
2. Multivariate Linear Regression
3. Nonlinear Regression
4. Multivariate Nonlinear Regression

Contd...
3. Clustering :
• Clustering is the identification of groups (clusters) of similar objects
• Clustering can identify dense and sparse regions in object space and can
discover overall distribution pattern and correlations among data attributes
• Types of Clustering models
1. Partitioning Methods
2. Hierarchical (agglomerative or divisive) methods
3. Density based methods
4. Grid-based methods
5. Model-based methods
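As an example of a partitioning method, the sketch below implements a basic k-means loop. The (age, purchase power) points and the starting centroids are assumptions made for illustration; real implementations usually also scale the features and choose starting centroids more carefully.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:        # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical (age, purchase power) observations.
points = [
    (20, 1.0), (22, 2.0), (21, 1.5),
    (40, 8.0), (42, 9.0), (41, 8.5),
    (60, 4.0), (62, 5.0), (61, 4.5),
]
labels, centroids = k_means(points, [points[0], points[3], points[6]])
print(labels)
print(centroids)
```

Unlike classification, no class labels are given in advance: the three groups emerge only from the distances between the points.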

Contd...
Figure 4: A Simple Clustering of the Age and Purchase Power Data Set into Three Clusters

Contd...
4. Summarization :
• It involves methods for finding a compact description for a subset of data
• E.g.
• Tabulating the mean and standard deviations for all fields
• Discovery of functional relationships between variables
• Summarization techniques are often applied to interactive exploratory data
analysis and automated report generation
5. Change and deviation detection :
• Focuses on discovering the most significant changes in the data from previously
measured or normative values
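Both ideas can be shown in a few lines. The sketch below tabulates the mean and standard deviation of each field, then flags new values that lie far from those normative values. The field name, the numbers, and the three-standard-deviation cutoff are assumptions made for illustration.

```python
from statistics import mean, stdev

def summarize(table):
    """Summarization: mean and sample standard deviation for every field."""
    return {field: (mean(values), stdev(values)) for field, values in table.items()}

def deviations(values, norm_mean, norm_std, z=3.0):
    """Deviation detection: values more than z standard deviations from the norm."""
    return [v for v in values if abs(v - norm_mean) > z * norm_std]

# Hypothetical baseline measurements for one field.
baseline = {"daily_sales": [10, 12, 11, 9, 10, 11, 12, 9, 10, 11]}

summary = summarize(baseline)
norm_mean, norm_std = summary["daily_sales"]

# New measurements compared against the normative values.
flagged = deviations([10, 11, 25], norm_mean, norm_std)
print(summary, flagged)
```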

Contd...
6. Dependency modeling :
• Consists of finding a model that describes significant dependencies between
variables
• Dependency models exist at two levels :
• Structural level: specifies (often in graphic form) which variables are locally
dependent on each other
• Quantitative level: specifies the strengths of the dependencies using some
numeric scale
• E.g., based on historical sales data, retailers might find out that customers often
buy cookies when they buy beer
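One common way to quantify such a dependency is with the support and confidence of an association rule. These measures are an assumption here (the slides do not name a specific measure), and the transactions are invented.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"beer", "cookies"},
    {"beer", "cookies", "milk"},
    {"beer", "chips"},
    {"milk", "cookies"},
    {"beer", "cookies", "chips"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

rule_support = support(TRANSACTIONS, {"beer", "cookies"})
rule_confidence = confidence(TRANSACTIONS, {"beer"}, {"cookies"})
print(rule_support, rule_confidence)
```

Which items appear together corresponds to the structural level of the dependency; the confidence value corresponds to the quantitative level, the strength of that dependency on a numeric scale.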

6.2 The Components of Data-Mining Algorithms
• Three primary components in any data-mining algorithm:
1. Model representation : It is the language used to describe discoverable patterns
2. Model-evaluation criteria : Estimate how well a particular pattern (a model
and its parameters) meets the goals of the KD process
3. Search method : consists of two components
1. Parameter search :
• It searches for the parameters which optimize the model evaluation criteria
given observed data and a fixed model representation
2. Model search :
• It occurs as a loop over the parameter search method
• The model representation is changed so that a family of models is considered
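The three components can be made concrete with a deliberately small example. In this sketch, which is an assumption for illustration only, the model representation is a single-threshold rule on one attribute, the evaluation criterion is training accuracy, parameter search picks the threshold, and model search loops over the attributes. The loan records are invented.

```python
# Hypothetical loan records; good = 1 means the loan is in good status.
DATA = [
    {"income": 2, "debt": 5, "good": 0},
    {"income": 3, "debt": 2, "good": 0},
    {"income": 4, "debt": 6, "good": 0},
    {"income": 6, "debt": 3, "good": 1},
    {"income": 7, "debt": 7, "good": 1},
    {"income": 8, "debt": 4, "good": 1},
]

def accuracy(records, attr, threshold):
    """Model-evaluation criterion: fraction of records the rule gets right.
    The rule predicts good status when records[attr] > threshold."""
    hits = sum((r[attr] > threshold) == bool(r["good"]) for r in records)
    return hits / len(records)

def parameter_search(records, attr):
    """Fixed representation (a rule on `attr`): find the best threshold."""
    candidates = sorted({r[attr] for r in records})
    return max(((t, accuracy(records, attr, t)) for t in candidates),
               key=lambda pair: pair[1])

def model_search(records, attrs):
    """Loop over representations, running the parameter search for each."""
    best = None
    for attr in attrs:
        threshold, acc = parameter_search(records, attr)
        if best is None or acc > best[2]:
            best = (attr, threshold, acc)
    return best

best = model_search(DATA, ["income", "debt"])
print(best)
```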

7. Some Data-Mining Methods
1. Decision Trees and Rules :
• An internal node is a test on an attribute
• A branch represents an outcome of the test, e.g., Color=red
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split training examples into distinct
classes as much as possible
• A new instance is classified by following a matching path to a leaf node

Figure 5: Weather Data
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
Contd...

Figure 6: Weather Data Tree
Outlook
  sunny -> Humidity
    high -> No
    normal -> Yes
  overcast -> Yes
  rain -> Windy
    true -> No
    false -> Yes
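The weather example can be checked in code. The sketch below assumes information gain as the criterion for choosing the attribute to split on (the slides only say that one attribute is chosen to separate the classes as much as possible), and it encodes the tree of Figure 6 as nested tuples and dictionaries, which is a representation chosen here for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]
ROWS = [  # the weather data of Figure 5; the last column is Play?
    ("sunny", "hot", "high", "false", "No"), ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"), ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"), ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"), ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"), ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"), ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"), ("rain", "mild", "high", "true", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, i):
    """Reduction in class entropy obtained by splitting on attribute i."""
    remainder = 0.0
    for value in {r[i] for r in rows}:
        subset = [r[-1] for r in rows if r[i] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - remainder

gains = {attr: information_gain(ROWS, i) for i, attr in enumerate(ATTRS)}
root = max(gains, key=gains.get)   # attribute chosen for the root test

# The tree of Figure 6: (attribute, {outcome: subtree or class label}).
TREE = ("Outlook", {
    "sunny": ("Humidity", {"high": "No", "normal": "Yes"}),
    "overcast": "Yes",
    "rain": ("Windy", {"true": "No", "false": "Yes"}),
})

def classify(tree, instance):
    """Follow the matching branch at each internal node until a leaf is reached."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[instance[attr]]
    return tree

predictions = [classify(TREE, dict(zip(ATTRS, row))) for row in ROWS]
print(root, predictions)
```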

2. Nonlinear Regression and Classification Methods :
• These are techniques for prediction that fit linear and nonlinear combinations of
basis functions to combinations of the input variables
• Eg. feedforward neural networks, adaptive spline methods, and projection
pursuit regression
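A tiny feedforward network shows why nonlinear combinations matter. In the sketch below, the two hidden units act as nonlinear basis functions of the inputs and the output unit combines them linearly before thresholding. The weights are hand-picked for illustration rather than learned, and the XOR task is an assumed example of a pattern that no single linear boundary can separate.

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Two-layer feedforward network with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)      # hidden unit 1: fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)     # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5) # output: OR but not AND, i.e. XOR

outputs = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(outputs)
```

In a real application the weights would be fitted to data, which is the parameter search step for this model representation.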
Contd...

Figure 7: An Example of Classification Boundaries Learned by a Nonlinear
Classifier (Such as a Neural Network) for the Loan Data Set
Contd...

3. Example-Based Methods :
• Predictions on new examples are derived from the properties of similar examples
in the model whose prediction is known
• Eg. Nearest-neighbor classification and regression algorithms and case-based
reasoning systems
• Disadvantages:
• A well-defined distance metric for evaluating the distance between data points is
required
• E.g., if the variables were loan, sex, and profession, it would require more
effort to define a sensible metric between them
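A nearest-neighbor classifier is short to write once a distance metric is fixed. The sketch below assumes numeric (income, debt) features, Euclidean distance, and k = 3; the training examples are invented.

```python
import math
from collections import Counter

# Hypothetical stored examples: ((income, debt), known outcome).
TRAIN = [
    ((8, 2), "good"), ((9, 3), "good"), ((7, 1), "good"), ((10, 4), "good"),
    ((2, 8), "default"), ((3, 9), "default"), ((1, 7), "default"), ((4, 10), "default"),
]

def knn_predict(train, query, k=3):
    """Predict the majority label among the k stored examples closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(TRAIN, (6, 1)), knn_predict(TRAIN, (2, 6)))
```

The call to math.dist is where the metric enters. For categorical variables such as profession there is no obvious numeric distance, which is the difficulty noted above.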
Contd...

Figure 8: Classification Boundaries for a Nearest-Neighbor
Classifier for the Loan Data Set
Contd...

4. Probabilistic Graphic Dependency Models :
• They specify probabilistic dependencies between variables using a graph structure
• These models were initially developed within the framework of probabilistic
expert systems
• Model-evaluation criteria are typically Bayesian in form
• Parameter estimation can be a mixture of closed-form estimates and iterative
methods depending on whether a variable is directly observed or hidden
• Although still primarily in the research phase, these models are of particular
interest to KD because the graphic form of the model lends itself easily to
human interpretation
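The smallest possible graph has two nodes and one edge. The sketch below assumes a hypothetical Fraud -> Alarm model with made-up probabilities, factors the joint distribution along the edge, and applies Bayes' rule to reason from the observed variable back to the hidden one.

```python
# Hypothetical two-node model: Fraud -> Alarm. All numbers are assumed.
P_FRAUD = 0.01                               # prior P(Fraud = true)
P_ALARM_GIVEN = {True: 0.90, False: 0.05}    # P(Alarm = true | Fraud)

def joint(fraud, alarm):
    """P(Fraud, Alarm) factored along the graph as P(Fraud) * P(Alarm | Fraud)."""
    p_f = P_FRAUD if fraud else 1 - P_FRAUD
    p_a = P_ALARM_GIVEN[fraud] if alarm else 1 - P_ALARM_GIVEN[fraud]
    return p_f * p_a

def posterior_fraud_given_alarm():
    """Bayes' rule: P(Fraud = true | Alarm = true)."""
    evidence = joint(True, True) + joint(False, True)
    return joint(True, True) / evidence

posterior = posterior_fraud_given_alarm()
print(round(posterior, 4))
```

Larger models follow the same pattern: each node stores a conditional probability table given its parents, and the joint distribution is the product of those tables.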

8. Research and Application Challenges
1. Larger Databases :
• Databases with hundreds of fields and tables and millions of records and of a
multi gigabyte size are beginning to appear
• Possible solutions :
• More efficient algorithms, sampling, approximation, and massively parallel
processing
2. High Dimensionality :
• There can also be a large number of fields (attributes, variables) hence the
dimensionality of the problem is high

• A high-dimensional data set creates problems in terms of increasing the size of
the search space for model induction
• It increases the chances that a data-mining algorithm will find spurious
patterns
3. Overfitting
• It is a modeling error which occurs when a function is too closely fit to a limited
set of data points
• It results in poor performance of the model on test data
• Possible solutions: cross-validation, regularization, and other sophisticated
statistical strategies
Contd...

Figure 9: Overfitting
https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/300px-Overfitting.svg.png
Contd...
• The green line represents an
overfitted model and the black
line represents a regularised
model
• While the green line best follows
the training data, it is likely to
have a higher error rate on new
unseen data
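Cross-validation makes this effect measurable by scoring a model only on data it was not fitted to. The sketch below is an illustration under assumptions: the data are a noisy line with fixed, made-up noise, the overfitted model is a "memorizer" that returns the target of the closest training point, and the simpler model is a least-squares line.

```python
def fit_line(xs, ys):
    """Least-squares line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_memorizer(xs, ys):
    """Overfitted model: returns the stored target of the closest training x."""
    table = list(zip(xs, ys))
    return lambda x: min(table, key=lambda pair: abs(pair[0] - x))[1]

def k_fold_splits(n, k):
    """Yield (train, test) index lists; assumes n is divisible by k."""
    indices = list(range(n))
    size = n // k
    for f in range(k):
        yield indices[:f * size] + indices[(f + 1) * size:], indices[f * size:(f + 1) * size]

def cross_val_mse(xs, ys, fit, k=5):
    """Mean squared error measured only on held-out points."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += [(model(xs[i]) - ys[i]) ** 2 for i in test]
    return sum(errors) / len(errors)

noise = [0.5, -0.5, 0.3, -0.3, 0.4, -0.4, 0.2, -0.2, 0.1, -0.1]   # fixed, made up
xs = list(range(10))
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

memorizer = fit_memorizer(xs, ys)
train_mse_memorizer = sum((memorizer(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
cv_mse_memorizer = cross_val_mse(xs, ys, fit_memorizer)
cv_mse_line = cross_val_mse(xs, ys, fit_line)
print(train_mse_memorizer, cv_mse_memorizer, cv_mse_line)
```

The memorizer reproduces the training data perfectly, yet its held-out error is larger than that of the simple line, which is the situation the figure describes.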

4. Changing data and knowledge :
• Rapidly changing (nonstationary) data can make previously discovered patterns
invalid
• The variables measured in a given application database can be modified, deleted,
or augmented with new measurements over time
• Possible solutions
• Incremental methods for updating the patterns
• Treating change as an opportunity for discovery by using it to cue the search
for patterns of change only
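An incremental method updates a pattern as each new record arrives instead of recomputing it from the full database. The sketch below uses Welford's running mean and variance as one simple example; treating a summary statistic as the "pattern" and the sample values are assumptions made for illustration.

```python
from statistics import mean, variance

class RunningStats:
    """Welford's algorithm: update mean and variance one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the values seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stream = [4.0, 7.0, 13.0, 16.0, 10.0]   # hypothetical measurements arriving over time
stats = RunningStats()
for value in stream:
    stats.update(value)                  # no need to store or rescan old records

print(stats.mean, stats.variance, mean(stream), variance(stream))
```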
Contd...

5. Missing and noisy data :
• This problem is especially acute in business databases
• U.S. census data reportedly have error rates as great as 20 percent in some fields
• Important attributes can be missing if the database was not designed with
discovery in mind
• Possible solutions: more sophisticated statistical strategies to identify hidden
variables and dependencies
Contd...

6. Understandability of patterns :
• It is important to make the discoveries more understandable by humans
• Possible solutions
• Graphic representations, rule structuring, natural language generation, and
techniques for visualization of data and knowledge
• Rule-refinement strategies can be used to address a related problem: the
discovered knowledge might be implicitly or explicitly redundant
7. Complex relationships between fields :
• Early data-mining algorithms were developed for simple attribute-value records
• New techniques for deriving relations between variables are being developed
Contd...

• Hierarchically structured attributes or values, relations between attributes, and
more sophisticated means for representing knowledge about the contents of a
database will require algorithms that can effectively use such information
8. User interaction and prior knowledge
• Current KD methods and tools are not truly interactive
• They cannot easily incorporate prior knowledge about a problem except in simple
ways
• The use of domain knowledge is important in all the steps of the KD process
• Bayesian approaches use prior probabilities over data and distributions as one
form of encoding prior knowledge
Contd...

9. Integration with other systems :
• A standalone discovery system might not be very useful
• Typical integration issues include integration with a database management
system, integration with spreadsheets and visualization tools, and
accommodating real-time sensor readings

9. Conclusion
1. Some definitions of basic notions in the KD field were presented
2. The relation between knowledge discovery and data mining was clarified
3. A brief overview of the KD process and basic data-mining methods was provided
4. Although various algorithms and applications might appear quite different on the
surface, they share many common components
5. Understanding data mining and model induction at this component level makes it
easier for the user to understand its overall applicability to the KD process
6. A common framework for the overall goals and methods used in KD
was provided