From Data Mining to Knowledge Discovery in
Databases
Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth
AI Magazine Volume 17 Number 3 (1996) (© AAAI)
Presented by: Raj Kumar Ranabhat
M.E in Computer Engineering(I/II)
Kathmandu University
2/4/2018 1
Table of Contents:
1. Introduction
2. Why Do We Need KD?
3. Data Mining and Knowledge Discovery in the Real World
4. Basic Definitions
5. The KD Process
6. The Data-Mining Step of the KD Process
1. Data Mining Methods
2. The Components of Data Mining Algorithms
7. Some Data-Mining Methods
1. Decision Trees and Rules
2. Nonlinear Regression and Classification Methods
3. Example-Based Methods
4. Probabilistic Graphic Dependency Models
8. Research and Application Challenges
9. Conclusion
1. Introduction
• Across a wide variety of fields, data are being collected and accumulated at a
dramatic pace
• There is an urgent need to extract useful information (knowledge) from the
rapidly growing volumes of digital data
• The knowledge discovery (KD) field is concerned with the development of
methods and techniques for making sense of data
• The KD process maps low-level data into other forms that might be more
compact, more abstract, or more useful
2. Why Do We Need KD?
• The traditional method of turning data into knowledge relies on manual analysis
and interpretation
• E.g., in the health-care industry:
• Specialists periodically analyze current trends and changes in health-care data
• The specialists then provide a report detailing the analysis to the health-care
organization
• This report becomes the basis for future decision making and planning for
health-care management
• For these (and many other) applications, this form of manual probing of a data set
is slow, expensive, and highly subjective
• As data volumes grow dramatically, this type of manual data analysis is becoming
completely impractical in many domains
• Computational techniques are needed to unearth meaningful patterns and structures from the
massive volumes of data
• KD is an attempt to address a problem that the digital information era made a fact of
life for all of us: data overload
• Businesses use KD to gain competitive advantage, increase efficiency, and provide
more valuable services to customers
3. Data Mining and KD in the Real World
• KD applications have been deployed on large-scale real-world problems in
science and in business
• E.g., SKICAT, a system used by astronomers to perform image analysis, cataloging,
and classification of sky objects from sky-survey images
• Used to process the 3 terabytes (10^12 bytes) of image data
• It is estimated that on the order of 10^9 sky objects are detectable
• SKICAT can outperform humans and traditional computational techniques in
classifying faint sky objects
• KD application areas :
1. Marketing :
• Analyze customer databases to identify different customer groups and forecast
their behavior
• E.g., if a customer bought X, he/she is also likely to buy Y and Z
2. Investment :
• Numerous companies use data mining for investment
• E.g., LBS Capital Management
• Its system uses expert systems, neural nets, and genetic algorithms to manage
portfolios totaling $600 million
3. Fraud detection :
• HNC Falcon and Nestor PRISM systems are used for monitoring credit card
fraud, watching over millions of accounts
• The FAIS system is used to identify financial transactions that might indicate
money laundering activity
4. Manufacturing :
• The CASSIOPEE troubleshooting system is used to diagnose and predict
problems for the Boeing 737
• To derive families of faults, clustering methods are used
• CASSIOPEE received the European first prize for innovative applications
5. Telecommunications :
• The telecommunications alarm-sequence analyzer (TASA) locates frequently
occurring alarm episodes in the alarm stream and presents them as rules
6. Data cleaning :
• The MERGE-PURGE system was applied to the identification of duplicate
welfare claims
7. Other areas :
• IBM’s ADVANCED SCOUT helps National Basketball Association (NBA)
coaches organize and interpret data from NBA games
4. Basic Definitions
• KD is the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data
• Data are a set of facts
• A pattern is an expression in some language describing a subset of the data or a
model applicable to the subset
• Process implies multiple steps, such as data preparation, search for patterns,
knowledge evaluation, and refinement
• Data mining is a step in the KD process that consists of applying data analysis
and discovery algorithms to produce patterns (or models) over the data
5. The KD Process
• The KD process is interactive and iterative, involving numerous steps
1. Identifying the goal
• Understanding of the application domain
• Relevant prior knowledge
2. Creating a target data set
• Selecting a data set or data samples, on which discovery is to be performed
3. Data cleaning and preprocessing
• Removing noise if appropriate
• Deciding on strategies for handling missing data fields
4. Data reduction and projection
• Finding useful features to represent the data depending on the goal of the task
• With dimensionality reduction methods, the effective number of variables
under consideration can be reduced
5. Exploratory analysis and model and hypothesis selection
• Choosing the data-mining algorithm(s) and selecting method(s) to be used for
searching for data patterns
6. Data Mining
• Searching for patterns of interest in a particular representational form
7. Interpreting mined patterns
• Visualization of the extracted patterns
8. Implementation
• Using the knowledge directly
• Incorporating the knowledge into another system for further action
• Simply documenting it
• Reporting it to interested parties
Figure 1: An Overview of the Steps That Compose the KD Process
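The steps listed above can be sketched as a chain of small functions. This is only a toy illustration under stated assumptions: the records, the field names (`age`, `income`, `city`), the age threshold of 40, and the "pattern" being mined (mean income per age group) are all invented for the example and do not come from the paper.

```python
from statistics import mean

# Hypothetical raw records; field names and values are invented for illustration.
RAW = [
    {"id": 1, "age": 25, "income": 30, "city": "A"},
    {"id": 2, "age": None, "income": 40, "city": "B"},
    {"id": 3, "age": 45, "income": 70, "city": "A"},
    {"id": 4, "age": 52, "income": 90, "city": "C"},
    {"id": 5, "age": 31, "income": 50, "city": "B"},
]

def select(records, fields):
    """Step 2: create a target data set with only the fields of interest."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Step 3: one simple missing-data strategy, dropping incomplete records."""
    return [r for r in records if all(v is not None for v in r.values())]

def reduce_features(records):
    """Step 4: project age onto a coarser feature (an age group)."""
    return [
        {"group": "under 40" if r["age"] < 40 else "40 and over",
         "income": r["income"]}
        for r in records
    ]

def mine(records):
    """Step 6: search for a pattern, here the mean income per age group."""
    groups = {}
    for r in records:
        groups.setdefault(r["group"], []).append(r["income"])
    return {g: mean(v) for g, v in groups.items()}

def interpret(patterns):
    """Step 7: present the mined patterns in a readable form."""
    return [f"{g}: mean income {m}" for g, m in sorted(patterns.items())]

patterns = mine(reduce_features(clean(select(RAW, ["age", "income"]))))
report = interpret(patterns)
```

In practice the process is iterative: a result at the interpretation step may send the analyst back to selection or cleaning, which a straight function chain like this does not capture.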
6. The Data-Mining Step of the KD Process
• KD Goals :
1. Verification : The system is limited to verifying the user’s hypothesis
2. Discovery : The system autonomously finds new patterns
• Prediction : The system finds patterns for predicting the future
behavior of some entities
• Description : The system finds patterns for presentation to a user in a human-
understandable form
• Data mining involves fitting models to, or determining patterns from, observed data
6.1 Data-Mining Methods
• Primary Goals of Data Mining
1. Prediction : Uses some variables or fields in the database to predict unknown
or future values of other variables of interest
2. Description : Finds human-interpretable patterns describing the data
• Data-mining methods:
• Classification
• Regression
• Clustering
• Summarization
• Dependency modeling
• Change and deviation detection
1. Classification :
• It is learning a function that maps (classifies) a data item into one of several
predefined classes
• Fraud detection and credit risk applications are particularly well suited to this
type of analysis
• Types of classification models
1. Classification by decision tree induction
2. Bayesian Classification
3. Neural Networks
4. Support Vector Machines (SVM)
Figure 2: A Simple Linear Classification Boundary for the Loan Data Set. The
shaded region denotes class no loan
• x’s represent persons who have
defaulted on their loans
• o’s represent persons whose
loans are in good status with the
bank
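A linear classification boundary of the kind shown for the loan data can be learned in several ways. The sketch below uses the perceptron update rule, which is an assumption made for illustration (the slides do not say how the boundary was obtained). The (income, debt) pairs are hypothetical and chosen to be linearly separable.

```python
# Hypothetical (income, debt) pairs; +1 = loan in good status, -1 = defaulted.
DATA = [
    ((8, 2), 1), ((9, 3), 1), ((7, 1), 1), ((10, 4), 1), ((6, 1), 1),
    ((2, 6), -1), ((3, 8), -1), ((1, 5), -1), ((4, 9), -1), ((2, 7), -1),
]

def predict(weights, bias, point):
    """Return +1 or -1 depending on which side of the line the point falls."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else -1

def train_perceptron(data, max_epochs=1000):
    """Learn a linear boundary with the perceptron update rule."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for point, label in data:
            if predict(weights, bias, point) != label:
                # Move the boundary toward the misclassified point.
                weights = [w + label * x for w, x in zip(weights, point)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return weights, bias

weights, bias = train_perceptron(DATA)
```

The perceptron only stops with zero mistakes when the classes can be separated by a straight line; on overlapping data, as in a realistic loan set, it would simply run until `max_epochs`.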
2. Regression :
• It is learning a function that maps a data item to a real-valued prediction variable
• Linear regression establishes a relationship between a dependent variable (Y) and one or
more independent variables (X) using a best-fit straight line
• It is represented by the equation Y = a + b*X + e
• a is the intercept, b is the slope of the line, and e is the error term
• This equation can be used to predict the value of target variable based on
given predictor variable(s)
Figure 3: A Simple Linear Regression for the Weight and Height Data Set
https://www.analyticsvidhya.com/wp-content/uploads/2015/08/Linear_Regression1.png
• E.g.
1. Estimating the probability that a patient will survive given the results of a set
of diagnostic tests
2. Predicting the amount of biomass present in a forest given remotely sensed
microwave measurements
• Types of regression methods
1. Linear Regression
2. Multivariate Linear Regression
3. Nonlinear Regression
4. Multivariate Nonlinear Regression
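For the simplest case in this list, the straight-line model Y = a + b*X + e from the regression slide, the intercept and slope can be estimated by ordinary least squares. The sketch below is a minimal version for one predictor; the function names and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*X; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    """The error term e for each observation: observed minus predicted."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical observations that lie exactly on Y = 1 + 2*X.
XS = [1, 2, 3, 4]
YS = [3, 5, 7, 9]
a, b = fit_line(XS, YS)
errors = residuals(XS, YS, a, b)
```

With real measurements the residuals would not be zero; their size is what the error term e summarizes.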
3. Clustering :
• Clustering is the identification of groups (classes) of similar objects
• Clustering can identify dense and sparse regions in object space and can
discover overall distribution patterns and correlations among data attributes
• Types of Clustering models
1. Partitioning Methods
2. Hierarchical methods (agglomerative or divisive)
3. Density based methods
4. Grid-based methods
5. Model-based methods
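As one example of a partitioning method, k-means alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The sketch below is a minimal version; the (age, purchase power) values and the choice of initial centroids are hypothetical, and k-means itself is picked here only as a familiar representative of the family.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, max_iter=100):
    """Plain k-means starting from the given initial centroids."""
    centroids = [tuple(float(c) for c in centroid) for centroid in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical (age, purchase power) points forming three loose groups.
POINTS = [
    (20, 10), (22, 12), (21, 11),
    (40, 50), (42, 52), (41, 51),
    (60, 20), (62, 22), (61, 21),
]
centroids, clusters = k_means(POINTS, [POINTS[0], POINTS[3], POINTS[6]])
```

The result depends on the starting centroids; here one point from each visible group is used so that the run is deterministic.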
Figure 4: A Simple Clustering of the Age and Purchase Power Data Set into Three Clusters
4. Summarization :
• It involves methods for finding a compact description for a subset of data
• E.g.
• Tabulating the mean and standard deviations for all fields
• Discovery of functional relationships between variables
• Summarization techniques are often applied to interactive exploratory data
analysis and automated report generation
5. Change and deviation detection :
• Focuses on discovering the most significant changes in the data from previously
measured or normative values
6. Dependency modeling :
• Consists of finding a model that describes significant dependencies between
variables
• Dependency models exist at two levels :
• Structural level: specifies (often in graphic form) which variables are locally
dependent on each other
• Quantitative level: specifies the strengths of the dependencies using some
numeric scale
• E.g., based on historical sales data, retailers might find that customers who
buy beer often also buy cookies
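The two levels can be made concrete with a simple rule-based view of the beer and cookies example: the rule "beer implies cookies" is the structural part, and its support and confidence are one possible numeric scale for its strength. Support and confidence are borrowed here from association-rule mining as an illustration; the baskets are hypothetical.

```python
# Hypothetical market baskets, each a set of purchased items.
BASKETS = [
    {"beer", "cookies"},
    {"beer", "cookies", "milk"},
    {"beer", "chips"},
    {"milk", "cookies"},
    {"bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"beer", "cookies"}, BASKETS)
rule_confidence = confidence({"beer"}, {"cookies"}, BASKETS)
```

In this toy data two of the three beer baskets also contain cookies, so the dependency is strong but not absolute.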
6.2 The Components of Data-Mining Algorithms
• Three primary components in any data-mining algorithm:
1. Model representation : It is the language used to describe discoverable patterns
2. Model-evaluation criteria : Estimates how well a particular pattern (a model
and its parameters) meets the criteria of the KD process
3. Search method : consists of two components
1. Parameter search :
• It searches for the parameters which optimize the model evaluation criteria
given observed data and a fixed model representation
2. Model search :
• It occurs as a loop over the parameter search method
• The model representation is changed so that a family of models is considered
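The three components can be sketched together: a model representation (here two tiny families, a constant and a straight line), an evaluation criterion (mean squared error), and a search split into parameter search inside a model-search loop. Everything below is an illustrative assumption, including the integer parameter grids, the use of exhaustive grid search, and the held-out validation data used to compare families.

```python
from itertools import product

def squared_error(predict, params, data):
    """Model-evaluation criterion: mean squared error on (x, y) pairs."""
    return sum((y - predict(params, x)) ** 2 for x, y in data) / len(data)

# Model representations: name -> (prediction function, grid per parameter).
MODELS = {
    "constant": (lambda p, x: p[0], [range(0, 12)]),
    "linear": (lambda p, x: p[0] + p[1] * x, [range(0, 4), range(0, 4)]),
}

def parameter_search(predict, grids, train):
    """Fixed representation: try every grid point, keep the best parameters."""
    return min(product(*grids), key=lambda p: squared_error(predict, p, train))

def model_search(models, train, valid):
    """Loop over representations, running a parameter search for each one."""
    best = None
    for name, (predict, grids) in models.items():
        params = parameter_search(predict, grids, train)
        score = squared_error(predict, params, valid)
        if best is None or score < best[2]:
            best = (name, params, score)
    return best

# Hypothetical data generated by y = 1 + 2*x.
TRAIN = [(0, 1), (1, 3), (2, 5), (3, 7)]
VALID = [(4, 9), (5, 11)]
best_name, best_params, best_score = model_search(MODELS, TRAIN, VALID)
```

Real algorithms replace the exhaustive grid with closed-form solutions or iterative optimization, but the nesting of the two searches is the same.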
7. Some Data-Mining Methods
1. Decision Trees and Rules :
• An internal node is a test on an attribute
• A branch represents an outcome of the test, e.g., Color=red
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split training examples into distinct
classes as much as possible
• A new instance is classified by following a matching path to a leaf node
Figure 5: Weather Data
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
Figure 6: Weather Data Tree
Outlook
  sunny -> Humidity
    high -> No
    normal -> Yes
  overcast -> Yes
  rain -> Windy
    true -> No
    false -> Yes
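The choice of Outlook as the first test can be checked with a split criterion. The slides do not name the criterion used, so the sketch below assumes information gain (the entropy-based measure used by ID3-style learners) and applies it to the weather data of Figure 5.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]

# Weather data from Figure 5; the last column is the class label (Play?).
ROWS = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[index] for r in rows}:
        subset = [r[-1] for r in rows if r[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
```

Under this criterion Outlook gives the largest gain, which matches the root of the tree in Figure 6; the same calculation would then be repeated on the sunny and rain subsets.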
2. Nonlinear Regression and Classification Methods :
• These are techniques for prediction that fit linear and nonlinear combinations of
basis functions to combinations of the input variables
• E.g., feedforward neural networks, adaptive spline methods, and projection
pursuit regression
Figure 7: An Example of Classification Boundaries Learned by a Nonlinear
Classifier (Such as a Neural Network) for the Loan Data Set
3. Example-Based Methods :
• Predictions on new examples are derived from the properties of similar examples
in the model whose prediction is known
• E.g., nearest-neighbor classification and regression algorithms and case-based
reasoning systems
• Disadvantage:
• A well-defined distance metric for evaluating the distance between data points is
required
• E.g., if variables such as loan duration, sex, and profession were used, it would require more
effort to define a sensible metric
Figure 8: Classification Boundaries for a Nearest-Neighbor
Classifier for the Loan Data Set
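A nearest-neighbor classifier for data like the loan set can be sketched in a few lines. The (income, debt) examples below are hypothetical, and Euclidean distance is assumed as the metric, which is only reasonable because both variables are measured in the same units.

```python
import math
from collections import Counter

# Hypothetical stored examples: ((income, debt), loan status).
TRAIN = [
    ((8, 2), "good"), ((9, 3), "good"), ((7, 1), "good"),
    ((2, 6), "default"), ((3, 8), "default"), ((1, 5), "default"),
]

def knn_classify(query, examples, k=3):
    """Label the query by majority vote among its k nearest stored examples."""
    neighbours = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

label_a = knn_classify((8, 3), TRAIN)
label_b = knn_classify((2, 7), TRAIN)
```

Adding a categorical variable such as profession would break `math.dist`, which is exactly the metric-design problem noted in the disadvantage above.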
4. Probabilistic Graphic Dependency Models :
• They specify probabilistic dependencies between variables using a graph structure
• These models were initially developed within the framework of probabilistic
expert systems
• Model-evaluation criteria are typically Bayesian in form
• Parameter estimation can be a mixture of closed-form estimates and iterative
methods depending on whether a variable is directly observed or hidden
• Although still primarily in the research phase, these methods are of particular
interest to KD because the graphic form of the model lends itself easily to human interpretation
8. Research and Application Challenges
1. Larger Databases :
• Databases with hundreds of fields and tables, millions of records, and
multigigabyte size are commonplace, and terabyte databases are beginning to appear
• Possible solutions :
• More efficient algorithms, sampling, approximation, and massively parallel
processing
2. High Dimensionality :
• There can also be a large number of fields (attributes, variables), so the
dimensionality of the problem is high
• A high-dimensional data set creates problems by increasing the size of
the search space for model induction
• It increases the chances that a data-mining algorithm will find spurious
patterns
3. Overfitting
• It is a modeling error that occurs when a function is fit too closely to a limited
set of data points
• It results in poor performance of the model on test data
• Possible solutions :
• Cross-validation, regularization, and other sophisticated statistical strategies
Figure 9: Overfitting
https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/300px-Overfitting.svg.png
• The green line represents an
overfitted model and the black
line represents a regularised
model
• While the green line best follows
the training data, it is likely to
have a higher error rate on new
unseen data
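Cross-validation, listed among the possible solutions, exposes this gap by measuring error on data the model did not see during fitting. The sketch below is a minimal k-fold version; the "memorizer" model, the fold assignment by index, and the data values are all hypothetical choices made to keep the example small.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices); item i is held out in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def fit_memorizer(xs, ys):
    """A deliberately overfitted model: return the y of the nearest stored x."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_error(fit, xs, ys, k):
    """Average held-out error over k folds."""
    errors = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mean_squared_error(
            model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(errors) / len(errors)

# Hypothetical noisy observations.
XS = [0, 1, 2, 3, 4, 5, 6, 7]
YS = [0, 3, 1, 4, 2, 5, 3, 6]
train_error = mean_squared_error(fit_memorizer(XS, YS), XS, YS)
cv_error = cross_val_error(fit_memorizer, XS, YS, k=4)
```

The memorizer reproduces its training data perfectly, so its training error says nothing useful; only the held-out error reveals how it behaves on unseen points.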
4. Changing data and knowledge :
• Rapidly changing (nonstationary) data can make previously discovered patterns
invalid
• The variables measured in a given application database can be modified, deleted,
or augmented with new measurements over time
• Possible solutions :
• Incremental methods for updating the patterns
• Treating change as an opportunity for discovery by using it to cue the search
for patterns of change only
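A very small instance of an incremental method is a summary statistic that is updated one record at a time instead of being recomputed from the full database. The sketch below uses Welford's running mean and variance as a stand-in for "updating the patterns"; the class name and the stream values are hypothetical, and real pattern-updating methods are considerably more involved.

```python
class RunningStats:
    """Mean and sample variance updated incrementally (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new measurement into the summary without storing it."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

# Hypothetical stream of new measurements arriving over time.
STREAM = [2, 4, 4, 4, 5, 5, 7, 9]
stats = RunningStats()
for value in STREAM:
    stats.update(value)
```

Because each update touches only the stored summary, the cost of absorbing a new record does not grow with the size of the database.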
5. Missing and noisy data :
• This problem is especially acute in business databases
• U.S. census data reportedly have error rates as great as 20 percent in some fields
• Important attributes can be missing if the database was not designed with
discovery in mind
• Possible solutions :
• More sophisticated statistical strategies to identify hidden variables and
dependencies
6. Understandability of patterns :
• It is important to make the discoveries more understandable by humans
• Possible solutions :
• Graphic representations, rule structuring, natural language generation, and
techniques for visualization of data and knowledge
• Rule-refinement strategies can be used to address a related problem: the
discovered knowledge might be implicitly or explicitly redundant
7. Complex relationships between fields :
• Historically, data-mining algorithms have been developed for simple attribute-value records
• New techniques for deriving relations between variables are being developed
• Hierarchically structured attributes or values, relations between attributes, and
more sophisticated means for representing knowledge will require algorithms that
can effectively use such information
8. User interaction and prior knowledge :
• Current KD methods and tools are not truly interactive
• They cannot easily incorporate prior knowledge about a problem except in simple
ways
• The use of domain knowledge is important in all the steps of the KD process
• Bayesian approaches use prior probabilities over data and distributions as one
form of encoding prior knowledge
9. Integration with other systems :
• A standalone discovery system might not be very useful
• Typical integration issues include integration with a database management system,
spreadsheets, and visualization tools, and accommodating real-time sensor readings
9. Conclusion
1. Some definitions of basic notions in the KD field were presented
2. The relation between knowledge discovery and data mining was clarified
3. A brief overview of the KD process and basic data-mining methods was provided
4. Although various algorithms and applications might appear quite different on the
surface, they share many common components
5. Understanding data mining and model induction at this component level makes it
easier for the user to understand its overall applicability to the KD process
6. A common framework for the common overall goals and methods used in KD
was provided
DRUGS AND ITS classification slide share
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
 
writing about opinions about Australia the movie
writing about opinions about Australia the moviewriting about opinions about Australia the movie
writing about opinions about Australia the movie
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
Pollock and Snow "DEIA in the Scholarly Landscape, Session One: Setting Expec...
 
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPLAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UP
 
World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024World environment day ppt For 5 June 2024
World environment day ppt For 5 June 2024
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
 
Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17How to Fix the Import Error in the Odoo 17
How to Fix the Import Error in the Odoo 17
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
Smart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICTSmart-Money for SMC traders good time and ICT
Smart-Money for SMC traders good time and ICT
 
Digital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments UnitDigital Artifact 1 - 10VCD Environments Unit
Digital Artifact 1 - 10VCD Environments Unit
 

From data mining to knowledge discovery in

  • 1. From Data Mining to Knowledge Discovery in Databases. Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. AI Magazine Volume 17 Number 3 (1996) (© AAAI). Presented by: Raj Kumar Ranabhat, M.E. in Computer Engineering (I/II), Kathmandu University. 2/4/2018
  • 2. Table of Contents: 1. Introduction 2. Why Do We Need KD? 3. Data Mining and Knowledge Discovery in the Real World 4. Basic Definitions 5. The KD Process 6. The Data-Mining Step of the KD Process 6.1 Data-Mining Methods 6.2 The Components of Data-Mining Algorithms
  • 3. Contd... 7. Some Data-Mining Methods 7.1 Decision Trees and Rules 7.2 Nonlinear Regression and Classification Methods 7.3 Example-Based Methods 7.4 Probabilistic Graphic Dependency Models 8. Research and Application Challenges 9. Conclusion
  • 4. 1. Introduction • Across a wide variety of fields, data are being collected and accumulated at a dramatic pace • There is an urgent need to extract useful information (knowledge) from the rapidly growing volumes of digital data • The knowledge discovery (KD) field is concerned with the development of methods and techniques for making sense of data • The KD process maps low-level data into other forms that might be more compact, more abstract, or more useful
  • 5. 2. Why Do We Need KD? • The traditional method of turning data into knowledge relies on manual analysis and interpretation • E.g., in the health-care industry: • Specialists periodically analyze current trends and changes in health-care data • The specialists then provide a report detailing the analysis to the health-care organization • This report becomes the basis for future decision making and planning for health-care management • For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective
  • 6. Contd... • As data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains • Computational techniques are needed to unearth meaningful patterns and structures from the massive volumes of data • KD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload • Businesses use KD to gain competitive advantage, increase efficiency, and provide more valuable services to customers
  • 7. 3. Data Mining and KD in the Real World • KD applications have been deployed on large-scale real-world problems in science and in business • E.g., SKICAT, a system used by astronomers to perform image analysis, cataloging, and classification of sky objects from sky-survey images • It was used to process the 3 terabytes (10^12 bytes) of image data • It is estimated that on the order of 10^9 sky objects are detectable • SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects
  • 8. Contd... • KD application areas: 1. Marketing: • Analyze customer databases to identify different customer groups and forecast their behavior • E.g., if a customer bought X, he/she is also likely to buy Y and Z 2. Investment: • Numerous companies use data mining for investment • E.g., LBS Capital Management • Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million
  • 9. Contd... 3. Fraud detection: • The HNC Falcon and Nestor PRISM systems are used for monitoring credit card fraud, watching over millions of accounts • The FAIS system is used to identify financial transactions that might indicate money laundering activity 4. Manufacturing: • The CASSIOPEE troubleshooting system is used to diagnose and predict problems for the Boeing 737 • To derive families of faults, clustering methods are used • CASSIOPEE received the European first prize for innovative applications
  • 10. Contd... 5. Telecommunications: • The telecommunications alarm-sequence analyzer (TASA) locates frequently occurring alarm episodes in the alarm stream and presents them as rules 6. Data cleaning: • The MERGE-PURGE system was applied to the identification of duplicate welfare claims • Beyond these areas, IBM's ADVANCED SCOUT helps National Basketball Association (NBA) coaches organize and interpret data from NBA games
  • 11. 4. Basic Definitions • KD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data • Data are a set of facts • A pattern is an expression in some language describing a subset of the data or a model applicable to the subset • Process implies steps such as data preparation, search for patterns, knowledge evaluation, and refinement • Data mining is a step in the KD process that consists of applying data analysis and discovery algorithms to produce patterns (or models) over the data
  • 12. 5. The KD Process • The KDD process is interactive and iterative, involving numerous steps 1. Identifying the goal • Understanding of the application domain • Relevant prior knowledge 2. Creating a target data set • Selecting a data set or data samples on which discovery is to be performed 3. Data cleaning and preprocessing • Removing noise if appropriate • Deciding on strategies for handling missing data fields
  • 13. Contd... 4. Data reduction and projection • Finding useful features to represent the data depending on the goal of the task • With dimensionality reduction methods, the effective number of variables under consideration can be reduced 5. Exploratory analysis and model and hypothesis selection • Choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns 6. Data mining • Searching for patterns of interest in a particular representational form
  • 14. Contd... 7. Interpreting mined patterns • Visualization of the extracted patterns 8. Implementation • Using the knowledge directly • Incorporating the knowledge into another system for further action • Simply documenting it • Reporting it to interested parties
  • 15. Contd... Figure 1: An Overview of the Steps That Compose the KD Process
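The steps listed on the preceding slides can be read as a chain of small transformations. The following is a minimal sketch of that chain, not the authors' system: the records, the selection rule, the derived feature, and the "pattern" are all hypothetical placeholders chosen only to make each step concrete.

```python
from statistics import mean

# Hypothetical raw records: (age, income in thousands); None marks a missing field.
raw = [(25, 30.0), (40, None), (35, 52.0), (None, 41.0), (50, 75.0)]

def select_target(records):
    """Creating a target data set: keep records that have an age (made-up rule)."""
    return [r for r in records if r[0] is not None]

def clean(records):
    """Cleaning: one possible missing-data strategy, filling with the observed mean."""
    observed = [income for _, income in records if income is not None]
    fill = mean(observed)
    return [(age, income if income is not None else fill) for age, income in records]

def reduce_features(records):
    """Reduction and projection: map each record onto one derived feature."""
    return [income / age for age, income in records]

def mine(values):
    """Data mining: a deliberately trivial 'pattern', the values above the mean."""
    threshold = mean(values)
    return [v for v in values if v > threshold]

target = select_target(raw)
prepared = reduce_features(clean(target))
patterns = mine(prepared)
print(patterns)  # interpretation and reporting would follow from here
```

In practice the process is iterative: an unhelpful result at the mining step would send the analyst back to an earlier step, for example to choose a different feature.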
  • 16. 6. The Data-Mining Step of the KD Process • KD goals: 1. Verification: The system is limited to verifying the user's hypothesis 2. Discovery: The system autonomously finds new patterns • Prediction: The system finds patterns for predicting the future behavior of some entities • Description: The system finds patterns for presentation to a user in a human-understandable form • Data mining involves fitting models to, or determining patterns from, observed data
  • 17. 6.1 Data-Mining Methods • Primary goals of data mining: 1. Prediction: Uses some variables or fields in the database to predict unknown or future values of other variables of interest 2. Description: Finds human-interpretable patterns describing the data • Data-mining methods: • Classification • Regression • Clustering • Summarization • Dependency modeling • Change and deviation detection
  • 18. Contd... 1. Classification: • Classification is learning a function that maps (classifies) a data item into one of several predefined classes • Fraud detection and credit risk applications are particularly well suited to this type of analysis • Types of classification models: 1. Classification by decision tree induction 2. Bayesian classification 3. Neural networks 4. Support vector machines (SVM)
  • 19. Contd... Figure 2: A Simple Linear Classification Boundary for the Loan Data Set. The shaded region denotes class no loan • x's represent persons who have defaulted on their loans • o's represent persons whose loans are in good status with the bank
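A linear classification boundary such as the one in Figure 2 can be sketched as a weighted sum of the inputs compared against zero. The weights, bias, and applicant values below are hand-picked assumptions for illustration; they are not taken from the loan data set, and in a real classifier they would be learned from labeled examples.

```python
def classify(income, debt, w_income=1.0, w_debt=-1.5, bias=-10.0):
    """Return the class on either side of the line w_income*income + w_debt*debt + bias = 0."""
    score = w_income * income + w_debt * debt + bias
    return "loan" if score > 0 else "no loan"

# Hypothetical applicants: (income, debt), in the same units.
applicants = [(60, 20), (30, 30), (45, 10), (20, 15)]
for income, debt in applicants:
    print(income, debt, classify(income, debt))
```

Every point on one side of the line receives the same label, which is why a single straight boundary cannot separate classes that interleave.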
  • 20. Contd... 2. Regression: • Regression is learning a function that maps a data item to a real-valued prediction variable • It establishes a relationship between a dependent variable (Y) and one or more independent variables (X); in simple linear regression this relationship is a best-fit straight line • The line is represented by the equation Y = a + b*X + e • a is the intercept, b is the slope of the line, and e is the error term • This equation can be used to predict the value of the target variable from the given predictor variable(s)
  • 21. Contd... Figure 3: A Simple Linear Regression for the Weight and Height Data Set (https://www.analyticsvidhya.com/wp-content/uploads/2015/08/Linear_Regression1.png)
  • 22. Contd... • E.g.: 1. Estimating the probability that a patient will survive given the results of a set of diagnostic tests 2. Predicting the amount of biomass present in a forest given remotely sensed microwave measurements • Types of regression methods: 1. Linear regression 2. Multivariate linear regression 3. Nonlinear regression 4. Multivariate nonlinear regression
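For the single-predictor case, the intercept a and slope b in Y = a + b*X + e have a closed-form least-squares solution. The sketch below applies it to hypothetical height and weight values (made up for illustration, not the data behind Figure 3).

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*X: returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data: height in cm, weight in kg.
heights = [150, 160, 170, 180, 190]
weights = [52, 57, 67, 73, 83]

a, b = fit_line(heights, weights)
residuals = [y - (a + b * x) for x, y in zip(heights, weights)]  # the e terms
print(a, b)
print(a + b * 175)  # predicted weight for a new height
```

The residuals are the error term e for each observation; with an intercept in the model they sum to zero up to floating-point rounding.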
  • 23. Contd... 3. Clustering: • Clustering is the identification of classes of similar objects • Clustering can identify dense and sparse regions in object space and can discover the overall distribution pattern and correlations among data attributes • Types of clustering models: 1. Partitioning methods 2. Hierarchical methods (agglomerative or divisive) 3. Density-based methods 4. Grid-based methods 5. Model-based methods
  • 24. Contd... Figure 4: A Simple Clustering of the Age and Purchase Power Data Set into Three Clusters
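As one instance of a partitioning method, the sketch below runs a basic k-means loop on hypothetical (age, purchase power) points. The slides do not say which algorithm produced Figure 4, so k-means, the data, and the fixed starting centroids are all assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical (age, purchase power) points forming three loose groups.
points = [(20, 10), (22, 12), (24, 11),
          (40, 50), (42, 52), (44, 51),
          (65, 20), (67, 22), (66, 21)]
centroids, clusters = kmeans(points, centroids=[(20, 10), (40, 50), (65, 20)])
print(centroids)
```

Unlike classification, no class labels are supplied; the groups emerge only from the distances between points.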
  • 25. Contd... 4. Summarization: • It involves methods for finding a compact description for a subset of data • E.g.: • Tabulating the mean and standard deviations for all fields • Discovery of functional relationships between variables • Summarization techniques are often applied to interactive exploratory data analysis and automated report generation 5. Change and deviation detection: • Focuses on discovering the most significant changes in the data from previously measured or normative values
  • 26. Contd... 6. Dependency modeling: • Consists of finding a model that describes significant dependencies between variables • Dependency models exist at two levels: • Structural level: specifies (often in graphic form) which variables are locally dependent on each other • Quantitative level: specifies the strengths of the dependencies using some numeric scale • E.g., based on historical sales data, retailers might find out that customers always buy cookies when they buy beer
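The beer and cookies example can be made quantitative with two common measures from association-rule mining, support and confidence. The slides do not name these measures, so treating confidence as the "strength" of the dependency is an assumption, and the baskets are invented.

```python
# Hypothetical sales history: each set is one customer's basket.
baskets = [
    {"beer", "cookies"},
    {"beer", "cookies", "chips"},
    {"milk", "cookies"},
    {"beer", "cookies", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

beer_to_cookies = confidence({"beer"}, {"cookies"})
cookies_to_beer = confidence({"cookies"}, {"beer"})
print(beer_to_cookies, cookies_to_beer)
```

The structural level corresponds to which rule is stated (beer implies cookies); the quantitative level is the number attached to it. The two directions differ here, which shows that the dependency is not symmetric.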
  • 27. 6.2 The Components of Data-Mining Algorithms • Three primary components in any data-mining algorithm: 1. Model representation: The language used to describe discoverable patterns 2. Model-evaluation criteria: Estimate how well a particular pattern (a model and its parameters) meets the criteria of the KD process 3. Search method: Consists of two components 1. Parameter search: • Searches for the parameters that optimize the model-evaluation criteria given observed data and a fixed model representation 2. Model search: • Occurs as a loop over the parameter-search method • The model representation is changed so that a family of models is considered
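The three components can be seen together in a small sketch: two candidate representations (a constant and a straight line), squared error on held-out points as the evaluation criterion, and a model search that loops over a parameter search for each representation. The data and the choice of these two representations are assumptions for illustration.

```python
def fit_constant(xs, ys):
    """Parameter search for the representation Y = c (closed form: the mean)."""
    c = sum(ys) / len(ys)
    return lambda x: c

def fit_line(xs, ys):
    """Parameter search for the representation Y = a + b*X (least squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def squared_error(model, xs, ys):
    """Model-evaluation criterion: sum of squared prediction errors."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical observations, split into fitting and evaluation sets.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
valid_x, valid_y = [5, 6], [10.1, 11.9]

# Model search: an outer loop over representations, each with its own parameter search.
representations = {"constant": fit_constant, "line": fit_line}
best_name, best_score = None, float("inf")
for name, parameter_search in representations.items():
    model = parameter_search(train_x, train_y)
    score = squared_error(model, valid_x, valid_y)
    if score < best_score:
        best_name, best_score = name, score
print(best_name, best_score)
```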
  • 28. 7. Some Data-Mining Methods 1. Decision Trees and Rules: • An internal node is a test on an attribute • A branch represents an outcome of the test, e.g., Color = red • A leaf node represents a class label or class label distribution • At each node, one attribute is chosen to split the training examples into distinct classes as much as possible • A new instance is classified by following a matching path to a leaf node
  • 29. Contd... Figure 5: Weather Data
      Outlook   Temperature  Humidity  Windy  Play?
      sunny     hot          high      false  No
      sunny     hot          high      true   No
      overcast  hot          high      false  Yes
      rain      mild         high      false  Yes
      rain      cool         normal    false  Yes
      rain      cool         normal    true   No
      overcast  cool         normal    true   Yes
      sunny     mild         high      false  No
      sunny     cool         normal    false  Yes
      rain      mild         normal    false  Yes
      sunny     mild         normal    true   Yes
      overcast  mild         high      true   Yes
      overcast  hot          normal    false  Yes
      rain      mild         high      true   No
  • 30. Contd... Figure 6: Weather Data Tree
      Outlook = sunny: test Humidity (high: No; normal: Yes)
      Outlook = overcast: Yes
      Outlook = rain: test Windy (false: Yes; true: No)
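Classifying an instance by following a path from the root to a leaf can be sketched directly from the weather data. The tree below assumes Figure 6 tests Outlook at the root, then Humidity under sunny and Windy under rain, with overcast leading straight to Yes; that reading is consistent with all 14 rows of Figure 5, which the sketch checks.

```python
# Rows of the weather data: (outlook, temperature, humidity, windy, play).
weather = [
    ("sunny", "hot", "high", False, "No"),
    ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),
    ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),
    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"),
    ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),
    ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),
    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"),
    ("rain", "mild", "high", True, "No"),
]

def classify(outlook, humidity, windy):
    """Follow the matching path from the root (Outlook) down to a leaf label."""
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    if outlook == "overcast":
        return "Yes"
    # outlook == "rain"
    return "No" if windy else "Yes"

correct = sum(classify(o, h, w) == play for o, _, h, w, play in weather)
print(f"{correct} of {len(weather)} training rows classified correctly")
```

Temperature never appears in the tree: the other three attributes already separate the classes on this data.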
  • 31. Contd... 2. Nonlinear Regression and Classification Methods: • These are techniques for prediction that fit linear and nonlinear combinations of basis functions to combinations of the input variables • E.g., feedforward neural networks, adaptive spline methods, and projection pursuit regression
  • 32. Contd... Figure 7: An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set
  • 33. Contd... 3. Example-Based Methods: • Predictions on new examples are derived from the properties of similar examples in the model whose prediction is known • E.g., nearest-neighbor classification and regression algorithms and case-based reasoning systems • Disadvantages: • A well-defined distance metric for evaluating the distance between data points is required • E.g., if variables such as the duration of the loan, sex, and profession were included, it would require more effort to define a sensible metric
  • 34. Contd... Figure 8: Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set
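A nearest-neighbor classifier keeps the training examples themselves as the model and labels a new point with the class of its closest stored example. The sketch below uses Euclidean distance over hypothetical (income, debt) pairs, which is sensible only because both values are assumed to be in the same units.

```python
import math

# Hypothetical stored examples: ((income, debt), class label).
examples = [
    ((20, 30), "no loan"),
    ((25, 35), "no loan"),
    ((60, 10), "loan"),
    ((70, 20), "loan"),
]

def nearest_neighbor(query):
    """Return the label of the stored example closest to query."""
    _, label = min(examples, key=lambda example: math.dist(example[0], query))
    return label

print(nearest_neighbor((22, 33)))
print(nearest_neighbor((62, 12)))
```

Adding a categorical variable such as profession would break `math.dist` outright, which is the practical face of the distance-metric problem noted above.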
  • 35. Contd... 4. Probabilistic Graphic Dependency Models: • They specify probabilistic dependencies between variables using a graph structure • These models were initially developed within the framework of probabilistic expert systems • Model-evaluation criteria are typically Bayesian in form • Parameter estimation can be a mixture of closed-form estimates and iterative methods, depending on whether a variable is directly observed or hidden • Although these methods are still primarily in the research phase, the graphic form of the model lends itself easily to human interpretation, which makes them of particular interest to KD
  • 36. 8. Research and Application Challenges 1. Larger databases: • Databases with hundreds of fields and tables, millions of records, and multigigabyte size are beginning to appear • Possible solutions: • More efficient algorithms, sampling, approximation, and massively parallel processing 2. High dimensionality: • There can also be a large number of fields (attributes, variables), so the dimensionality of the problem is high
  • 37. Contd... • A high-dimensional data set creates problems by increasing the size of the search space for model induction • It increases the chances that a data-mining algorithm will find spurious patterns 3. Overfitting: • It is a modeling error that occurs when a function is fit too closely to a limited set of data points • It results in poor performance of the model on test data • Possible solutions: • Cross-validation, regularization, and other sophisticated statistical strategies
  • 38. Contd... Figure 9: Overfitting (https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/300px-Overfitting.svg.png) • The green line represents an overfitted model and the black line represents a regularised model • While the green line best follows the training data, it is likely to have a higher error rate on new unseen data
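The gap between training error and error on unseen data can be shown with a held-out test set, the simplest relative of cross-validation. In this sketch the data are invented (a trend of roughly Y = 2X with alternating noise), a memorizing model plays the overfitted role, and a least-squares line plays the simpler model; none of this comes from Figure 9.

```python
# Hypothetical noisy training data and clean held-out test data.
train_x, train_y = [1, 2, 3, 4, 5, 6], [2.5, 3.5, 6.5, 7.5, 10.5, 11.5]
test_x, test_y = [1.5, 2.5, 3.5, 4.5, 5.5], [3, 5, 7, 9, 11]

def memorizer(x):
    """Overfitted model: return the stored target of the closest training input."""
    _, y = min(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))
    return y

def fit_line(xs, ys):
    """Simpler model: least-squares line Y = a + b*X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

line = fit_line(train_x, train_y)
results = {
    "memorizer": (mean_squared_error(memorizer, train_x, train_y),
                  mean_squared_error(memorizer, test_x, test_y)),
    "line": (mean_squared_error(line, train_x, train_y),
             mean_squared_error(line, test_x, test_y)),
}
for name, (train_err, test_err) in results.items():
    print(name, train_err, test_err)
```

The memorizer reproduces the training data perfectly yet does worse than the line on the held-out points, which is the pattern the slide describes.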
  • 39. Contd... 4. Changing data and knowledge: • Rapidly changing (nonstationary) data can make previously discovered patterns invalid • The variables measured in a given application database can be modified, deleted, or augmented with new measurements over time • Possible solutions: • Incremental methods for updating the patterns • Treating change as an opportunity for discovery by using it to cue the search for patterns of change only
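An incremental method updates a pattern from each new record instead of recomputing it over the whole database. The sketch below does this for the simplest possible "pattern", a running mean; the class name and the readings are hypothetical.

```python
class RunningMean:
    """Maintain the mean of a stream without storing past values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # New mean = old mean + (value - old mean) / new count.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

tracker = RunningMean()
for reading in [2.0, 4.0, 6.0]:
    tracker.update(reading)
print(tracker.mean)
```

A large gap between a new value and the current mean could also serve as a cue that the data have changed, in the spirit of the second bullet above.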
  • 40. Contd... 5. Missing and noisy data: • This problem is especially acute in business databases • U.S. census data reportedly have error rates as great as 20 percent in some fields • Important attributes can be missing if the database was not designed with discovery in mind • Possible solutions: • More sophisticated statistical strategies to identify hidden variables and dependencies
  • 41. Contd... 6. Understandability of patterns: • It is important to make the discoveries more understandable by humans • Possible solutions: • Graphic representations, rule structuring, natural language generation, and techniques for visualization of data and knowledge • Rule-refinement strategies can be used to address a related problem: the discovered knowledge might be implicitly or explicitly redundant 7. Complex relationships between fields: • Historically, data-mining algorithms have been developed for simple attribute-value records • New techniques for deriving relations between variables are being developed
  • 42. Contd... • Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge will require algorithms that can effectively use such information 8. User interaction and prior knowledge: • Current KD methods and tools are not truly interactive • They cannot easily incorporate prior knowledge about a problem except in simple ways • The use of domain knowledge is important in all the steps of the KD process • Bayesian approaches use prior probabilities over data and distributions as one form of encoding prior knowledge
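One standard way to encode prior knowledge as a prior probability is the Beta prior for an unknown rate. The sketch below assumes a hypothetical analyst who believes the loan default rate is about 20 percent and then observes a small sample; the numbers and the choice of a Beta prior are illustrative assumptions, not something the slides specify.

```python
def posterior_beta(prior_alpha, prior_beta, successes, trials):
    """Update a Beta(prior_alpha, prior_beta) prior with binomial observations."""
    return prior_alpha + successes, prior_beta + (trials - successes)

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Prior knowledge: default rate around 0.2, encoded as Beta(2, 8).
prior = (2, 8)
# Hypothetical new data: 3 defaults among 10 loans.
posterior = posterior_beta(*prior, successes=3, trials=10)

prior_mean = beta_mean(*prior)
posterior_mean = beta_mean(*posterior)
print(prior_mean, posterior_mean)
```

The posterior mean lies between the prior belief (0.2) and the observed rate (0.3); a stronger prior, such as Beta(20, 80), would pull it closer to 0.2.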
  • 43. Contd... 9. Integration with other systems: • A standalone discovery system might not be very useful • Typical integration issues include integration with a database management system, integration with spreadsheets and visualization tools, and accommodating real-time sensor readings
  • 44. 9. Conclusion 1. Some definitions of basic notions in the KD field were presented 2. The relation between knowledge discovery and data mining was clarified 3. A brief overview of the KD process and basic data-mining methods was provided 4. Although various algorithms and applications might appear quite different on the surface, they share many common components 5. Understanding data mining and model induction at this component level makes it easier for the user to understand its overall applicability to the KD process 6. A common framework for the overall goals and methods used in KDD was provided