The document discusses data mining and provides details on key aspects of the data mining process. It describes how data mining is used to extract knowledge from data and identifies the main steps in the knowledge discovery process as data cleaning, integration, selection, transformation, mining, pattern evaluation and presentation. It also outlines different types of data that can be mined, including relational databases, data warehouses, transactional data, and advanced database systems. Common data mining techniques are discussed like classification, clustering, association rule mining and anomaly detection.
4. • KDD Process Steps
• 1) Data Cleaning
• 2) Data Integration
• 3) Data Selection
• 4) Data transformation
• 5) Data mining
• 6) Pattern evaluation
• 7) Knowledge Presentation
Ajith G.S: poposir.orgfree.com
Data Mining
5. • KDD Process Steps
• 1) Data Cleaning – remove noise and inconsistent data
• 2) Data Integration – combine multiple data sources
• 3) Data Selection – select the data relevant to the analysis
• 4) Data Transformation – convert the data into the form needed for mining
• 5) Data Mining – apply methods to extract data patterns
• 6) Pattern Evaluation – identify the interesting patterns that represent
knowledge
• 7) Knowledge Presentation – present the mined knowledge using visualization techniques
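
The seven steps above can be read as a chain of small functions, each feeding the next. The following is only a minimal sketch on made-up records: the field names (`age`, `spend`), the age bands, and the `min_avg` threshold are all assumptions chosen for illustration, not part of the slides.

```python
# Hypothetical toy records from two sources; field names are illustrative only.
source_a = [{"id": 1, "age": 34, "spend": 120.0}, {"id": 2, "age": None, "spend": 80.0}]
source_b = [{"id": 3, "age": 51, "spend": 300.0}, {"id": 4, "age": 29, "spend": -5.0}]

def clean(records):
    """Step 1: drop records with missing or inconsistent (negative) values."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]

def integrate(*sources):
    """Step 2: combine multiple data sources into one list."""
    return [r for src in sources for r in src]

def select(records, fields):
    """Step 3: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: convert into the form needed for mining (here: age bands)."""
    return [
        {"age_band": "under_40" if r["age"] < 40 else "40_plus", "spend": r["spend"]}
        for r in records
    ]

def mine(records):
    """Step 5: extract a simple pattern, the average spend per age band."""
    groups = {}
    for r in records:
        groups.setdefault(r["age_band"], []).append(r["spend"])
    return {band: sum(v) / len(v) for band, v in groups.items()}

def evaluate(patterns, min_avg):
    """Step 6: keep only the patterns that pass an interestingness threshold."""
    return {band: avg for band, avg in patterns.items() if avg >= min_avg}

def present(patterns):
    """Step 7: render the retained patterns for the user."""
    return [f"{band}: average spend {avg:.2f}" for band, avg in sorted(patterns.items())]

cleaned = integrate(clean(source_a), clean(source_b))
selected = select(cleaned, ["age", "spend"])
patterns = evaluate(mine(transform(selected)), min_avg=100.0)
report = present(patterns)
```

Real systems often iterate over these steps rather than running them once in a straight line; the linear chain here is a simplification.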
6. • Data mining is one step in the knowledge discovery process
7. • Architecture of data mining system
8. • Architecture of data mining system
• The components are:
• Database, data warehouse, World Wide Web, or other
information repository
• - data cleaning and integration techniques may be performed
on the data
• Database or data warehouse server
• - responsible for fetching the relevant data
9. • Architecture of data mining system
• Knowledge base
• - domain knowledge used to guide the search
• Data mining engine
• - modules for tasks such as characterization, association and correlation analysis,
classification, ...
• Pattern evaluation module
• - selects the patterns of interest
• User interface
• - communication between the user and the data mining system
10. • Data mining can be performed on a number of different data repositories.
• It is applicable to any kind of repository, as well as to data streams.
• Data repositories include:
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced database systems
• Flat files
• Data streams
• WWW
Data Mining- On What Kinds of Data
11. • Advanced database systems include:
• Object relational databases
• Temporal, sequence and time series databases
• Spatial databases
• Multimedia databases
12. • Relational Databases
• DBMS - a collection of interrelated data plus a set of software programs
to access and manage the data
• Relational database - a collection of tables, each of which is
assigned a unique name
• Each table consists of a set of attributes and stores a large set of
tuples
• Each tuple represents an object identified by a unique key and described
by a set of attribute values
13. • Relational Databases
• Relational data can be accessed with a relational query language
such as SQL, or with the assistance of a GUI.
• A given query is transformed into relational operations such as
join, selection and projection.
• Data mining in relational databases: searching for data
patterns. Example: predicting the credit risk of new customers
based on the data available in the database.
• Relational databases are among the most commonly available and richest
information repositories.
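
To make the three relational operations concrete, the sketch below runs one SQL query with Python's standard `sqlite3` module and marks which clause corresponds to join, selection and projection. The `customer` and `loan` tables, their columns and the income cut-off are hypothetical and exist only for this illustration.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, income REAL);
    CREATE TABLE loan (loan_id INTEGER PRIMARY KEY, cust_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Asha', 52000), (2, 'Ben', 18000), (3, 'Chen', 75000);
    INSERT INTO loan VALUES (10, 1, 5000), (11, 2, 12000), (12, 3, 3000);
""")

rows = conn.execute("""
    SELECT c.name, l.amount                    -- projection: keep two attributes
    FROM customer AS c
    JOIN loan AS l ON l.cust_id = c.cust_id    -- join: match tuples on the key
    WHERE c.income > 50000                     -- selection: keep qualifying tuples
    ORDER BY c.name
""").fetchall()
conn.close()
```

A query like this only retrieves stored facts; a mining task such as credit-risk prediction would go further and learn a pattern from the retrieved tuples.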
14. • Data Warehouse
• A repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a
single site.
• Constructed through data cleaning, integration,
transformation, loading and periodic data refreshing.
• Rather than storing every detail, a data warehouse may store a
summary of the data from a historical perspective.
15. • Data Warehouse
• Multidimensional database structure. Dimension: an attribute
or a set of attributes in the schema. Cell: stores an aggregate measure.
• Usually modeled by a multidimensional data cube.
• Data mart: a departmental subset of a data warehouse that
focuses on selected subjects.
• OLAP operations: roll-up, drill-down.
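
As a rough illustration of cells, dimensions and roll-up, the sketch below aggregates a handful of made-up fact rows at two levels of detail. The dimension names, the sales figures and the `aggregate` helper are assumptions for this example; roll-up is shown here only in its dimension-reduction form (summing a dimension away), not as climbing a concept hierarchy.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, item, units_sold).
facts = [
    ("Kochi", "Q1", "printer", 10),
    ("Kochi", "Q2", "printer", 15),
    ("Kochi", "Q1", "computer", 7),
    ("Delhi", "Q1", "printer", 20),
    ("Delhi", "Q2", "computer", 5),
]
DIMENSIONS = ("city", "quarter", "item")

def aggregate(rows, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    cells = defaultdict(int)
    for row in rows:
        cells[tuple(row[i] for i in idx)] += row[-1]
    return dict(cells)

# Finer view: one cell per (city, quarter); the item dimension is summed away.
by_city_quarter = aggregate(facts, ("city", "quarter"))
# Roll-up: the quarter dimension is summed away as well.
by_city = aggregate(facts, ("city",))
```

Moving from `by_city` back to `by_city_quarter` corresponds to a drill-down: the same measure viewed at a more detailed level.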
16. • Typical framework of a data warehouse
17. • Multidimensional data cube
18. • Transactional Database
• Consists of a file where each record represents a transaction.
• Each record includes a unique transaction identity number and a list of the items
making up the transaction.
• Example: a transactional database for sales. “Which items sold
well together?” Data mining on transactional data answers this by identifying
frequent itemsets.
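
The "which items sold well together?" question can be sketched by counting how often each pair of items shares a transaction. The transaction IDs, items and the `min_count` threshold below are made up, and this brute-force pair count is only an illustration, not a scalable frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional database: transaction ID -> list of items.
transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["bread", "milk"],
    "T300": ["milk", "eggs"],
    "T400": ["bread", "milk", "eggs"],
}

def frequent_pairs(db, min_count):
    """Return the item pairs that occur together in at least `min_count` transactions."""
    counts = Counter()
    for items in db.values():
        # Sort so that each pair has one canonical order; set() ignores repeats.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

result = frequent_pairs(transactions, min_count=2)
```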
19. • Advanced Data and Information Systems and Advanced
Applications
• Object Relational Databases
• Temporal Databases, Sequence Databases and Time-Series
Databases
• Spatial Databases and Spatio-temporal Databases
• Text Databases and Multimedia Databases
• Heterogeneous Databases and Legacy Databases
• Data Streams
• WWW
20. • Advanced Data and Information Systems and Advanced Applications
• Object Relational Databases
• Handles complex objects
• Each entity is considered an object: individual items, employees, etc.
• Data and code relating to an object are encapsulated into a single unit
• Each object has
• A set of variables: the attributes
• A set of messages: used to communicate with other objects
• A set of methods: each holds the code to implement a message
• Object class: a group of objects that share a common set of properties
• Each object is an instance of a class.
21. • Advanced Data and Information Systems and Advanced Applications
• Temporal Databases, Sequence Databases and Time-Series Databases
• Temporal databases handle data involving time: they store relational data
that includes time-related attributes.
• Sequence databases store sequences of ordered events, with or without a
concrete notion of time. Example: customer shopping sequences.
• Time-series databases store sequences of values or events obtained over
repeated measurements of time. Example: data collected from the stock
exchange.
• Data mining techniques can be used to find the trends of change for
objects in the database.
22. • Advanced Data and Information Systems and Advanced
Applications
• Spatial Databases and Spatio-temporal Databases
• A spatial database contains objects defined in geometric space.
Examples: maps, CAD databases.
• Using data mining, the relationships among a set of spatial
objects can be examined.
• Spatio-temporal databases: spatial databases that store spatial
objects that change with time. Example: tracking of moving
vehicles.
23. • Advanced Data and Information Systems and Advanced
Applications
• Text Databases and Multimedia Databases
• Text databases contain word descriptions of objects, usually long
sentences or paragraphs. Example: product specifications.
• Text databases may be highly unstructured (Web pages on the WWW),
semi-structured (email) or well structured.
• By mining text data we can uncover general and concise
descriptions of the text documents, keywords, etc.
• Multimedia databases store image, audio and video data and must
support large objects.
24. • Advanced Data and Information Systems and Advanced
Applications
• Heterogeneous Databases and Legacy databases
• A heterogeneous database consists of a set of interconnected
component databases, where the objects in the component
databases may differ greatly.
• A legacy database is a group of heterogeneous databases.
• Information exchange across these databases is very difficult
because of their diverse semantics. Data mining offers a solution by
transforming the data into higher, more generalized levels.
25. • Advanced Data and Information Systems and Advanced
Applications
• Data Streams
• A new kind of data, where the data flow in and out of an
observation platform dynamically.
• Example: video surveillance
• Data streams are normally not stored in any kind of
repository, which poses challenges for management and analysis.
• Uses a continuous query model
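
One way to picture a continuous query is a computation that is re-evaluated every time a new reading arrives while keeping only a bounded window of recent data, since the full stream is never stored. The sketch below is a simplified illustration under that assumption; the `sliding_average` name, the window size and the readings are invented for the example.

```python
from collections import deque

def sliding_average(stream, window):
    """Yield the mean of the most recent `window` readings after each arrival."""
    recent = deque(maxlen=window)  # older readings are discarded automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)

# An iterator stands in for a live stream that can only be read once.
readings = iter([10, 20, 30, 40])
averages = list(sliding_average(readings, window=2))
```

Because the generator holds at most `window` readings, its memory use stays constant however long the stream runs.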
26. • Advanced Data and Information Systems and Advanced Applications
• World Wide Web
• Data objects are linked together to facilitate interactive access.
• An opportunity as well as a challenge for data mining
• Web usage mining: capturing user access patterns in a distributed
information environment
• Keyword-based search offers limited help to users
• Authoritative web page analysis: ranks web pages based on their
importance
• Automated web page clustering and classification: arranges web pages
based on their contents
• Web community analysis: identifies hidden social networks and
communities
27. • What kinds of patterns can be mined?
• Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks.
• Tasks can be classified into two categories:
• Descriptive: characterizes the general properties of the data in the
database
• Predictive: performs inference on the current data in order to
make predictions
Data Mining Functionalities
28. • Concept/ Class Description: Characterization and
Discrimination
• Mining frequent Patterns, Association and Correlations
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
29. • Concept/ Class Description: Characterization and Discrimination
• Data can be associated with classes or concepts.
• Example:
• classes of items for sale: computers and printers
• concepts of customers: big spenders and budget spenders
• Using precise terms we can describe individual classes and concepts.
• Such descriptions of a class or a concept are called class/concept descriptions
• These descriptions can be derived via
• Data Characterization − summarizing the data of the class under study (the
target class)
• Data Discrimination − comparing the target class with one or a set of
comparative classes (the contrasting classes)
• Both of the above methods combined
30. • Mining frequent Patterns
• Patterns that occur frequently in transactional data.
• Frequent Item Set − a set of items that frequently
appear together, e.g. milk and bread
• Frequent Subsequence − a sequence of events that occurs
frequently, e.g. purchasing a camera followed by a memory card
• Frequent Substructure − a substructure can refer to different structural
forms, such as graphs, trees, or lattices, which may be combined
with itemsets or subsequences.
• Mining frequent patterns leads to the discovery of interesting
associations and correlations within the data
31. • Association and Correlations
• Association rules are of two types:
• Single-dimensional association rules: involve a single attribute or predicate
• Multi-dimensional association rules: involve more than one attribute or predicate
32. • Association and Correlations
• Association rules are discarded as uninteresting if they do
not satisfy both a minimum support threshold and a minimum
confidence threshold.
• Confidence: the certainty of the rule, i.e. how often transactions that contain the antecedent also contain the consequent
• Support: how frequently the items in the rule appear together in the
database
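
For a rule A ⇒ B, support is the fraction of transactions containing both A and B, and confidence is the fraction of transactions containing A that also contain B. The sketch below computes both on a made-up list of transactions; the items and the two thresholds are assumptions chosen only to show the filtering step.

```python
# Hypothetical transactions, each represented as a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"milk", "bread", "eggs"},
]

def count(itemset, db):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, db) / len(db)

def confidence(antecedent, consequent, db):
    """Fraction of transactions with `antecedent` that also contain `consequent`."""
    return count(antecedent | consequent, db) / count(antecedent, db)

def is_interesting(antecedent, consequent, db, min_support, min_confidence):
    """A rule is kept only if it meets both thresholds."""
    return (support(antecedent | consequent, db) >= min_support
            and confidence(antecedent, consequent, db) >= min_confidence)

rule_support = support({"milk", "bread"}, transactions)          # 3 of 5 transactions
rule_confidence = confidence({"milk"}, {"bread"}, transactions)  # 3 of 4 milk transactions
kept = is_interesting({"milk"}, {"bread"}, transactions, 0.5, 0.7)
dropped = is_interesting({"milk"}, {"eggs"}, transactions, 0.5, 0.7)
```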
33. • Classification
• Classification is the process of finding a model that describes and distinguishes data
classes or concepts.
• The derived model is based on the analysis of a set of training data, i.e. objects with known
class labels.
• The model is then used to predict the class of objects whose class label is
unknown.
• The derived model can be presented in the following forms −
• (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
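
As a small illustration of deriving a model from labeled training data and presenting it as an IF-THEN rule, the sketch below learns a single income threshold. The training tuples, the "high"/"low" risk labels and the fixed direction of the rule (below the threshold means high risk) are assumptions for this example; it is a one-level decision stump, far simpler than the models used in practice.

```python
# Hypothetical training data: (income in thousands, known credit-risk label).
training = [(15, "high"), (22, "high"), (30, "high"), (45, "low"), (60, "low"), (80, "low")]

def learn_threshold(data):
    """Pick the income threshold that misclassifies the fewest training tuples."""
    values = sorted(x for x, _ in data)
    # Candidate thresholds are the midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum(("high" if x < t else "low") != label for x, label in data)

    return min(candidates, key=errors)

threshold = learn_threshold(training)

def classify(income):
    """Derived rule: IF income < threshold THEN risk = high ELSE risk = low."""
    return "high" if income < threshold else "low"

prediction_a = classify(28)  # an object whose class label is unknown
prediction_b = classify(50)
```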
35. • Prediction
• Models continuous-valued functions
• Used to predict missing or unavailable numerical data
values rather than class labels.
• Regression Analysis is generally used for prediction.
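
A minimal sketch of prediction by regression: fit a straight line to known numeric values with ordinary least squares, then use it to estimate an unavailable value. The data points are made up, and simple linear regression is just one of several regression methods.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical observations that happen to lie exactly on a line.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Predict the numeric value for an x that was never observed.
predicted = slope * 5 + intercept
```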
36. • Cluster Analysis
• Analyzes data objects without consulting a known class label
• The objects are clustered or grouped based on the principle of
“maximizing the intraclass similarity and minimizing the
interclass similarity”
• Objects within a cluster are highly similar to one another but
dissimilar to objects in other clusters
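
The grouping principle can be illustrated with a bare-bones k-means on one-dimensional points. k-means is only one of many clustering methods and is not named in the slides; the points, the starting centers and the fixed iteration count are assumptions for this sketch.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# No class labels are given; the groups emerge from the distances alone.
points = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(points, centers=[1.0, 12.0])
```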
38. • Outlier Analysis
• Outliers: data objects in a database that do not comply with the
general behavior or model of the data.
• In some applications, such as fraud detection, the rare events can be more interesting
than the regularly occurring ones. The analysis of outlier data is called outlier
mining.
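
One simple, statistics-based way to flag objects that deviate from the general behavior is a z-score test, sketched below. The transaction amounts and the threshold of 2.0 are assumptions for illustration; this is not presented as the method the slides have in mind, and other outlier-detection approaches (distance-based, deviation-based) exist.

```python
from statistics import mean, pstdev

def outliers(values, z_threshold=2.0):
    """Return the values lying more than `z_threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

# Hypothetical card transaction amounts with one unusual charge.
amounts = [20, 22, 19, 21, 23, 20, 500]
flagged = outliers(amounts)
```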
39. • Evolution Analysis
• Evolution analysis describes and models
regularities or trends for objects whose behavior changes over
time.
41. • Classification according to the kinds of database mined
• Data models (Relational, Transactional, Object relational)
• Type of data (spatial, time series, text, stream, multimedia,
WWW)
• Classification according to the kinds of knowledge mined
• Based on different data mining functionalities
• According to the level of abstraction of knowledge mined
• According to the regularity or irregularity of data that is mined
Data Mining Classification of Data Mining System
42. • Classification according to the kinds of techniques utilized
• Degree of user interactions involved
• Methods of data analysis involved (database oriented or data
warehouse oriented etc)
• Classification according to the applications adapted
• Finance
• Telecommunications
• DNA
43. • Each user has a data mining task in mind, which is performed with the
help of a data mining query.
• A query is defined in terms of data mining task primitives, which allow the
user to interact with the data mining system.
• DMQL: Data Mining Query Language
Data Mining Task Primitives
44. • The primitives specify
• The set of task relevant data to be mined
• Specifies the portions of the database or the set of data in which the
user is interested
• It includes
• Database or data warehouse name
• Database tables or Data warehouse cubes
• Conditions for data selection
• Relevant attributes or dimensions
• Data grouping criteria
45. • The primitives specify
• The kind of knowledge to be mined
• Specifies the data mining functions to be performed
• Characterization
• Discrimination
• Association/ Correlation
• Classification/Prediction
• Clustering
• Outlier or Evolution Analysis
46. • The primitives specify
• The background knowledge to be used in the discovery process
• Knowledge about the domain to be mined
• Guides the knowledge discovery process and evaluations of
the patterns found
• User beliefs regarding the relationships in the data
47. • The primitives specify
• The interestingness measures and threshold for pattern
evaluation
• Used to guide the mining process or evaluation of the
discovered patterns
• Different kinds of knowledge have different interestingness
measures
• e.g.
• Support
• Confidence
48. • The primitives specify
• The expected representation for visualizing the discovered patterns
• Refers to the form in which discovered patterns are to be displayed
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
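
The five task primitives can be pictured as the fields of a single query object. The sketch below is not DMQL syntax; it is a hypothetical Python structure, and every name in it (`MiningTask`, `sales_db`, the table and attribute names, the threshold values) is invented to show how the primitives fit together.

```python
from dataclasses import dataclass, field

@dataclass
class MiningTask:
    """One data mining query, described by the five task primitives."""
    relevant_data: dict                                        # task-relevant data
    knowledge_kind: str                                        # kind of knowledge to be mined
    background_knowledge: list = field(default_factory=list)   # domain knowledge, user beliefs
    interestingness: dict = field(default_factory=dict)        # measures and thresholds
    presentation: list = field(default_factory=lambda: ["rules"])  # output forms

task = MiningTask(
    relevant_data={
        "database": "sales_db",                    # hypothetical database name
        "tables": ["customer", "purchase"],
        "condition": "purchase.amount > 100",
        "attributes": ["age", "income", "item"],
        "group_by": ["item"],
    },
    knowledge_kind="association",
    background_knowledge=["item hierarchy: printer < peripherals < electronics"],
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
    presentation=["rules", "tables"],
)
```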
49. • Integration of Data Mining System with Database or Data
Warehouse System
50. • When a DM system works in an environment, it is required to communicate
with other information system components, such as DB and DW systems
• The different integration schemes are
• No coupling
• Loose coupling
• Semi tight coupling
• Tight coupling
Integration of Data Mining System with Database or Data Warehouse System
51. • No coupling
• The DM system does not use any facilities of a DB / DW system
• It fetches data from a particular source (such as a file), processes the data
and stores the results in another file.
• Simple integration scheme
• Drawbacks
• Time is wasted on preprocessing the data
• Other tools must be used to extract the data
• Poor design
52. • Loose coupling
• The data mining system uses some facilities of a DB / DW
system
• It fetches data from a data repository, processes the data and
stores the results in the DB or DW
• It fetches the data using query processing, indexing and other
DB/DW system facilities
• Drawback
• Difficult to achieve high scalability and good performance with
large data sets
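
The loose-coupling pattern (fetch through the database, mine outside it, store the results back) can be sketched with Python's standard `sqlite3` module. The `sales` and `item_counts` tables are hypothetical, and the "mining" step is reduced to a simple item count to keep the example short.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (trans_id INTEGER, item TEXT);
    CREATE TABLE item_counts (item TEXT PRIMARY KEY, n INTEGER);
    INSERT INTO sales VALUES (1, 'milk'), (1, 'bread'), (2, 'milk'), (3, 'milk'), (3, 'eggs');
""")

# 1. Fetch the data using the database's own query processing.
items = [row[0] for row in conn.execute("SELECT item FROM sales")]

# 2. Mine outside the database, in application memory.
counts = Counter(items)

# 3. Store the results back in the database.
conn.executemany("INSERT INTO item_counts VALUES (?, ?)", counts.items())
conn.commit()

stored = conn.execute("SELECT item, n FROM item_counts ORDER BY n DESC, item").fetchall()
conn.close()
```

Step 2 is where the drawback noted above shows up: all fetched data must fit in the mining program's memory, which limits scalability on large data sets.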
53. • Semi tight coupling
• Essential data mining primitives are provided in the DB/DW system
• Sorting
• Indexing
• Aggregation
• Histogram Analysis
• Pre-computation of statistical measures
• Some frequently used intermediate mining results can also be pre-computed
and stored in the DB/DW system.
• This design enhances the performance of the DM system
54. • Tight coupling
• The DM system is smoothly integrated into the DB/DW system
• The DM system is treated as one functional component of an
information system
• Data mining queries and functions are optimized using the
methods of the DB/DW system, such as its indexing schemes and query processing methods.
55. • Data mining is not an easy task:
• The algorithms used can be very complex, and the data is not always available in
one place
• The data needs to be integrated from various heterogeneous data
sources.
• Common issues are
• Mining methodology and user interaction Issues
• Performance Issues
• Issues related to the different types of database
Issues in Data Mining
56. • Mining different kinds of knowledge in the databases
• Different users may be interested in different kinds of knowledge,
so data mining should cover a broad range of knowledge discovery
tasks (classification, clustering, etc.)
• These tasks may use the same database in different ways
• Interactive mining of knowledge at multiple levels of abstraction
• The data mining process needs to be interactive, allowing users to
focus the search for patterns, providing and refining data mining
requests based on the returned results.
• This enables the user to view the data from different angles and levels
of abstraction
Issues in Data Mining Mining methodology and user interaction Issues
57. • Incorporation of background knowledge (knowledge about the
domain under study)
• Background knowledge can be used to guide the discovery process and to express
the discovered patterns, not only in concise terms but also at multiple levels of
abstraction.
• Data mining query languages and ad hoc data mining
• A data mining query language that allows the user to describe ad hoc
mining tasks should be developed.
• Such a language should be integrated with a database or data
warehouse query language and optimized for efficient and flexible
data mining.
58. • Presentation and visualization of data mining results
• Once patterns are discovered, they need to be expressed in
high-level languages and visual representations.
• These representations should be easily understandable
• Handling noisy and incomplete data
• Data cleaning methods are required to handle noise
and incomplete objects while mining the data regularities.
• Without data cleaning methods, the accuracy
of the discovered patterns will be poor
59. • Pattern evaluation
• The patterns discovered may be uninteresting because
they represent common knowledge or lack novelty
• Interestingness measures or user-specified constraints should
be used to guide the discovery process and reduce the search
space.
60. • Efficiency and scalability of data mining algorithms
• To effectively extract information from the huge
amount of data in databases, data mining algorithms must be efficient and scalable.
• The running time must be predictable and scalable.
• Parallel, distributed and incremental mining algorithms
• These algorithms divide the data into partitions, which are
then processed in parallel.
• The results from the partitions are then merged.
• Incremental algorithms incorporate database updates without mining
the data again from scratch.
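
The partition, merge and incremental-update ideas can be sketched with item counts, which combine by simple addition. The partitions and the new batch below are made up, and the partitions are processed one after another here; in a real parallel or distributed setting each call to `count_partition` could run on a separate worker.

```python
from collections import Counter

def count_partition(transactions):
    """Mine one partition independently: count the transactions containing each item."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts

def merge(partials):
    """Merge the per-partition results into one global result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Hypothetical data set split into two partitions.
partitions = [
    [["milk", "bread"], ["milk"]],
    [["bread", "eggs"], ["milk", "bread"]],
]
counts = merge(count_partition(p) for p in partitions)
before_update = dict(counts)

# Incremental step: only the newly arrived transactions are scanned.
new_batch = [["eggs"], ["milk", "eggs"]]
counts += count_partition(new_batch)
after_update = dict(counts)
```

This works because counts are additive; patterns that do not decompose so cleanly need more elaborate merge and update logic.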
Issues in Data Mining Performance Issues
61. • Handling of relational and complex types of data
• The database may contain complex data objects, multimedia data
objects, spatial data, temporal data, etc.
• It is not realistic for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global
information systems
• The data is available at different data sources on a LAN or WAN.
• These data sources may be structured, semi-structured or
unstructured.
• Therefore mining knowledge from them adds challenges to
data mining.
Issues in Data Mining Issues relating to the diversity of database types