The document discusses data mining and provides details on key aspects of the data mining process. It describes how data mining is used to extract knowledge from data and identifies the main steps in the knowledge discovery process as data cleaning, integration, selection, transformation, mining, pattern evaluation and presentation. It also outlines different types of data that can be mined, including relational databases, data warehouses, transactional data, and advanced database systems. Common data mining techniques are discussed like classification, clustering, association rule mining and anomaly detection.
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by Vikas Jagtap
Data that indicates the earth location (latitude & longitude, or height & depth) of rendered map objects is known as spatial data.
When the map is rendered, this spatial data is used to project the locations of the objects onto a 2-dimensional piece of paper.
Spatial data management systems are designed to make the storage, retrieval, and manipulation of spatial data (i.e., points, lines and polygons) easier and more natural for users of applications such as GIS.
While typical databases can handle various numeric and character types of data, additional functionality needs to be added for databases to process spatial data types.
These spatial types are typically called geometry or feature types.
This document discusses key concepts related to databases and information systems. It defines data, information, and databases. It explains that a database management system (DBMS) stores data in a structured way to facilitate retrieval and use. An information system combines a DBMS with tools for querying, analyzing, and presenting the data. The document outlines advantages of database systems like concurrent access, structured storage, separation of data and applications, and data integrity and persistence. Examples of database applications discussed include banking transactions, timetables, and library catalogs.
The document discusses major issues in data mining including mining methodology, user interaction, performance, and data types. Specifically, it outlines challenges of mining different types of knowledge, interactive mining at multiple levels of abstraction, incorporating background knowledge, visualization of results, handling noisy data, evaluating pattern interestingness, efficiency and scalability of algorithms, parallel and distributed mining, and handling relational and complex data types from heterogeneous databases.
Metadata contains answers to questions about the data in a data warehouse. It is stored in a metadata repository and describes pertinent details about the data to users, developers, and the project team. Metadata is necessary for using, building, and administering the data warehouse as it provides information about data extraction, transformations, structure, refreshment, and more. It serves important roles for both business users and IT staff across the data acquisition, storage, and delivery processes.
This presentation discusses the following topics:
Object Oriented Databases
Object Oriented Data Model (OODM)
Characteristics of Object oriented database
Object, Attributes and Identity
Object oriented methodologies
Benefits of object orientation in programming languages
Object oriented model vs Entity Relationship model
Advantages of OODB over RDBMS
The document discusses metadata in data warehousing and business intelligence contexts. Some key points:
1. Metadata provides information about data in a data warehouse or warehouse components like data marts. It describes data structures, attributes, transformations and more.
2. Metadata is important for tasks like ETL processing, querying, reporting and overall data management. It helps users understand what data is available and how to access and analyze it.
3. There are different types of metadata including technical metadata about data storage and processes, and business metadata that provides business definitions and rules. Maintaining accurate and consistent metadata is vital for a successful data warehouse.
This document discusses techniques for data reduction to reduce the size of large datasets for analysis. It describes five main strategies for data reduction: data cube aggregation, dimensionality reduction, data compression, numerosity reduction, and discretization. Data cube aggregation involves aggregating data at higher conceptual levels, such as aggregating quarterly sales data to annual totals. Dimensionality reduction removes redundant attributes. The document then focuses on attribute subset selection techniques, including stepwise forward selection, stepwise backward elimination, and combinations of the two, to select a minimal set of relevant attributes. Decision trees can also be used for attribute selection by removing attributes not used in the tree.
The document discusses database management systems (DBMS). It explains that a DBMS is software that stores and manages databases to provide benefits like data independence, efficient access, integrity and security. It also discusses key DBMS concepts like data models, schemas, transactions, concurrency control and ensuring atomicity through logging. DB application development and database administration are important roles supported by a DBMS.
The document discusses object-oriented databases and their advantages over traditional relational databases, including their ability to model more complex objects and data types. It covers fundamental concepts of object-oriented data models like classes, objects, inheritance, encapsulation, and polymorphism. Examples are provided to illustrate object identity, object structure using type constructors, and how an object-oriented model can represent relational data.
This document provides an introduction and overview of databases and the basic operations used to manage data in a database using Microsoft Access 2007. It defines what a database is, how data is organized in tables with rows and columns, and when it is appropriate to use a database. It also outlines and provides examples of the basic CRUD (create, read, update, delete) operations used in structured query language (SQL) to manipulate data, including inserting, selecting, updating, and deleting records from database tables.
This document provides an introduction to database concepts. It discusses the advantages of a database system compared to file processing, including reduced data redundancy, controlled inconsistency, shared data, standardized data, secured data, and integrated data. It also describes three levels of abstraction in a database - the physical level, conceptual level, and external or view level. Additionally, it covers database models including the relational, network, and hierarchical models as well as key database concepts such as primary keys, foreign keys, candidate keys, and alternate keys.
Data mining involves multiple steps in the knowledge discovery process including data cleaning, integration, selection, transformation, mining, and pattern evaluation. It has various functionalities including descriptive mining to characterize data, predictive mining for inference, and different mining techniques like classification, association analysis, clustering, and outlier analysis.
The document discusses database integration, which involves combining multiple existing databases with different schemas (called local conceptual schemas or LCSs) into a single integrated schema (called a global conceptual schema or GCS). It covers topics such as schema matching to find relationships between elements in different LCSs, schema mapping to translate between LCSs and the GCS, and methods for generating the GCS by combining parts of the LCSs. The goal is to enable queries and applications to interact with the distributed databases through a unified interface via the GCS.
This document discusses various machine learning techniques for classification and prediction. It covers decision tree induction, tree pruning, Bayesian classification, Bayesian belief networks, backpropagation, association rule mining, and ensemble methods like bagging and boosting. Classification involves predicting categorical labels while prediction predicts continuous values. Key steps for preparing data include cleaning, transformation, and comparing different methods based on accuracy, speed, robustness, scalability, and interpretability.
This presentation gives an idea about data preprocessing in the field of data mining. Images, examples and other material are adapted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber and Jian Pei.
The document discusses different database models including hierarchical, network, relational, entity-relationship, object-oriented, object-relational, and semi-structured models. It provides details on the characteristics, structures, advantages and disadvantages of each model. It also includes examples and diagrams to illustrate concepts like hierarchical structure, network structure, relational schema, entity relationship diagrams, object oriented diagrams, and XML schema. The document appears to be teaching materials for a database management course that provides an overview of various database models.
This document discusses different methods for organizing and indexing data stored on disk in a database management system (DBMS). It covers unordered or heap files, ordered or sequential files, and hash files as methods for physically arranging records on disk. It also discusses various indexing techniques like primary indexes, secondary indexes, dense vs sparse indexes, and multi-level indexes like B-trees and B+-trees that provide efficient access to records. The goal of file organization and indexing in a DBMS is to optimize performance for operations like inserting, searching, updating and deleting records from disk files.
The document discusses different levels of coupling between data mining (DM) systems and database/data warehouse (DB/DW) systems. It defines:
1) No coupling as DM systems operating independently without utilizing any DB/DW functions.
2) Loose coupling as DM systems fetching data from and storing results in DB/DW systems.
3) Semi-tight coupling as DM systems linking to and using efficient implementations of some DM functions within DB/DW systems.
4) Tight coupling as DM systems being fully integrated with and optimized based on the query processing and data structures of DB/DW systems.
This document discusses data mining and different types of data mining techniques. It defines data mining as the process of analyzing large amounts of data to discover patterns and relationships. The document describes predictive data mining, which makes predictions based on historical data, and descriptive data mining, which identifies patterns and relationships. It also discusses classification, clustering, time-series analysis, and data summarization as specific data mining techniques.
This document provides an overview and introduction to a lecture on database management systems (DBMS). It discusses how companies are increasingly data-driven and how this class will teach the basics of using and managing data. The lecture will cover the motivation for studying DBMS, an overview of the subject, and course logistics. The goal is for students to understand fundamental database concepts and be able to design, query, and build applications with databases.
The document discusses transaction states, ACID properties, and concurrency control in databases. It describes the different states a transaction can be in, including active, partially committed, committed, failed, and terminated. It then explains the four ACID properties of atomicity, consistency, isolation, and durability. Finally, it discusses the need for concurrency control and some problems that can occur without it, such as lost updates, dirty reads, incorrect summaries, and unrepeatable reads.
This document discusses distributed databases and distributed database management systems (DDBMS). It defines a distributed database as a logically interrelated collection of shared data physically distributed over a computer network. A DDBMS is software that manages the distributed database and makes the distribution transparent to users. The document outlines key concepts of distributed databases including data fragmentation, allocation, and replication across multiple database sites connected by a network. It also discusses reference architectures, components, design considerations, and types of transparency provided by DDBMS.
The document discusses databases and database management systems. It provides examples of common database applications like banking, universities, sales, and airlines. It defines what a database is, the role of a database management system, and examples of DBMS software. It also compares the advantages and disadvantages of using a database system versus a traditional file system to store data. Key benefits of a DBMS include supporting complex queries, controlling redundancy and consistency, handling concurrent access from multiple users, and providing security and data recovery.
The document is a presentation on IBM's DB2 database software. It contains:
1) An overview of the DB2 product family and add-on products.
2) Descriptions of DB2 administrative programs, table spaces, constraints, and data types.
3) Explanations of instances and databases, controlling authorities, and the development center.
4) Details about the backup wizard, failure detection and recovery, and a comparison to Oracle's database software.
Lecture 4: Big data technology foundations by hktripathy
The document discusses big data architecture and its components. It explains that big data architecture is needed when analyzing large datasets over 100GB in size or when processing massive amounts of structured and unstructured data from multiple sources. The architecture consists of several layers including data sources, ingestion, storage, physical infrastructure, platform management, processing, query, security, monitoring, analytics and visualization. It provides details on each layer and their functions in ingesting, storing, processing and analyzing large volumes of diverse data.
The document discusses key concepts from Chapter 2 on database environments, including:
1) It describes the ANSI-SPARC three-level architecture for database systems, which separates data into external, conceptual, and internal levels.
2) It explains the roles of various users in a database environment like data administrators, database administrators, and end users.
3) It provides an overview of database languages, data models, and the functions of a database management system.
This document provides an overview of object-oriented databases. It introduces object-oriented programming concepts like encapsulation, polymorphism and inheritance. It then discusses how object-oriented databases combine these concepts with database principles like ACID properties. Advantages include being integrated with programming languages and automatic method storage. Disadvantages include requiring object-oriented programming and high costs to convert data. The document also discusses the Object Query Language and provides an example query in OQL.
This document provides an overview of data mining techniques and concepts. It defines data mining as the process of discovering interesting patterns and knowledge from large amounts of data. The key steps involved are data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Common data mining techniques include classification, clustering, association rule mining, and anomaly detection. The document also discusses data sources, major applications of data mining, and challenges.
This document provides an overview of web mining. It defines web mining as using data mining techniques to automatically discover and extract information from web documents and services. It discusses the differences between web mining and data mining, and covers the main topics in web mining including web graph analysis, structured data extraction, and web advertising. It also describes the different approaches of web content mining, web structure mining, and web usage mining.
This document provides an overview of web mining, which uses data mining techniques to automatically discover and extract information from web documents and services. It discusses the differences between web mining and traditional data mining, and covers various topics in web mining including web content mining, web structure mining, and web usage mining. The document also examines issues around the large scale of web data and approaches for analyzing it at scale across distributed systems.
A brief description of the three mining techniques, the differences and similarities between them, and, finally, the techniques they share.
The document discusses databases and data warehouses. It explains the differences between traditional file organization and database management. Relational and object-oriented database models are used to construct and manipulate databases. Data modeling creates a conceptual design for databases. Data is extracted from transactional databases and transformed for loading into data warehouses to support analysis and decision making.
Data warehousing is an architectural model that gathers data from various sources into a single unified data model for analysis purposes. It consists of extracting data from operational systems, transforming it, and loading it into a database optimized for querying and analysis. This allows organizations to integrate data from different sources, provide historical views of data, and perform flexible analysis without impacting transaction systems. While implementation and maintenance of a data warehouse requires significant costs, the benefits include a single access point for all organizational data and optimized systems for analysis and decision making.
Various Applications of Data Warehouse.ppt by RafiulHasan19
The document discusses various applications of data warehousing. It begins by describing problems with traditional transactional systems and how data warehouses address these issues. It then defines key components of a data warehouse including the extraction, transformation, and loading of data from various sources. The document outlines how online analytical processing (OLAP) tools, metadata repositories, and data mining techniques analyze and explore the collected data. Finally, it weighs the benefits of a data warehouse against the costs of implementation and maintenance.
Data mining involves extracting useful patterns from large amounts of data. It defines the process of data mining which includes problem definition, data gathering and preparation, model building and evaluation, and knowledge deployment. The document also discusses why data mining is used, the types of data it can be applied to, and some common applications. It provides an overview of popular data mining tools and techniques such as association, classification, clustering, prediction, and decision trees.
Modern databases can be categorized as memory based distributed transactional databases, column stores, NoSQL distributed document stores, NoSQL distributed key-value stores, NoSQL distributed data stores using Apache Lucene, distributed data stores supporting ACID transactions, and graph databases. Each has advantages for different data and query requirements regarding performance, scalability, data structure, and transaction support. The document provides examples of databases for each category.
- Data warehousing aims to help knowledge workers make better decisions by integrating data from multiple sources and providing historical and aggregated data views. It separates analytical processing from operational processing for improved performance.
- A data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data to support analysis. It is maintained separately from operational databases. Common schemas include star schemas and snowflake schemas.
- Online analytical processing (OLAP) supports ad-hoc querying of data warehouses for analysis. It uses multidimensional views of aggregated measures and dimensions. Relational and multidimensional OLAP are common architectures. Measures are metrics like sales, and dimensions provide context like products and time periods.
This document discusses key concepts related to databases and business intelligence. It defines common terms like databases, records, fields, and entities. It explains how relational database management systems (RDBMS) represent data in tables and allow querying, manipulation, and reporting of data through SQL. It also discusses data warehousing, analytics tools, data mining, and ensuring high quality data. The goal is to provide organizations with tools and technologies to access information from databases and improve business performance.
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
A data warehouse is a collection of data integrated from multiple sources to support decision making. It contains subject-oriented, integrated, time-variant, and non-volatile data stored in a way that makes it readily available for analysis. Data marts can be dependent on the warehouse or independent subsets designed for specific departments. Successful implementation requires identifying data sources and governance, planning data quality and modeling, selecting ETL and database tools, and supporting end users. Key challenges include unrealistic expectations, technical issues, and ensuring ongoing value.
Colorado Springs Open Source Hadoop/MySQL by David Smelker
This document discusses MySQL and Hadoop integration. It covers structured versus unstructured data and the capabilities and limitations of relational databases, NoSQL, and Hadoop. It also describes several tools for integrating MySQL and Hadoop, including Sqoop for data transfers, MySQL Applier for streaming changes to Hadoop, and MySQL NoSQL interfaces. The document outlines the typical life cycle of big data with MySQL playing a role in data acquisition, organization, analysis, and decisions.
Combining Data Mining and Machine Learning for Effective User Profiling by CodePolitan
This slide presentation was delivered by Anne Regina at the Seminar & Workshop on the Introduction & Potential of Big Data & Machine Learning, organized by KUDIO on 14 May 2016.
This document provides an overview of data warehousing, including its definition, types, components, architecture, database design, OLAP, and metadata repository. It discusses the differences between OLTP and data warehousing systems and describes the key steps in building a data warehouse, including data extraction, transformation, loading, storage, analysis, delivery of information to users, and ongoing management of the data warehouse system.
4. • KDD Process Steps
• 1) Data Cleaning
• 2) Data Integration
• 3) Data Selection
• 4) Data Transformation
• 5) Data Mining
• 6) Pattern Evaluation
• 7) Knowledge Presentation
5. • KDD Process Steps
• 1) Data Cleaning – remove noise and inconsistent data
• 2) Data Integration – combine multiple data sources
• 3) Data Selection – select the data relevant to the analysis
• 4) Data Transformation – convert the data into a form appropriate for mining
• 5) Data Mining – apply intelligent methods to extract data patterns
• 6) Pattern Evaluation – select the patterns that truly represent knowledge
• 7) Knowledge Presentation – present the mined knowledge using different visualization techniques
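Purely as an illustration of how these seven steps chain together (not part of the original deck), here is a toy end-to-end sketch in Python; every record, field name and threshold is invented for the example.

```python
# A minimal, self-contained sketch of the KDD steps on toy sales records.
# All field names and thresholds here are invented for illustration.
raw = [
    {"item": "milk", "qty": 2, "price": 1.5},
    {"item": "bread", "qty": -1, "price": 1.0},   # inconsistent record (noise)
    {"item": "milk", "qty": 1, "price": None},    # missing value (noise)
    {"item": "bread", "qty": 3, "price": 1.0},
]

# 1) Data cleaning: drop noisy / inconsistent records.
cleaned = [r for r in raw if r["price"] is not None and r["qty"] > 0]

# 2) Data integration: combine a second source with the first.
other_source = [{"item": "milk", "qty": 4, "price": 1.5}]
integrated = cleaned + other_source

# 3) Data selection: keep only the attributes relevant to the analysis.
selected = [{"item": r["item"], "qty": r["qty"]} for r in integrated]

# 4) Data transformation: convert into a form suited to mining (totals per item).
totals = {}
for r in selected:
    totals[r["item"]] = totals.get(r["item"], 0) + r["qty"]

# 5) Data mining: extract a simple pattern (the best-selling item).
best_item = max(totals, key=totals.get)

# 6) Pattern evaluation: keep the pattern only if it is "interesting" enough.
interesting = totals[best_item] >= 5

# 7) Knowledge presentation: report the result to the user.
if interesting:
    print(f"Pattern: '{best_item}' is the top seller ({totals[best_item]} units).")
```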
6. • Data mining is a step in the knowledge discovery process
7. • Architecture of a data mining system
8. • Architecture of a data mining system
• Components are:
• Database, data warehouse, World Wide Web, or other information repository
• - data cleaning and integration techniques may be performed on the data
• Database or data warehouse server
• - responsible for fetching the needed data
9. • Architecture of a data mining system
• Knowledge base
• - used to guide the search
• Data mining engine
• - performs tasks such as characterization, association and correlation analysis, classification, etc.
• Pattern evaluation module
• - selects the needed (interesting) patterns
• User interface
• - handles communication with the user
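Purely as a sketch (the deck only names these components), one possible way to wire them together in code is shown below; every class, method and value is invented for illustration.

```python
# An invented, minimal wiring of the components named above.
class KnowledgeBase:
    """Domain knowledge used to guide the search (here: a minimum threshold)."""
    min_count = 2

class DatabaseServer:
    """Responsible for fetching the needed data from the repository."""
    def fetch(self):
        return ["milk", "bread", "milk", "butter", "milk"]

class MiningEngine:
    """Performs a mining task (here: a trivial frequency characterization)."""
    def mine(self, records):
        counts = {}
        for r in records:
            counts[r] = counts.get(r, 0) + 1
        return counts

class PatternEvaluator:
    """Selects the interesting patterns, guided by the knowledge base."""
    def select(self, patterns, kb):
        return {k: v for k, v in patterns.items() if v >= kb.min_count}

# User interface: the point where results are communicated to the user.
patterns = MiningEngine().mine(DatabaseServer().fetch())
print(PatternEvaluator().select(patterns, KnowledgeBase()))  # {'milk': 3}
```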
10. • Data Mining – On What Kinds of Data
• Data mining deals with a number of different data repositories on which mining can be performed.
• It is applicable to any kind of repository, as well as to data streams.
• Data repositories include:
• Relational databases
• Data warehouses
• Transactional databases
• Advanced database systems
• Flat files
• Data streams
• WWW
11. • Advanced database systems include:
• Object relational databases
• Temporal, sequence and time-series databases
• Spatial databases
• Multimedia databases
12. • Relational Databases
• DBMS – a collection of interrelated data plus a set of software programs to access and manage that data
• Relational database – a collection of tables, each of which is assigned a unique name
• Each table consists of a set of attributes and stores a large set of tuples
• A tuple represents an object, identified by a unique key and described by a set of attribute values
13. • Relational Databases
• Relational data can be accessed through a relational query language such as SQL, or with the assistance of a GUI.
• A given query is transformed into relational operations such as join, selection and projection (see the sketch below).
• Data mining in a relational database searches for data patterns. Example: predicting the credit risk of new customers based on the data available in the database.
• Relational DBs are the most commonly available and are rich information repositories.
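To make the query-to-operations point concrete, here is a small sketch using Python's built-in sqlite3 module; the table and column names are invented for the example.

```python
import sqlite3

# Invented toy schema: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha', 'Kochi'), (2, 'Ben', 'Delhi');
    INSERT INTO orders    VALUES (10, 1, 250.0), (11, 1, 90.0), (12, 2, 40.0);
""")

# One SQL query combining the three relational operations named above:
# join (customers x orders), selection (amount > 50), projection (name, amount).
rows = conn.execute("""
    SELECT c.name, o.amount          -- projection
    FROM customers c JOIN orders o   -- join
      ON c.id = o.customer_id
    WHERE o.amount > 50              -- selection
""").fetchall()
print(rows)  # [('Asha', 250.0), ('Asha', 90.0)]
```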
14. • Data Warehouse
• A repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
• Constructed via data cleaning, integration, transformation, loading and periodic data refreshing.
• Rather than storing full detail, a data warehouse may store a summary of the data from a historical perspective.
15. • Data Warehouse
• Modeled by a multidimensional database structure, usually a multidimensional data cube. Dimension – an attribute or a set of attributes in the schema. Cell – an aggregate measure.
• Data mart – a department-level subset of a data warehouse that focuses on selected subjects.
• OLAP operations – roll-up and drill-down (see the sketch below).
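As a hedged illustration of roll-up and drill-down (assuming pandas is available; the sales figures are invented):

```python
import pandas as pd

# Invented quarterly sales, one row per (year, quarter, branch).
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2"],
    "branch":  ["A", "A", "B", "B", "A", "B"],
    "amount":  [100, 120, 80, 90, 110, 95],
})

# Roll-up: climb the time hierarchy from quarter to year (aggregate up).
by_year = sales.groupby("year")["amount"].sum()
print(by_year)            # 2023 -> 390, 2024 -> 205

# Drill-down: descend again to finer detail (per year and quarter).
by_quarter = sales.groupby(["year", "quarter"])["amount"].sum()
print(by_quarter)
```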
16. • Typical framework of a data warehouse
17. • Multidimensional data cube
18. • Transactional Database
• Consists of a file where each record represents a transaction.
• Each record includes a unique transaction identity number and the list of items making up the transaction.
• Example: a transactional database for sales can answer “Which items sold well together?” – data mining on transactional data identifies frequent item sets easily (see the sketch below).
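A minimal sketch (with invented transactions) of counting which item pairs appear together frequently:

```python
from collections import Counter
from itertools import combinations

# Invented transactions: (transaction id, list of items).
transactions = [
    ("T100", ["milk", "bread", "butter"]),
    ("T200", ["milk", "bread"]),
    ("T300", ["bread", "jam"]),
    ("T400", ["milk", "bread", "jam"]),
]

# Count every unordered pair of items that co-occurs in a transaction.
pair_counts = Counter()
for _, items in transactions:
    pair_counts.update(combinations(sorted(items), 2))

# Pairs appearing in at least 3 of the 4 transactions "sold well together".
frequent = {p: c for p, c in pair_counts.items() if c >= 3}
print(frequent)  # {('bread', 'milk'): 3}
```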
19. • Advanced Data and Information Systems and Advanced Applications
• Object relational databases
• Temporal databases, sequence databases and time-series databases
• Spatial databases and spatio-temporal databases
• Text databases and multimedia databases
• Heterogeneous databases and legacy databases
• Data streams
• WWW
20. • Advanced Data and Information Systems and Advanced Applications
• Object Relational Databases
• Handle complex objects
• Each entity is considered an object – individual items, employees, etc.
• Data and code relating to an object are encapsulated into a single unit
• Each object has:
• A set of variables – attributes
• A set of messages – used to communicate with other objects
• A set of methods – hold the code that implements the messages
• Object class – objects that share a common set of properties
• Each object is an instance of a class (see the sketch below).
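For illustration (all names invented), a small Python class showing variables (attributes), a method, and objects as instances of a class:

```python
class Employee:
    """An invented object class: instances share the same set of properties."""

    def __init__(self, name, salary):
        # Variables (attributes) encapsulated with the object's code.
        self.name = name
        self.salary = salary

    def give_raise(self, amount):
        # A method: the code that implements the "give_raise" message.
        self.salary += amount

# Each object is an instance of the class.
e1 = Employee("Asha", 50_000)
e2 = Employee("Ben", 45_000)
e1.give_raise(5_000)           # sending a message to object e1
print(e1.name, e1.salary)      # Asha 55000
```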
21. • Advanced Data and Information Systems and Advanced Applications
• Temporal Databases, Sequence Databases and Time-Series Databases
• Temporal databases handle data involving time – they store relational data that include time-related attributes.
• Sequence databases store sequences of ordered events, with or without a concrete notion of time. Example: customer shopping sequences.
• Time-series databases store sequences of values or events obtained over repeated measurements of time. Example: data collected from the stock exchange.
• Data mining techniques can be used to find the trends of change for objects in the database (see the sketch below).
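One simple, hedged illustration of finding a trend in time-series values (the prices are invented): a moving average smooths the series so the direction of change stands out.

```python
# Invented daily closing prices from a stock exchange feed.
prices = [100, 102, 101, 105, 107, 106, 110, 113]

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

ma = moving_average(prices, 3)
print([round(v, 1) for v in ma])   # [101.0, 102.7, 104.3, 106.0, 107.7, 109.7]

# A rising moving average indicates an upward trend of change.
print("upward trend" if ma[-1] > ma[0] else "no upward trend")
```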
22. • Advanced Data and Information Systems and Advanced Applications
• Spatial Databases and Spatio-Temporal Databases
• A spatial database contains objects defined in a geometric space. Examples: maps, CAD databases.
• Using data mining, the relationships among a set of spatial objects can be examined.
• Spatio-temporal databases – spatial DBs that store spatial objects that change with time. Example: tracking of moving vehicles.
23. • Advanced Data and Information Systems and Advanced Applications
• Text Databases and Multimedia Databases
• Text databases contain word descriptions of objects – long sequences of sentences or paragraphs. Example: product specifications.
• Text databases may be highly unstructured (web pages on the WWW), semi-structured (e-mail) or well structured.
• By mining text data we can uncover general and concise descriptions of the text documents, keywords, etc. (see the sketch after this slide).
• Multimedia databases store image, audio and video data, and must support large objects.
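As an invented toy example of uncovering keywords from text, a plain frequency count after removing common stop words:

```python
from collections import Counter
import re

# Invented product-specification snippets (a tiny "text database").
docs = [
    "The camera has a wide lens and a fast shutter.",
    "A fast camera with a bright lens and long battery life.",
]

STOP = {"the", "a", "and", "has", "with", "of"}

words = []
for doc in docs:
    # Lowercase, keep alphabetic tokens, drop stop words.
    words += [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOP]

# The most frequent remaining words serve as naive keywords.
print(Counter(words).most_common(3))  # [('camera', 2), ('lens', 2), ('fast', 2)]
```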
24. • Advanced Data and Information Systems and Advanced Applications
• Heterogeneous Databases and Legacy Databases
• Heterogeneous databases consist of a set of interconnected component databases in which the objects differ greatly.
• A legacy database is a group of heterogeneous databases.
• Information exchange across these databases is very difficult due to their diverse semantics. Data mining offers a solution by transforming the data into higher, more generalized levels.
25. • Advanced Data and Information Systems and Advanced Applications
• Data Streams
• A new kind of data, in which the data flow in and out of an observation platform dynamically.
• Example: video surveillance.
• Data streams are normally not stored in any kind of repository, which poses challenges to their management and analysis.
• Analysis uses a continuous query model (see the sketch below).
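A hedged sketch of a continuous query: the stream is consumed once, element by element, while a standing query keeps a running answer; the readings and names are invented.

```python
import itertools
import random

def sensor_stream():
    """An invented unbounded stream of sensor readings."""
    rng = random.Random(42)
    while True:
        yield rng.uniform(0.0, 100.0)

# Continuous query: "report the running average so far" – the data are
# never stored; only a constant-size summary (count, total) is kept.
count, total = 0, 0.0
for reading in itertools.islice(sensor_stream(), 5):  # take 5 for the demo
    count += 1
    total += reading
    print(f"after {count} readings, running average = {total / count:.1f}")
```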
26. • Advanced Data and Information Systems and Advanced Applications
• World Wide Web
• Data objects are linked together to facilitate interactive access.
• The Web is an opportunity as well as a challenge for data mining.
• Web usage mining – capturing user access patterns in a distributed information environment.
• Keyword-based search offers only limited help to users.
• Authoritative web page analysis – ranks web pages based on their importance.
• Automated web page clustering and classification – arrange web pages based on their contents.
• Web community analysis – identifies hidden social networks and communities.
27. • Data Mining Functionalities – What Kinds of Patterns Can Be Mined?
• Functionalities are used to specify the kinds of patterns to be found in data mining tasks.
• Tasks can be classified into two categories:
• Descriptive – deals with the general properties of the data in the database
• Predictive – performs inference on the current data in order to make predictions
28. • Concept/Class Description: Characterization and Discrimination
• Mining Frequent Patterns, Associations and Correlations
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Data Mining Functionalities
29. • Concept/Class Description: Characterization and Discrimination
• Data can be associated with classes or concepts.
• Example:
• classes of items for sale - computers and printers
• concepts of customers - big spenders and budget spenders
• Using precise terms, we can describe individual classes and concepts.
• Such descriptions of a class or a concept are called class/concept descriptions.
• These descriptions can be derived via:
• Data characterization − summarizing the data of the class under study, the target class.
• Data discrimination − comparing the target class with one or a set of comparative classes, the contrasting classes.
• Both of the above methods together (see the sketch below).
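A minimal Python sketch (using pandas) of characterization and discrimination on an invented customer table, where the "big spender" class is an assumption defined by an arbitrary spending threshold:

# Minimal sketch: data characterization and discrimination with pandas.
# The customer table and the "big spender" threshold are invented.
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "b", "c", "d", "e", "f"],
    "age":      [34, 45, 23, 52, 31, 40],
    "spend":    [120, 2400, 90, 3100, 150, 2200],
})
df["class"] = df["spend"].apply(lambda s: "big" if s > 1000 else "budget")

# Characterization: summarize the target class ("big spenders") on its own.
print(df[df["class"] == "big"][["age", "spend"]].mean())

# Discrimination: compare the target class against the contrasting class.
print(df.groupby("class")[["age", "spend"]].mean())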
Data Mining Functionalities
30. • Mining Frequent Patterns
• Patterns that occur frequently in transactional data.
• Frequent itemset − a set of items that frequently appear together, e.g. milk and bread.
• Frequent subsequence − a sequential pattern that occurs frequently, e.g. purchasing a camera is followed by purchasing a memory card.
• Frequent substructure − different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
• Mining frequent patterns leads to the discovery of interesting associations and correlations within the data (see the sketch below).
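A minimal Python sketch of frequent itemset counting over invented market-basket transactions; a real system would use Apriori or FP-growth rather than brute-force enumeration of subsets.

# Minimal sketch of frequent itemset counting over market-basket data.
# Transactions and the support threshold are illustrative.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread", "butter"}, {"milk", "bread"}, {"milk"},
]
min_support = 3  # absolute support threshold

counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
print(frequent)   # {('milk',): 4, ('bread',): 4, ('bread', 'milk'): 3}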
Data Mining Functionalities
31. • Associations and Correlations
• Association rules are of two types:
• Single-dimensional association rules involve a single predicate, e.g. buys(X, "computer") => buys(X, "software").
• Multi-dimensional association rules involve more than one predicate, e.g. age(X, "20..29") AND income(X, "40K..49K") => buys(X, "laptop").
Data Mining Functionalities
32. • Associations and Correlations
• Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
• Confidence indicates the certainty of the rule.
• Support indicates how frequently the items appear together in the database.
• A sketch of both measures follows.
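A short Python sketch computing support and confidence for a hypothetical rule milk => bread over invented transactions, then checking it against assumed thresholds:

# Sketch: support and confidence for the rule milk => bread.
# Transactions and thresholds are illustrative.
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread", "butter"}, {"milk", "bread"}, {"milk"},
]
n = len(transactions)

both = sum(1 for t in transactions if {"milk", "bread"} <= t)
milk = sum(1 for t in transactions if "milk" in t)

support = both / n          # how often milk and bread appear together: 3/5
confidence = both / milk    # how often bread accompanies milk: 3/4

# The rule is kept only if it clears both minimum thresholds.
print(support >= 0.5 and confidence >= 0.7)   # True (0.60 and 0.75)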
Data Mining Functionalities
33. • Classification
• Classification is the process of finding a model that describes data classes or concepts.
• The derived model is based on the analysis of a set of training data whose class labels are known.
• The model is then used to predict the class of objects whose class label is unknown.
• The derived model can be presented in the following forms (see the sketch after this list):
• (IF-THEN) rules
• Decision trees
• Mathematical formulae
• Neural networks
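A minimal classification sketch in Python (requires scikit-learn): a decision tree is learned from invented training data with known labels, then used to predict the class of an unseen object; export_text shows the tree's IF-THEN view.

# Minimal sketch of classification with a decision tree.
# Features and labels are invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# training data: [age, income] -> known class labels
X = [[25, 30], [47, 90], [35, 60], [52, 110], [23, 25], [40, 80]]
y = ["budget", "big", "budget", "big", "budget", "big"]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(model, feature_names=["age", "income"]))  # IF-THEN view
print(model.predict([[30, 95]]))  # class of an object with an unknown label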
Data Mining Functionalities
35. • Prediction
• Prediction models continuous-valued functions.
• It is used to predict missing or unavailable numerical data values rather than class labels.
• Regression analysis is generally used for prediction (see the sketch below).
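A minimal prediction sketch in Python, using numpy's least-squares linear fit as a stand-in for full regression analysis; the data points are invented.

# Sketch: predicting a continuous value via a simple linear regression.
import numpy as np

ad_spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # e.g. in $1000s
sales    = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(ad_spend, sales, deg=1)  # fit y = a*x + b
print(slope * 6.0 + intercept)   # predicted sales for an unseen spend of 6.0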
Data Mining Functionalities
36. • Cluster Analysis
• Cluster analysis examines data objects without consulting known class labels.
• The objects are clustered or grouped based on the principle of "maximizing the intra-class similarity and minimizing the inter-class similarity".
• Within a cluster the data objects have high similarity to one another but are dissimilar to objects in other clusters (see the sketch below).
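A minimal clustering sketch in Python (requires scikit-learn): k-means groups invented, unlabeled points into two clusters without consulting any class labels.

# Sketch of cluster analysis: grouping unlabeled points so that
# intra-cluster similarity is high.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],        # one natural group...
     [10, 2], [10, 4], [10, 0]]     # ...and another, far away

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)          # e.g. [1 1 1 0 0 0] -- no class labels were given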
Data Mining Functionalities
38. • Outlier Analysis
• Outliers are data objects in a database that do not obey the general behavior or model of the data.
• In some applications, such rare events can be more interesting than the regularly occurring ones, e.g. fraud detection; their analysis is called outlier mining (see the sketch below).
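A minimal outlier analysis sketch in Python using a simple z-score rule; the transaction amounts and the threshold of 2 are invented for illustration.

# Sketch of outlier analysis: flag values that deviate strongly from the
# general behavior of the data, here via a z-score rule.
import numpy as np

amounts = np.array([42, 39, 45, 41, 38, 44, 40, 900])   # 900 looks fraudulent
z = (amounts - amounts.mean()) / amounts.std()

print(amounts[np.abs(z) > 2])   # -> [900]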
Data Mining Functionalities
39. • Evolution Analysis
• Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
Data Mining Functionalities
41. • Classification according to the kinds of databases mined
• By data model (relational, transactional, object-relational)
• By type of data (spatial, time-series, text, stream, multimedia, WWW)
• Classification according to the kinds of knowledge mined
• Based on the different data mining functionalities
• According to the level of abstraction of the knowledge mined
• According to the regularity or irregularity of the data that is mined
Data Mining: Classification of Data Mining Systems
42. • Classification according to the kinds of techniques utilized
• By the degree of user interaction involved
• By the methods of data analysis employed (database-oriented, data-warehouse-oriented, etc.)
• Classification according to the applications adapted
• Finance
• Telecommunications
• DNA
Data Mining: Classification of Data Mining Systems
43. • Each user has a data mining task to perform, expressed with the help of a data mining query.
• Such a query is defined in terms of data mining task primitives, which allow users to interact with the data mining system.
• DMQL: Data Mining Query Language
Data Mining Task Primitives
44. • The primitives specify:
• The set of task-relevant data to be mined
• Specifies the portions of the database or the set of data in which the user is interested.
• It includes:
• Database or data warehouse name
• Database tables or data warehouse cubes
• Conditions for data selection
• Relevant attributes or dimensions
• Data grouping criteria
Data Mining Task Primitives
45. • The primitives specify:
• The kind of knowledge to be mined
• Specifies the data mining functions to be performed:
• Characterization
• Discrimination
• Association/correlation
• Classification/prediction
• Clustering
• Outlier or evolution analysis
Data Mining Task Primitives
46. • The primitives specify:
• The background knowledge to be used in the discovery process
• Knowledge about the domain to be mined.
• Guides the knowledge discovery process and the evaluation of the patterns found.
• Includes user beliefs regarding relationships in the data.
Data Mining Task Primitives
47. • The primitives specify:
• The interestingness measures and thresholds for pattern evaluation
• Used to guide the mining process or to evaluate the discovered patterns.
• Different kinds of knowledge have different interestingness measures, e.g.:
• Support
• Confidence
Data Mining Task Primitives
48. • The primitives specify:
• The expected representation for visualizing the discovered patterns
• Refers to the form in which discovered patterns are to be displayed (a combined example follows this list):
• Rules
• Tables
• Charts
• Graphs
• Decision trees
• Cubes
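As an illustration of how all five primitives might come together in one query, here is a DMQL-style sketch. The database, table, attribute, and hierarchy names are hypothetical, the "--" annotations are added for readability, and the syntax follows the general style of DMQL as described in the data mining literature rather than a guaranteed-exact grammar.

use database sales_db                        -- task-relevant data
use hierarchy age_hierarchy for C.age        -- background knowledge
mine characteristics as customerSpending     -- kind of knowledge to be mined
analyze count%
in relevance to C.age, C.income
from customer C
where C.country = "India"
with support threshold = 5%                  -- interestingness measure
display as pie chart                         -- presentation of the patterns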
Data Mining Task Primitives
49. • Integration of Data Mining System with Database or Data
Warehouse System
50. • When a DM system works in an environment, it is required to communicate with other information system components, such as DB and DW systems.
• The different integration schemes are:
• No coupling
• Loose coupling
• Semi-tight coupling
• Tight coupling
Integration of Data Mining System with Database or Data Warehouse System
51. • No coupling
• The DM system does not use any facilities of a DB/DW system.
• It fetches data from a particular source (e.g. a file), processes the data, and stores the results in another file.
• This is the simplest integration scheme.
• Drawbacks:
• Time is wasted preprocessing the data
• Other tools must be used to extract the data
• Poor design
Integration of Data Mining System with Database or Data Warehouse System
52. • Loose coupling
• The data mining system uses some facilities of a DB/DW system.
• It fetches data from a data repository, processes the data, and stores the results in the DB or DW.
• It fetches the data using query processing, indexing, and other DB/DW system facilities (see the sketch below).
• Drawback:
• It is difficult to achieve high scalability and good performance with large data sets.
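A minimal loose-coupling sketch in Python: the stdlib sqlite3 module stands in for the DB system, which handles storage and query processing, while the mining (here, just item counting) happens outside it. The schema and rows are invented.

# Sketch of loose coupling: the DB does selection and query processing;
# the mining code runs outside the DB on the fetched rows.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("a", "milk"), ("a", "bread"), ("b", "milk"), ("c", "milk")],
)

# Use the DB's query facilities to fetch only task-relevant data.
rows = conn.execute("SELECT item FROM purchases").fetchall()
print(Counter(item for (item,) in rows))   # mining results kept outside the DB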
Integration of Data Mining System with Database or Data Warehouse System
53. • Semi-tight coupling
• Essential data mining primitives are provided within the DB/DW system:
• Sorting
• Indexing
• Aggregation
• Histogram analysis
• Pre-computation of statistical measures
• Some frequently used intermediate mining results can also be pre-computed and stored in the DB/DW system.
• This design enhances the performance of a DM system.
Integration of Data Mining System with Database or Data Warehouse System
54. • Tight coupling
• The DM system is smoothly integrated into the DB/DW system.
• The DM system is treated as one functional component of an information system.
• Data mining queries and functions are optimized based on the different methods of the DB/DW system.
Integration of Data Mining System with Database or Data Warehouse System
55. • Data mining is not an easy task:
• The algorithms it uses can be very complex, and the data is not always available in one place.
• Data often needs to be integrated from various heterogeneous data sources.
• The common issues are:
• Mining methodology and user interaction issues
• Performance issues
• Issues related to the diversity of database types
Issues in Data Mining
56. • Mining different kinds of knowledge in databases
• Different users may be interested in different kinds of knowledge, so a system should cover a broad range of knowledge discovery tasks (classification, clustering, etc.).
• The same database may be used in different ways.
• Interactive mining of knowledge at multiple levels of abstraction
• The data mining process needs to be interactive, allowing users to focus the search for patterns and to provide and refine data mining requests based on the returned results.
• This enables the user to view the data from different angles and levels of abstraction.
Issues in Data Mining: Mining Methodology and User Interaction Issues
57. • Incorporation of background knowledge (knowledge about the domain under study)
• Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining
• A data mining query language that allows the user to describe ad hoc mining tasks should be developed.
• Such languages should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.
Issues in Data Mining: Mining Methodology and User Interaction Issues
58. • Presentation and visualization of data mining results
• Once patterns are discovered, they need to be expressed in high-level languages and visual representations.
• These representations should be easily understandable.
• Handling noisy and incomplete data
• Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities.
• Without such methods, the accuracy of the discovered patterns will be poor.
Issues in Data Mining: Mining Methodology and User Interaction Issues
59. • Pattern evaluation
• The patterns discovered may be uninteresting because they represent common knowledge or lack novelty.
• To guide the discovery process and reduce the search space, interestingness measures or user-specified constraints should be applied.
Issues in Data Mining: Mining Methodology and User Interaction Issues
60. • Efficiency and scalability of data mining algorithms
• To effectively extract information from the huge amounts of data in databases, algorithms must be efficient and scalable, and their running time must be predictable.
• Parallel, distributed, and incremental mining algorithms
• These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged (see the sketch below).
• Incremental algorithms incorporate database updates without mining the entire data again from scratch.
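A minimal Python sketch of partition-based parallel mining using only the standard library: each partition is counted by a worker process, and the partial counts are merged. The transactions are invented.

# Sketch of parallel, partition-based mining: split the transactions into
# partitions, count items per partition in parallel, then merge the counts.
from multiprocessing import Pool
from collections import Counter

def count_partition(partition):
    """Local, per-partition item counts -- the parallel phase."""
    c = Counter()
    for transaction in partition:
        c.update(transaction)
    return c

if __name__ == "__main__":
    transactions = [["milk", "bread"], ["milk"], ["bread", "butter"],
                    ["milk", "bread"], ["butter"], ["milk", "bread"]]
    partitions = [transactions[0:3], transactions[3:6]]

    with Pool(2) as pool:
        partials = pool.map(count_partition, partitions)

    merged = sum(partials, Counter())      # the merge phase
    print(merged)   # Counter({'milk': 4, 'bread': 4, 'butter': 2})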
Issues in Data Mining: Performance Issues
61. • Handling of relational and complex types of data
• A database may contain complex data objects, multimedia objects, spatial data, temporal data, etc.
• It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems
• The data is available at different data sources on a LAN or WAN.
• These data sources may be structured, semi-structured, or unstructured.
• Mining knowledge from them therefore adds challenges to data mining.
Issues in Data Mining: Issues Relating to the Diversity of Database Types