Advanced Data Mining
Lec-4: Data Mining Primitives, Languages & Systems
[Class Presentation]
Presented by
Niloy Sikder
ID: MSc 190221
CSE Discipline
Khulna University, Khulna
2. Mar 6, 2019 CSE, KU 1
Presentation Outline
What are the Primitives of Data Mining?
• Task-relevant data
• Data Warehouse
• Data Cube
• Drill-down & Roll-up
• Data Selection
• Data Filtering
• Data Slicing
• Data Pivoting
• Dicing
• Data Grouping
• Clustering
• Clustering Methods
• Knowledge type to be mined
• Data Characterization
• Statistical Measures
• AOI
• Data Discrimination
• Associations and Correlations
• Classification
• Classification methods
• Prediction
• Background knowledge
• Concept Hierarchies
System architectures of data mining
• Data Mining System Architecture
• Types of Data Mining Architectures
Languages of data mining
• DMQL
• OLE DB
• Pattern interestingness measures
• Visualization of discovered patterns
What are the Primitives of Data Mining?
The set of task-relevant data to be mined
The kind of knowledge to be mined
The background knowledge
Interestingness measures and thresholds for pattern evaluation
The expected representation for visualizing the discovered patterns
The First Primitive of Data Mining: Task-relevant Data
Portions of the database or the set of data in which the user is interested.
Fig. 1: Task-relevant data for specifying a data mining task
Task-relevant Data: Data Warehouse
A data warehouse is a repository of information collected from multiple sources
Fig. 2: Typical framework of a data warehouse for AllElectronics.
Usually resides at a single site
Constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing
Task-relevant Data: Data Cube
A multidimensional data structure inside a data warehouse
Fig. 3: Summarized data for AllElectronics.
Each dimension corresponds to an attribute
Each cell stores the value of some aggregate measure
Data Cube: Drill-down & Roll-up
A presentation of data at different levels of abstraction
Fig. 3: Summarized data resulting from drill-down and roll-up operations on the cube.
Allow the user to view the data at differing degrees of summarization
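The cube operations above can be sketched in plain Python. The cube contents and the `roll_up` helper below are hypothetical toy values, loosely patterned on the AllElectronics example in Ref [1]; a roll-up is simply an aggregation over the dimensions being climbed away:

```python
from collections import defaultdict

# Toy data cube as a dict: (city, quarter, item) -> sales (in $1000s).
# The cities, quarters, and figures are illustrative, not real data.
cube = {
    ("Vancouver", "Q1", "security"): 400,
    ("Vancouver", "Q1", "computer"): 1000,
    ("Vancouver", "Q2", "security"): 350,
    ("Chicago",   "Q1", "security"): 260,
}

def roll_up(cube, keep):
    """Aggregate away every dimension except those whose indices are in
    `keep`, producing a summary at a higher level of abstraction."""
    summary = defaultdict(int)
    for dims, sales in cube.items():
        summary[tuple(dims[i] for i in keep)] += sales
    return dict(summary)

# Roll up from (city, quarter, item) to (city, quarter):
by_city_quarter = roll_up(cube, keep=(0, 1))
print(by_city_quarter[("Vancouver", "Q1")])  # 1400
```

Drilling down is the inverse: returning from the summarized view to the more detailed cells stored in `cube`.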
Task-relevant Data: Data Selection
The process of retrieving the data relevant to the analysis task from the database
Data can be specified by condition-based filtering, or by slicing, pivoting, or
dicing a data cube
Data Selection: Data Filtering
Selective presentation or deliberate transformation of information to make it
more suitable for the mining model
Reduces noise and errors in the raw data
DSP (digital signal processing) – low-pass, high-pass, band-pass, notch, and comb filters; cut-off frequency
DIP (digital image processing) – convolution, Gaussian, bilateral, adaptive, and Coye filters
Database – Various SQL filters
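To make the noise-reduction idea concrete, a moving average is a basic low-pass filter: each output sample is the mean of a small sliding window, which suppresses rapid fluctuations. The function and toy signal below are purely illustrative:

```python
def moving_average(signal, window=3):
    """A basic low-pass filter: each output sample is the mean of a sliding
    window centered on the input sample, smoothing out rapid noise."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

noisy = [1.0, 9.0, 1.0, 9.0, 1.0]   # an alternating, "noisy" toy signal
smooth = moving_average(noisy)
print(smooth)  # much flatter than the input
```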
Data Selection: Data Filtering (cont.)
Grafil (Graph Similarity Filtering) was developed to filter graphs
efficiently in large-scale graph databases
Data Selection: Data Slicing
Selecting a group of cells from the entire multidimensional array by
specifying a specific value for one or more dimensions
Data Selection: Data Pivoting
Aggregating over all dimensions except two
Results in a two-dimensional cross-tabulation, reducing the cube's dimensionality
Data Selection: Dicing
Selecting a subset of cells by specifying a range of attribute values
Equivalent to defining a sub-array from the complete array
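With a cube stored as a dict of coordinate tuples (an entirely hypothetical toy cube, as before), slicing and dicing reduce to filtering the cell keys:

```python
# Toy cube: (city, quarter, item) -> sales; all values are invented.
cube = {
    ("Vancouver", "Q1", "security"): 400,
    ("Vancouver", "Q2", "security"): 350,
    ("Chicago",   "Q1", "security"): 260,
    ("Chicago",   "Q2", "computer"): 900,
}

def slice_cube(cube, dim, value):
    """Slice: fix one dimension (by index) to a single value."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice_cube(cube, allowed):
    """Dice: keep cells whose coordinate in each listed dimension falls in
    an allowed set of values, i.e. select a sub-array of the cube."""
    return {k: v for k, v in cube.items()
            if all(k[d] in vals for d, vals in allowed.items())}

q1_cells = slice_cube(cube, dim=1, value="Q1")                  # all Q1 cells
sub_cube = dice_cube(cube, {0: {"Vancouver"}, 1: {"Q1", "Q2"}}) # sub-array
```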
Curse of Dimensionality
Dimensionality of a data set is the number of attributes that the objects in
the data set possess
Difficult to analyze and visualize high-dimensional data
Data becomes increasingly sparse in the space that it occupies
Clustering high-dimensional data is challenging
Not all dimensions may be relevant
Increases computational complexity
Requires more processing power & time
Task-relevant Data: Data Grouping
Clustering is the process of grouping the data into classes or clusters
Objects within a cluster have high similarity in comparison to one another
but are very dissimilar to objects in other clusters
Can also be used for outlier detection
Data Grouping: Clustering
Typical requirements of clustering in data mining:
Scalability
Ability to deal with different types of attributes/ data types
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Incremental clustering and insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
Data Grouping: Clustering Methods
Partitioning methods:
k-Means Method
k-Medoids Method
CLARANS (for large databases)
Hierarchical methods:
Agglomerative and divisive hierarchical clustering
BIRCH
ROCK
Chameleon
Density-based methods:
DBSCAN
OPTICS
DENCLUE
Grid-based methods:
STING
WaveCluster
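Of the partitioning methods, k-means is the simplest. The sketch below is a minimal one-dimensional version (illustrative only, on invented data; real implementations handle multiple dimensions, convergence tests, and better initialization):

```python
import random

def k_means(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster; repeat."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]   # two obvious 1-D clusters
print(k_means(data, k=2))  # approximately [1.0, 10.0]
```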
The Second Primitive of Data Mining: Knowledge Types
Important to specify the kind of knowledge to be mined, as this determines
the data mining function to be performed
User can be more specific and provide pattern templates (metarules or
metaqueries) that all discovered patterns must match
Knowledge Types: Data Characterization
A summary of the general characteristics or features of a target class of data
Summarizes data by replacing relatively low-level values (e.g., numeric values
for age) with higher-level concepts (e.g., young, middle-aged, and senior)
Several methods for effective data characterization:
Statistical measures
Attribute-oriented induction (AOI)
Output can be presented in pie charts, bar charts, curves, multidimensional
data cubes, and multidimensional tables
Data Characterization: Statistical Measures
Central tendency of data – mean, weighted mean, median, mode
Dispersion of data – range, quartiles, variance, standard deviation
Graphical representations – histograms, boxplots, quantile plots,
quantile–quantile (q–q) plots, scatter plots, and scatter-plot matrices
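These measures are all available in Python's standard `statistics` module. The salary list below (in $1000s) is a toy data set patterned on the example in Ref [1]:

```python
import statistics

# Toy salary list (in $1000s), patterned on the example in Ref [1].
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)                  # central tendency
median = statistics.median(data)
modes = statistics.multimode(data)            # this data set is bimodal
data_range = max(data) - min(data)            # dispersion
stdev = statistics.pstdev(data)               # population std. deviation
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles; IQR = q3 - q1
print(mean, median, modes)  # 58 54.0 [52, 70]
```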
Data Characterization: AOI
First collects the task-relevant data using a database query
Then performs generalization based on the examination of the number of
distinct values of each attribute in the relevant set of data
Performed through either attribute removal or attribute generalization
Aggregation is performed by merging identical generalized tuples and
accumulating their respective counts
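The AOI steps above can be sketched as follows. The relation, the one-level concept hierarchies, and the threshold are all invented for illustration; real AOI uses multi-level hierarchies and per-attribute generalization thresholds:

```python
from collections import Counter

def aoi_generalize(rows, hierarchies, threshold=3):
    """AOI sketch: drop an attribute when it has too many distinct values
    and no concept hierarchy; climb the hierarchy when one exists; then
    merge identical generalized tuples, accumulating counts."""
    plan = []
    for attr in rows[0]:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            plan.append((attr, None))               # retain as-is
        elif attr in hierarchies:
            plan.append((attr, hierarchies[attr]))  # attribute generalization
        # else: attribute removal
    counts = Counter()
    for row in rows:
        counts[tuple(h[row[a]] if h else row[a] for a, h in plan)] += 1
    return counts

# Invented mini relation and one-level concept hierarchies:
rows = [
    {"name": "Ann", "major": "CS",      "gpa": 3.9},
    {"name": "Bob", "major": "Physics", "gpa": 3.6},
    {"name": "Cam", "major": "EE",      "gpa": 3.8},
    {"name": "Dee", "major": "Math",    "gpa": 3.7},
]
hierarchies = {
    "major": {"CS": "engineering", "EE": "engineering",
              "Physics": "science", "Math": "science"},
    "gpa": {3.9: "excellent", 3.8: "excellent",
            3.6: "very good", 3.7: "very good"},
}
result = aoi_generalize(rows, hierarchies)  # name removed: 4 values, no hierarchy
```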
Knowledge Types: Data Discrimination
A comparison of the general features of target class data objects with a set
of contrasting classes
The target and contrasting classes can be specified by the user
They must be comparable, i.e., share similar dimensions and attributes
Data discrimination procedure:
Data collection: query processing
Dimension relevance analysis: select only the highly relevant dimensions for
further analysis
Synchronous generalization: results in a prime target class relation
Presentation of the derived comparison: tables, graphs, and rules
Knowledge Types: Data Discrimination (cont.)
Example: compare the general properties of the graduate and undergraduate students at
Big University, given the attributes name, gender, major, birth_place, birth_date, residence,
phone#, and gpa. This data mining task can be expressed in DMQL as follows:
use Big University_DB
mine comparison as “grad vs undergrad_students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
Knowledge Types: Associations and Correlations
Frequent patterns are patterns that occur frequently in data
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
Mining frequent patterns leads to the discovery of interesting associations
and correlations within data
A frequent itemset refers to a set of items that frequently appear together in
a transactional data set
age(X, “20...29”) ∧ income(X, “20K...29K”) => buys(X, “CD player”) [support = 2%,
confidence = 60%]
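The support and confidence figures in rules like those above can be computed directly from a transactional data set. The tiny transaction list below is invented for illustration:

```python
# An invented transactional data set: each transaction is a set of items.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"printer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs => rhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

# Rule: buys(X, "computer") => buys(X, "software")
print(support({"computer", "software"}))       # 0.5
print(confidence({"computer"}, {"software"}))  # 2/3
```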
Knowledge Types: Associations and Correlations (cont.)
Market Basket Analysis
Knowledge Types: Classification
The process of finding a model (or function) that describes and
distinguishes data classes or concepts
Knowledge Types: Classification (cont.)
Knowledge Types: Classification Methods
Classification by Decision Tree Induction (ID3, C4.5, CART)
Bayesian Classification
Rule-Based Classification
Classification by Back-propagation
Support Vector Machines
Lazy Learners (or Learning from Your Neighbors)
Genetic Algorithms
Ensemble Methods: Bagging & Boosting
Fuzzy Set Approaches
Rough Set Approach
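The core step of decision tree induction can be sketched compactly: pick the attribute whose split yields the largest information gain. The records and labels below are invented; this shows only the split-selection step, not a full tree builder:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Choose the attribute with the largest information gain, i.e. the
    largest reduction in class entropy after splitting (as in ID3)."""
    base = entropy(labels)
    best, best_gain = None, -1.0
    for attr in rows[0]:
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        if base - remainder > best_gain:
            best, best_gain = attr, base - remainder
    return best, best_gain

rows = [
    {"age": "youth",  "income": "high"},
    {"age": "youth",  "income": "low"},
    {"age": "senior", "income": "high"},
    {"age": "senior", "income": "low"},
]
labels = ["yes", "yes", "no", "no"]   # invented toy labels
print(best_split(rows, labels))       # ('age', 1.0): age separates perfectly
```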
Knowledge Types: Prediction Methods
The process of finding a value or range of an attribute for a given condition
from the training dataset
Linear Regression
Nonlinear Regression
Log-linear Models
Decision Tree Induction
Ensemble Methods: Bagging & Boosting
Forecasting
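Simple linear regression is the most basic of these methods. A least-squares sketch on invented toy data (the variables and numbers are illustrative only):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ a + b*x (simple linear regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Invented toy data: x might be years of experience, y a salary in $1000s.
xs = [1, 2, 3, 4, 5]
ys = [32, 34, 36, 38, 40]
a, b = fit_line(xs, ys)
print(a + b * 6)  # predicted y for x = 6 -> 42.0
```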
The Third Primitive of Data Mining: Background Knowledge
Useful to guide the knowledge discovery process and evaluate patterns
Background Knowledge: Concept Hierarchies
Defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts
Allows data to be mined at multiple levels of abstraction
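One simple way to represent a concept hierarchy is a child-to-parent mapping; generalization then just climbs the mapping. The location values below are a hypothetical example (city < province_or_state < country < all):

```python
# A concept hierarchy for `location` as a child -> parent mapping.
# The specific places are illustrative only.
parent = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "British Columbia": "Canada", "Canada": "all",
    "Chicago": "Illinois", "Illinois": "USA", "USA": "all",
}

def generalize(value, levels=1):
    """Map a low-level concept to a higher-level one by climbing the
    hierarchy `levels` steps (stops if no parent is defined)."""
    for _ in range(levels):
        value = parent.get(value, value)
    return value

print(generalize("Vancouver"))            # British Columbia
print(generalize("Vancouver", levels=2))  # Canada
```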
Interestingness Measures and Thresholds for Pattern Evaluation
May be used to guide the mining process or, after discovery, to evaluate the
discovered patterns
Different kinds of knowledge may have different interestingness measures
Visualization of Discovered Patterns
Discovered knowledge should be expressed in high-level languages, visual
representations, or other expressive forms
Knowledge should be easily understood and directly usable by humans; this is
especially crucial if the data mining system is to be interactive
Data Mining Language: DMQL
DMQL (Data Mining Query Language):
Based on and similar to SQL (the Structured Query Language)
Can work with databases and data warehouses as well
Can easily be integrated with the relational query language
Example:
use database AllElectronics_db
use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
mine classification as promising_customers
in relevance to C.age, C.income, I.type, I.place_made, T.branch
from customer C, item I, transaction T
where I.item_ID = T.item_ID and C.cust_ID = T.cust_ID
and C.income >= 40,000 and I.price >= 100
group by T.cust_ID
having sum(I.price) >= 1,000
display as rules
Data Mining Language: OLE DB
Microsoft’s OLE DB (Object Linking and Embedding, Database):
A major step toward the standardization of data mining language primitives;
it aims to become the industry standard
Adopts many concepts in relational database systems and applies them to the
data mining field, providing a standard programming API.
Designed to allow data mining client applications (data mining consumers)
to consume data mining services from various data mining software providers.
Has DMX (Data Mining eXtensions) at the core, which is SQL-like
OLE DB for DM describes an abstraction of the data mining process:
Model creation
Model training
Model prediction and browsing
Data Mining Language: OLE DB (cont.)
Example:
create mining model prediction
( customer_ID long key,
gender text discrete,
age long discretized(),
income long continuous,
profession text discrete
)
using Microsoft_Decision_Trees
Data Mining System Architecture
Types of Data Mining Architectures
No-coupling Data Mining:
Data mining system does not use any functionality of a database or warehouse
Retrieves data from a particular data source
Does not take advantage of database functionality
Considered a poor architecture, but used for simple data mining applications
Loose Coupling Data Mining:
System may use some of the functions of database and data warehouse system
Fetches data from the data repository managed by the database or warehouse system
Stores the mining result either in a file or in a designated place in a database or
in a data warehouse
Does not provide high scalability or high performance
Types of Data Mining Architectures (cont.)
Semi-Tight Coupling Data Mining:
Mining system is linked with a database or a data warehouse system
Uses several features of data warehouse systems
Applications include sorting, indexing & aggregation
Efficient implementations of a few data mining primitives can be provided
Tight Coupling Data Mining:
Mining system is fully integrated into a database or data warehouse system
Mining subsystem is treated as one functional component of the information system
Provides system scalability, high performance, and integrated information
References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining.
[3] https://data-flair.training/blogs/data-mining-architecture/
[4] https://www.tutorialspoint.com/data_mining/dm_systems.htm
Editor's Notes
Ref [1] (Han & Kamber, Data Mining: Concepts and Techniques, 2nd ed.), Sec. 1.7, p. 31
Ref [1], pp. 31–32
Ref [1], Sec. 1.3.2, p. 12
Suppose that AllElectronics is a successful international company, with branches around
the world. Each branch has its own set of databases. The president of AllElectronics has
asked you to provide an analysis of the company’s sales per item type per branch for the
third quarter. This is a difficult task, particularly since the relevant data are spread out
over several databases, physically located at numerous sites.
If AllElectronics had a data warehouse, this task would be easy. A data warehouse
is a repository of information collected from multiple sources, stored under
a unified schema, and that usually resides at a single site. Data warehouses are constructed
via a process of data cleaning, data integration, data transformation, data
loading, and periodic data refreshing. This process is discussed in Chapters 2 and 3.
Figure 1.7 shows the typical framework for construction and use of a data warehouse
for AllElectronics.
Ref [1], Example 1.2, p. 13
A data cube for AllElectronics. A data cube for summarized sales data of AllElectronics
is presented in Figure 1.8(a). The cube has three dimensions: address (with city values
Chicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and
item (with item type values home entertainment, computer, phone, security). The aggregate
value stored in each cell of the cube is sales amount (in thousands). For example, the total
sales for the first quarter, Q1, for items relating to security systems in Vancouver is $400,000,
as stored in cell ⟨Vancouver, Q1, security⟩. Additional cubes may be used to store aggregate
sums over each dimension, corresponding to the aggregate values obtained using different
SQL group-bys (e.g., the total sales amount per city and quarter, or per city and item, or
per quarter and item, or per each individual dimension).
Ref [1], Example 1.2, p. 13
Ref [1], Example 9.6, p. 554
Ref [2], Section 3.4.1 , p. 131-133
Ref [2], Section 3.4.3 , p. 137
Table 3.12 shows the result of summing over all locations for various combinations of date and product. For simplicity, assume that all the dates are within one year. If there are 365 days in a year and 1000 products, then Table 3.12 has 365,000 entries (totals), one for each product-date pair. We could also specify the store location and date and sum over products, or specify the location and product and sum over all dates.
Ref [2], p. 138
Ref [2], Section 2.1.2 , p. 29
Ref [1], Section 7.1 , p. 383
Clustering - data segmentation
Ref [1], Section 7.1 , p. 385
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures which tend to find spherical clusters with similar size and density.
CLARANS: Clustering LARge Applications based upon RANdomized Search
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
ROCK: RObust Clustering using linKs
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
OPTICS: Ordering Points To Identify the Clustering Structure
DENCLUE: DENsity-based CLUstEring
STING: STatistical INformation Grid
Ref [1], Section 4.3.1 , p. 199
Attribute-oriented induction. Here we show how attribute-oriented induction is performed on the initial working relation of Table 4.12. For each attribute of the relation, the generalization proceeds as follows:
1. name: Since there are a large number of distinct values for name and there is no generalization operation defined on it, this attribute is removed.
2. gender: Since there are only two distinct values for gender, this attribute is retained and no generalization is performed on it.
3. major: Suppose that a concept hierarchy has been defined that allows the attribute major to be generalized to the values {arts & science, engineering, business}. Suppose also that the attribute generalization threshold is set to 5, and that there are more than 20 distinct values for major in the initial working relation. By attribute generalization and attribute generalization control, major is therefore generalized by climbing the given concept hierarchy.
4. birth_place: This attribute has a large number of distinct values; therefore, we would like to generalize it. Suppose that a concept hierarchy exists for birth_place, defined as "city < province_or_state < country". If the number of distinct values for country in the initial working relation is greater than the attribute generalization threshold, then birth_place should be removed, because even though a generalization operator exists for it, the generalization threshold would not be satisfied. If instead, the number of distinct values for country is less than the attribute generalization threshold, then birth_place should be generalized to birth_country.
5. birth_date: Suppose that a hierarchy exists that can generalize birth_date to age, and age to age_range, and that the number of age ranges (or intervals) is small with respect to the attribute generalization threshold. Generalization of birth_date should therefore take place.
6. residence: Suppose that residence is defined by the attributes number, street, residence_city, residence_province_or_state, and residence_country. The number of distinct values for number and street will likely be very high, since these concepts are quite low level. The attributes number and street should therefore be removed, so that residence is then generalized to residence_city, which contains fewer distinct values.
7. phone#: As with the attribute name above, this attribute contains too many distinct values and should therefore be removed in generalization.
8. gpa: Suppose that a concept hierarchy exists for gpa that groups values for grade point average into numerical intervals like {3.75–4.0, 3.5–3.75, ...}, which in turn are grouped into descriptive values, such as {excellent, very good, ...}. The attribute can therefore be generalized.
Ref [1], Example 1.4 , p. 22
class comparison (also known as discrimination)
Ref [1], Example 4.27 , p. 212
Ref [1], Example 5.1.2 , p. 230
Ref [1], Example 5.1.1 , p. 228
Ref [1], Example 6.1, p. 287
Ref [1], Example 6.1, p. 287
Ref [1], p. 31,32
Ref [1], Section 3.2.5, p. 121
Ref [1], p. 31,32
A data mining system can uncover thousands of patterns. Many of the patterns discovered may be uninteresting to the given user, either because they represent common knowledge or lack novelty. Several challenges remain regarding the development of techniques to assess the interestingness of discovered patterns, particularly with regard to subjective measures that estimate the value of patterns with respect to a given user class, based on user beliefs or expectations. The use of interestingness measures or user-specified constraints to guide the discovery process and reduce the search space is another active area of research.
Ref [1], p. 31, 32
Mining classification rules. Suppose, as a marketing manager of AllElectronics, you would like to classify customers based on their buying patterns. You are especially interested in those customers whose salary is no less than $40,000, and who have bought more than $1,000 worth of items, each of which is priced at no less than $100. In particular, you are interested in the customer’s age, income, the types of items purchased, the purchase location, and where the items were made. You would like to view the resulting classification in the form of rules.