Task-Relevant Data, Discretization,
and Concept Hierarchy
* Subject: Data Mining & Business Intelligence
* CE-B
* Maulik Togadiya 130240107090
Task-Relevant Data
* This specifies the portion of the database or the set of data in which the
user is interested. It includes the database attributes or data warehouse
dimensions of interest (referred to as the relevant attributes or
dimensions).
* The specification includes the following:
- Data warehouse name
- Database table
- Condition for data selection
- Dimension
- Data grouping criteria
Example
* If a data mining task is to study associations between items frequently
purchased at AllElectronics by customers in India, the task-relevant data
can be specified by providing the following information:
- Name of the database or data warehouse to be used (e.g., AllElectronics_db)
- Names of the tables or data cubes containing relevant data (e.g., item,
customer, purchases, and items_sold)
- Conditions for selecting the relevant data (e.g., retrieve data for purchases
made in India for the current year)
- The relevant attributes or dimensions (e.g., name and price from the item table
and income and age from the customer table)
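The four parts of this specification can be sketched in Python as a small record plus a selection step. This is only an illustration: the class name TaskRelevantData, the function select, the column names "country" and "year", the value of current_year, and the sample rows are all assumptions, not part of the AllElectronics example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskRelevantData:
    """Hypothetical container for a task-relevant data specification."""
    database: str
    tables: list[str]
    condition: Callable[[dict], bool]
    attributes: list[str]


def select(spec: TaskRelevantData, rows: list[dict]) -> list[dict]:
    """Keep rows that satisfy the condition, projected onto the relevant attributes."""
    return [
        {name: row[name] for name in spec.attributes}
        for row in rows
        if spec.condition(row)
    ]


current_year = 2024  # stand-in for "the current year"

spec = TaskRelevantData(
    database="AllElectronics_db",
    tables=["item", "customer", "purchases", "items_sold"],
    # Assumed column names: "country" and "year" are not given in the slides.
    condition=lambda row: row["country"] == "India" and row["year"] == current_year,
    attributes=["name", "price", "income", "age"],
)

# Made-up, already-joined rows used only to exercise the sketch.
rows = [
    {"name": "TV", "price": 500, "income": 40000, "age": 35,
     "country": "India", "year": 2024},
    {"name": "Phone", "price": 300, "income": 52000, "age": 28,
     "country": "Canada", "year": 2024},
]

relevant = select(spec, rows)
```

Only the first row survives the condition, and it is reduced to the four relevant attributes.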
The Kind of Knowledge to Be Mined
* Characterization
* Discrimination
* Association
* Classification/prediction
* Clustering
* Outlier analysis
* Other data mining tasks
Concept Hierarchies
* A concept hierarchy defines a sequence of mappings from a set of low-level
concepts to higher-level, more general concepts.
* Different types of concept hierarchies:
- Schema hierarchy
- Set-grouping hierarchy
- Operation-derived hierarchy
- Rule-based hierarchy
Schema hierarchy
* A schema hierarchy is a total or partial order among the attributes in the
database schema. This hierarchy may formally express semantic relationships
between attributes.
* Schema hierarchy of a relation for address containing the attributes
house_number, street, city, state, and country:
house_number < street < city < state < country
* In this example, house_number is at a conceptually lower level than street,
which is lower than city, which is lower than state, which in turn is lower than country.
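One way to use such an ordering is to generalize a record by dropping every attribute below a chosen level. The sketch below assumes the hierarchy is stored as a Python list ordered from lowest to highest level; the function name roll_up and the sample address are hypothetical.

```python
# Attributes ordered from the lowest (most specific) to the highest level,
# following house_number < street < city < state < country.
SCHEMA_HIERARCHY = ["house_number", "street", "city", "state", "country"]


def roll_up(record: dict, level: str) -> dict:
    """Generalize a record by keeping only attributes at `level` or higher."""
    start = SCHEMA_HIERARCHY.index(level)
    return {attr: record[attr] for attr in SCHEMA_HIERARCHY[start:]}


# Made-up address used only for illustration.
address = {
    "house_number": "12",
    "street": "MG Road",
    "city": "Surat",
    "state": "Gujarat",
    "country": "India",
}

city_level = roll_up(address, "city")
```

Rolling up to "city" keeps city, state, and country and discards house_number and street.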
Set-grouping hierarchy
* Organizes values for a given attribute into groups, sets, or ranges of
values.
* A total or partial order can be defined among groups.
* Used to refine or enrich schema-defined hierarchies.
* Example: set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20, …, 29} ⊂ young
{40, …, 59} ⊂ middle_aged
{60, …, 89} ⊂ senior
Set-grouping hierarchy (diagram)
all(age)
- young: {20, 21, …, 29}
- middle_aged: {40, 41, …, 59}
- senior: {60, 61, …, 89}
- …
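The age grouping can be sketched as a lookup from a value to its group. The ranges are taken exactly as listed on the slide; the names AGE_GROUPS and age_group are hypothetical, and returning None for an uncovered age is an assumption of this sketch.

```python
from typing import Optional

# Age ranges exactly as listed on the slide (both bounds inclusive).
AGE_GROUPS = {
    "young": range(20, 30),
    "middle_aged": range(40, 60),
    "senior": range(60, 90),
}


def age_group(age: int) -> Optional[str]:
    """Return the group containing `age`, or None if no listed range covers it."""
    for group, ages in AGE_GROUPS.items():
        if age in ages:
            return group
    return None
```

Ages such as 35 fall in none of the listed ranges, so the sketch returns None for them rather than guessing a group.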
Operation-derived hierarchy
* An operation-derived hierarchy is based on operations specified by
users, experts, or the data mining system. Operations can include the
decoding of encoded information, information extraction from complex data
objects, and data clustering.
* Example: the e-mail address markovz@cs.ccsu.edu
instantiates the hierarchy user-name < department < university <
usa-university.
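The decoding operation for this e-mail example might look like the sketch below. It assumes an address of the form user@department.university.tld and treats a ".edu" ending as a USA university; both the function name decode_email and that mapping are assumptions made for illustration.

```python
def decode_email(address: str) -> dict:
    """Split an address of the form user@department.university.tld into levels."""
    user, domain = address.split("@")
    department, university, top_level = domain.split(".")
    # Assumption of this sketch: a ".edu" address denotes a USA university.
    group = "usa-university" if top_level == "edu" else "other"
    return {
        "user-name": user,
        "department": department,
        "university": university,
        "group": group,
    }


levels = decode_email("markovz@cs.ccsu.edu")
```

Here the hierarchy levels are not stored in the data; they are derived from a single string by an operation.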
Rule-based hierarchy
* In a rule-based hierarchy, either a whole concept hierarchy or a portion of it
is defined by a set of rules and is evaluated dynamically based on the current
data and the rule definitions.
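Dynamic evaluation can be sketched as a function that assigns a level from the current price and cost of an item, using the $50 and $250 thresholds of the profit-margin example on this slide. The function name profit_margin_level is hypothetical, and placing a margin of exactly $50 in the medium level is an assumption of this sketch.

```python
def profit_margin_level(price: float, cost: float) -> str:
    """Assign an item to a level_1 concept from its current price and cost."""
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:
        # A margin of exactly $50 is placed here; that choice is an assumption.
        return "medium_profit_margin"
    return "high_profit_margin"
```

Because the level is computed on demand, an item moves to a different level as soon as its price or cost changes.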
Example: define hierarchy profit_margin_hierarchy on item as
- level_1: low_profit_margin < level_0: all if (price - cost) < $50
- level_1: medium_profit_margin < level_0: all if ((price - cost) >= $50) and
((price - cost) <= $250)
- level_1: high_profit_margin < level_0: all if (price - cost) > $250
Discretization and Concept Hierarchy
* Discretization
- Reduces the number of values for a given continuous attribute by dividing the
range of the attribute into intervals.
* Concept hierarchies
- Reduce the data by collecting and replacing low-level concepts (such as
numeric values for the attribute age) with higher-level concepts (such as young,
middle-aged, or senior).
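Dividing a range into intervals can be done in several ways. The sketch below uses equal-width intervals, which is one common choice and an assumption here; the function name equal_width_intervals and the sample ages are made up for illustration.

```python
def equal_width_intervals(values, k):
    """Split [min, max] of `values` into k equal-width intervals.

    Returns, for each value, the index (0 to k - 1) of its interval.
    """
    low, high = min(values), max(values)
    width = (high - low) / k
    if width == 0:
        # All values are identical, so they share a single interval.
        return [0] * len(values)
    labels = []
    for v in values:
        # The maximum value would land at index k, so clamp it into the last interval.
        labels.append(min(int((v - low) / width), k - 1))
    return labels


ages = [22, 25, 41, 47, 58, 63, 70, 82]  # made-up ages
labels = equal_width_intervals(ages, 3)
```

With three intervals of width 20 starting at 22, the eight ages are replaced by only three interval labels.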
Discretization and concept hierarchy
generation for numeric data
* Binning
* Histogram analysis
* Clustering analysis
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
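The three steps of this example can be reproduced with a short Python sketch. The function names are hypothetical, and the sketch assumes the data is already sorted and that its length is divisible by the number of bins.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Partition sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value with its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 shown in the example.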
Histogram analysis
* A partitioning rule is applied to define the ranges of values.
* Divide the data into buckets and store the average (or sum) for each bucket.
Clustering analysis
* Partition the data into groups, or clusters.
* Clustering is the process of partitioning a set of data (or objects) into a set of
meaningful subclasses, called clusters.
* Helps users understand the natural grouping or structure in a data set.
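For a single numeric attribute, the idea of natural groupings can be illustrated with a very simple stand-in for a clustering algorithm: sort the values and cut at the widest gaps. This gap-based grouping is an assumption chosen for brevity, not a general clustering method, and the function name gap_clusters and the sample ages are made up.

```python
def gap_clusters(values, k):
    """Group 1-D values into k clusters by cutting at the k - 1 widest gaps."""
    ordered = sorted(values)
    # Each entry pairs a gap width with the index of the value just before it.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Indices after which a new cluster starts, in ascending order.
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    clusters, start = [], 0
    for cut in cuts:
        clusters.append(ordered[start:cut + 1])
        start = cut + 1
    clusters.append(ordered[start:])
    return clusters


ages = [21, 23, 24, 41, 43, 47, 65, 68, 70]  # made-up ages
clusters = gap_clusters(ages, 3)
```

Each resulting cluster can then be treated as one interval of the discretized attribute.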
Concept Hierarchy Generation for Categorical
Data
* Specification of a partial or total ordering of attributes explicitly at the schema level
by users or experts
- street < city < state < country
* Specification of a hierarchy for a set of values by explicit data grouping
- {Ahmedabad, Surat, Rajkot} < Gujarat
* Specification of only a partial set of attributes
- E.g., only street < city, not others
* Automatic generation of hierarchies (or attribute levels) by analysis of the
number of distinct values
- E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
* Some hierarchies can be generated automatically by analyzing
the number of distinct values per attribute in the data set.
- The attribute with the most distinct values is placed at the lowest level
of the hierarchy.
- Exceptions exist, e.g., weekday, month, quarter, year.
country (15 distinct values)
province_or_state (365 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
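The distinct-value heuristic amounts to sorting attributes by their counts. The sketch below uses the four counts from this slide; the function name order_by_distinct_values is hypothetical, and the counts for the time attributes (in particular 20 years) are made up to illustrate the exception.

```python
def order_by_distinct_values(distinct_counts):
    """Order attributes from the top of the hierarchy (fewest distinct values) down."""
    return sorted(distinct_counts, key=distinct_counts.get)


location_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
location_hierarchy = order_by_distinct_values(location_counts)

# Made-up counts for time attributes, where the heuristic breaks down.
time_counts = {"weekday": 7, "month": 12, "quarter": 4, "year": 20}
time_hierarchy = order_by_distinct_values(time_counts)
```

The location attributes come out in the expected order, from country down to street. The time attributes come out as quarter, weekday, month, year, which is not a meaningful hierarchy, so such cases need an ordering specified by a user or expert.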