The document discusses concept description in data mining. It covers data generalization and summarization to characterize data at a higher conceptual level. This involves abstracting data from lower to higher conceptual levels through techniques like attribute removal, generalization and aggregation. Analytical characterization analyzes attribute relevance, while mining class comparisons allows discriminating between classes. Descriptive statistical measures can also be mined from large databases. Applications include telecommunications, social network analysis and intrusion detection.
The document discusses multidimensional databases and data warehousing. It describes multidimensional databases as optimized for data warehousing and online analytical processing to enable interactive analysis of large amounts of data for decision making. It discusses key concepts like data cubes, dimensions, measures, and common data warehouse schemas including star schema, snowflake schema, and fact constellations.
Data warehousing and online analytical processing
The document discusses data warehousing and online analytical processing (OLAP). It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data used to support management decision making. It describes key concepts such as data warehouse modeling using data cubes and dimensions, extraction, transformation and loading of data, and common OLAP operations. The document also provides examples of star schemas and how they are used to model data warehouses.
The document discusses frequent itemset mining methods. It describes the Apriori algorithm which uses a candidate generation-and-test approach involving joining and pruning steps. It also describes the FP-Growth method which mines frequent itemsets without candidate generation by building a frequent-pattern tree. The advantages of each method are provided, such as Apriori being easily parallelized but requiring multiple database scans.
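The candidate generation-and-test loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the document's code: the function name, the absolute support threshold, and the toy transactions are our own.

```python
from itertools import combinations

# A minimal Apriori sketch: one database scan per pass, then a join step
# (merge frequent k-itemsets) and a prune step (drop candidates with any
# infrequent k-subset).
def apriori(transactions, min_sup):
    items = sorted({i for t in transactions for i in t})
    freq = {}
    current = [frozenset([i]) for i in items]
    k = 1
    while current:
        # Scan the database once per pass to count candidate supports.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        freq.update(level)
        # Join + prune to build the next pass's candidates.
        candidates = set()
        for a, b in combinations(sorted(level, key=sorted), 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(cand, k)
            ):
                candidates.add(cand)
        current = list(candidates)
        k += 1
    return freq

tx = [frozenset("ABC"), frozenset("AB"), frozenset("AC"), frozenset("BC")]
found = apriori(tx, 2)
```

With a support threshold of 2, every single item and every pair is frequent here, but {A, B, C} (support 1) is pruned.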
A hash function usually compresses its input, meaning the output is shorter than the input.
A hash function takes a group of characters (called a key) and maps it to a value of a fixed length (called a hash value or hash).
The hash value is representative of the original string of characters, but is normally smaller than the original.
Hash functions are also known as hashing algorithms, message digests, or one-way encryption.
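The fixed-length compression described above is easy to see with a concrete hash function. The sketch below uses SHA-256 from Python's standard `hashlib`; the messages are our own examples:

```python
import hashlib

# Hashing an arbitrary-length message down to a fixed-size digest.
message = b"The quick brown fox jumps over the lazy dog"
digest = hashlib.sha256(message).hexdigest()

# The output length is fixed (256 bits = 64 hex characters) regardless of
# how long the input is.
print(len(digest))  # 64

# Changing a single character of the input yields a completely different
# digest, which is why the hash can stand in for the original data.
other = hashlib.sha256(b"The quick brown fox jumps over the lazy cog").hexdigest()
print(digest == other)  # False
```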
Hierarchical clustering methods group data points into a hierarchy of clusters based on their distance or similarity. There are two main approaches: agglomerative, which starts with each point as a separate cluster and merges them; and divisive, which starts with all points in one cluster and splits them. AGNES and DIANA are common agglomerative and divisive algorithms. Hierarchical clustering represents the hierarchy as a dendrogram tree structure and allows exploring data at different granularities of clusters.
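The agglomerative (AGNES-style) approach can be sketched directly: start with singleton clusters and repeatedly merge the closest pair. This is our own illustration on 1-D points with single-link distance; the function name and data are invented:

```python
# Agglomerative clustering sketch: single linkage on 1-D points, merging
# the closest pair of clusters until only k clusters remain.
def agglomerate(points, k):
    # Start with every point as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link
        # (minimum pairwise) distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair and repeat.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the sequence of merges (rather than stopping at k) is exactly what a dendrogram visualizes.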
This document discusses association rule mining. Association rule mining finds frequent patterns, associations, correlations, or causal structures among items in transaction databases. The Apriori algorithm is commonly used to find frequent itemsets and generate association rules. It works by iteratively joining frequent itemsets from the previous pass to generate candidates, and then pruning the candidates that have infrequent subsets. Various techniques can improve the efficiency of Apriori, such as hashing to count itemsets and pruning transactions that don't contain frequent itemsets. Alternative approaches like FP-growth compress the database into a tree structure to avoid costly scans and candidate generation. The document also discusses mining multilevel, multidimensional, and quantitative association rules.
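Once frequent itemsets are found, rule generation reduces to comparing support counts. A minimal confidence calculation, with made-up illustrative numbers rather than figures from the document:

```python
# Confidence of a rule A => B from support counts:
# conf(A => B) = sup(A ∪ B) / sup(A).
def confidence(sup_ab, sup_a):
    return sup_ab / sup_a

# If {milk, bread} occurs in 40 of 100 transactions and {milk} in 50,
# the rule {milk} => {bread} holds with confidence 40/50 = 0.8.
print(confidence(40, 50))  # 0.8
```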
Clustering is an unsupervised learning technique used to group unlabeled data points together based on similarities. It aims to maximize similarity within clusters and minimize similarity between clusters. There are several clustering methods including partitioning, hierarchical, density-based, grid-based, and model-based. Clustering has many applications such as pattern recognition, image processing, market research, and bioinformatics. It is useful for extracting hidden patterns from large, complex datasets.
This document contains a data structures question paper from Anna University. It has two parts:
Part A contains 10 short answer questions covering topics like ADT, linked stacks, graph theory, algorithm analysis, binary search trees, and more.
Part B contains 5 long answer questions each worth 16 marks. Topics include algorithms for binary search, linear search, recursion, sorting, trees, graphs, files, and more. Students are required to write algorithms, analyze time complexity, and provide examples for each question.
This document discusses various data reduction techniques including dimensionality reduction through attribute subset selection, numerosity reduction using parametric and non-parametric methods like data cube aggregation, and data compression. It describes how attribute subset selection works to find a minimum set of relevant attributes to make patterns easier to detect. Methods for attribute subset selection include forward selection, backward elimination, and bi-directional selection. Decision trees can also help identify relevant attributes. Data cube aggregation stores multidimensional summarized data to provide fast access to precomputed information.
This document summarizes Chapter 5 of the textbook "Data Mining: Concepts and Techniques". It discusses concept description, which involves characterizing data through generalization, summarization, and comparison of different classes. Key aspects covered include data cube approaches to characterization, attribute-oriented induction for generalization, analytical characterization of attribute relevance, and presenting generalized results through cross-tabulation, visualization, and rules. Implementation can utilize pre-computed data cubes to enable efficient analysis operations like drill-down.
Decision trees are a type of supervised learning algorithm used for classification and regression. ID3 and C4.5 are algorithms that generate decision trees by choosing the attribute with the highest information gain at each step. Random forest is an ensemble method that creates multiple decision trees and aggregates their results, improving accuracy. It introduces randomness when building trees to decrease variance.
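The "highest information gain" criterion used by ID3 and C4.5 is a short computation: the entropy of the parent labels minus the weighted entropy after a split. A sketch with invented example labels:

```python
from math import log2

# Shannon entropy of a list of class labels.
def entropy(labels):
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * log2(labels.count(c) / n)
        for c in set(labels)
    )

# Information gain = parent entropy minus the size-weighted entropy of
# the child partitions produced by splitting on an attribute.
def information_gain(parent, splits):
    n = len(parent)
    remainder = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - remainder

# A perfectly separating split removes all uncertainty: gain == 1 bit.
gain = information_gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]])
```

At each node, the tree builder evaluates this gain for every candidate attribute and splits on the winner.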
Decision tree induction / Decision Tree Algorithm with Example | Data Science
This Decision Tree Algorithm in Machine Learning presentation covers the basics of decision trees: what machine learning is, what a decision tree is, the advantages and disadvantages of decision trees, how the decision tree algorithm works through solved examples, and, at the end, a decision tree use case/demo in Python for loan payment. This tutorial suits both beginners and experts who want to learn machine learning algorithms.
1. Merkle tree is a fundamental part of blockchain technology.
2. It is a mathematical data structure composed of hashes of different blocks of data, and it serves as a summary of all the transactions in a block.
3. It allows for efficient and secure verification of content in a large body of data and helps verify the data's consistency. Both Bitcoin and Ethereum use the Merkle tree structure.
4. Merkle Tree is also known as Hash Tree.
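The summarizing property in the list above comes from pairwise hashing: leaves are hashed, then each level hashes concatenated pairs until one root remains. A simplified sketch (real Bitcoin trees double-hash each node; here we hash once and, like Bitcoin, duplicate the last node on odd-length levels):

```python
import hashlib

# Compute a Merkle root by repeatedly hashing adjacent pairs of nodes.
def merkle_root(leaves):
    level = [hashlib.sha256(x).digest() for x in leaves]
    while len(level) > 1:
        if len(level) % 2:            # odd level: duplicate the last node
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()

root = merkle_root([b"tx1", b"tx2", b"tx3"])
```

Changing any single transaction changes the root, which is how one hash can attest to a whole block's contents.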
The document discusses market basket analysis and the Apriori algorithm. Market basket analysis is used to discover frequent item sets purchased together in transaction data. The Apriori algorithm is used to find these frequent item sets by scanning transactions to count item occurrences, filtering out infrequent items, and generating candidate item sets. Frequent item sets can be used for applications like cross-selling items, proper item placement, fraud detection, understanding customer behavior, and affinity promotion.
This document discusses cryptography in blockchain. It begins by introducing blockchain and cryptography separately. It then defines important cryptography terminology like encryption, decryption, cipher, and key. It describes the main types of cryptography as symmetric-key, asymmetric-key, and hash functions. It explains how blockchain uses asymmetric-key algorithms and hash functions. Hash functions are used to link blocks and maintain integrity. Cryptography provides benefits like the avalanche effect and uniqueness to blockchain. Finally, it discusses an application of cryptography in cryptocurrency, where public-private key pairs maintain user addresses and digital signatures approve transactions.
The document discusses decision tree algorithms. It begins with an introduction and example, then covers the principles of entropy and information gain used to build decision trees. It provides explanations of key concepts like entropy, information gain, and how decision trees are constructed and evaluated. Examples are given to illustrate these concepts. The document concludes with strengths and weaknesses of decision tree algorithms.
In computer science, a linked list is a linear collection of data elements, whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which together represent a sequence.
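The definition above translates directly into code: a node holds a value and a pointer to the next node, and the list is just a reference to the head. A minimal sketch with our own class names:

```python
# A singly linked list: order comes from the links, not memory placement.
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # The new node points at the old head, becoming the new head.
        self.head = Node(value, self.head)

    def to_list(self):
        # Walk the chain of next-pointers to recover the sequence.
        out, node = [], self.head
        while node:
            out.append(node.value)
            node = node.next
        return out
```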
R is an open source programming language and software environment for statistical analysis and graphics. It is widely used among data scientists for tasks like data manipulation, calculation, and graphical data analysis. Some key advantages of R include that it is open source and free, has a large collection of statistical tools and packages, is flexible, and has strong capabilities for data visualization. It also has an active user community and can integrate with other software like SAS, Python, and Tableau. R is a popular and powerful tool for data scientists.
This document provides an overview of object-oriented programming concepts in Python including objects, classes, inheritance, polymorphism, and encapsulation. It defines key terms like objects, classes, and methods. It explains how to create classes and objects in Python. It also discusses special methods, modules, and the __name__ variable.
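The concepts listed above (classes, inheritance, polymorphism, special methods) fit in one small sketch; the class names and the shape example are invented for illustration:

```python
# A base class with a special method (__str__) that subclasses customize
# by overriding area() -- polymorphism in action.
class Shape:
    def __init__(self, name):
        self.name = name

    def area(self):
        raise NotImplementedError

    def __str__(self):
        return f"{self.name}: area={self.area()}"

class Rectangle(Shape):
    def __init__(self, w, h):
        super().__init__("rectangle")   # inheritance: reuse Shape.__init__
        self.w, self.h = w, h

    def area(self):                     # override the base-class method
        return self.w * self.h

print(Rectangle(3, 4))  # rectangle: area=12
```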
The document discusses different clustering techniques used for grouping large amounts of data. It covers partitioning methods like k-means and k-medoids that organize data into exclusive groups. It also describes hierarchical methods like agglomerative and divisive clustering that arrange data into nested groups or trees. Additionally, it mentions density-based and grid-based clustering and provides algorithms for different clustering approaches.
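The k-means partitioning method mentioned above alternates two steps: assign each point to its nearest centroid, then move each centroid to its cluster's mean. A bare-bones 1-D sketch, using fixed initial centroids (our choice, for determinism) rather than random ones:

```python
# A minimal k-means on 1-D data: alternate assignment and update steps.
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: each centroid moves to its cluster's mean
        # (kept in place if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

cents, groups = kmeans([1.0, 2.0, 10.0, 11.0], [0.0, 5.0])
```

On this toy data the centroids converge to 1.5 and 10.5 after the first pass and then stay put.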
The document discusses various data reduction strategies including attribute subset selection, numerosity reduction, and dimensionality reduction. Attribute subset selection aims to select a minimal set of important attributes. Numerosity reduction techniques like regression, log-linear models, histograms, clustering, and sampling can reduce data volume by finding alternative representations like model parameters or cluster centroids. Dimensionality reduction techniques include discrete wavelet transformation and principal component analysis, which transform high-dimensional data into a lower-dimensional representation.
Hashing is a technique used to uniquely identify objects by assigning each object a key, such as a student ID or book ID number. A hash function converts large keys into smaller keys that are used as indices in a hash table, allowing for fast lookup of objects in O(1) average time. Collisions, where two different keys hash to the same index, are resolved using techniques like separate chaining or linear probing. Common applications of hashing include databases, caches, and object representation in programming languages.
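Separate chaining, one of the collision-resolution techniques mentioned above, keeps a list of key-value pairs per bucket. A tiny sketch with a deliberately small table so collisions actually occur; the class and keys are our own illustration:

```python
# A hash table with separate chaining: colliding keys share a bucket list.
class ChainedHashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Compress an arbitrary key into a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pair in bucket:
            if pair[0] == key:        # same key again: update in place
                pair[1] = value
                return
        bucket.append([key, value])   # chain a new pair onto the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)
```

With only two buckets, some of the student-ID keys below necessarily collide, yet lookups still return the right values.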
The document is a chapter from a textbook on data mining written by Akannsha A. Totewar, a professor at YCCE in Nagpur, India. It provides an introduction to data mining, including definitions of data mining, the motivation and evolution of the field, common data mining tasks, and major issues in data mining such as methodology, performance, and privacy.
The document discusses cluster analysis, which groups data objects into clusters so that objects within a cluster are similar but dissimilar to objects in other clusters. It describes key characteristics of clustering, including that it is unsupervised learning and the clusters are determined algorithmically rather than by humans. Various clustering algorithms are covered, including partitioning, hierarchical, density-based, and grid-based methods. Applications of clustering discussed include business intelligence, image recognition, web search, outlier detection, and biology. Requirements for effective clustering in data mining are also outlined.
This document provides an overview of data mining concepts and techniques. It defines data mining as the extraction of interesting and useful patterns from large amounts of data. The document outlines several potential applications of data mining, including market analysis, risk analysis, and fraud detection. It also describes the typical steps involved in a data mining process, including data cleaning, pattern evaluation, and knowledge presentation. Finally, it discusses different data mining functionalities, such as classification, clustering, and association rule mining.
Hashing is the process of converting a given key into another value. A hash function is used to generate the new value according to a mathematical algorithm. The result of a hash function is known as a hash value or simply, a hash.
Introduction to Data Mining Concepts and Techniques
This document provides an introduction to data mining techniques. It discusses data mining concepts like data preprocessing, analysis, and visualization. For data preprocessing, it describes techniques like similarity measures, down sampling, and dimension reduction. For data analysis, it explains clustering, classification, and regression methods. Specifically, it gives examples of k-means clustering and support vector machine classification. The goal of data mining is to retrieve hidden knowledge and rules from data.
The document discusses the components of a data mining system and data discretization. The key components of a data mining system are a data source, data mining engine, data warehouse server, pattern evaluation module, graphical user interface, and knowledge base. The data source provides the raw data, which is then cleaned, integrated and stored on the data warehouse server. The data mining engine applies various algorithms to the data to discover patterns. Pattern evaluation and the knowledge base help assess the quality of patterns found. Data discretization involves converting continuous attribute values into a finite set of intervals to simplify analysis and management of data.
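The discretization step described above is easiest to see with equal-width binning: divide the attribute's range into a fixed number of intervals and replace each value with its interval index. A sketch with an invented bin count and data:

```python
# Equal-width discretization: map continuous values to interval indices.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Each value gets the index of its interval; the maximum value is
    # clamped into the last bin instead of spilling past it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

Equal-frequency binning (same count per bin) is the usual alternative when the data are skewed.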
The document provides an introduction to data mining and knowledge discovery. It discusses how large amounts of data are extracted and transformed into useful information for applications like market analysis and fraud detection. The key steps in the knowledge discovery process are described as data cleaning, integration, selection, transformation, mining, pattern evaluation, and knowledge presentation. Common data sources, database architectures, and types of coupling between data mining systems and databases are also outlined.
This document discusses data mining and dimensionality reduction techniques. It introduces data mining as the process of discovering patterns in large datasets. Dimensionality reduction techniques like principal component analysis (PCA) and linear discriminant analysis (LDA) are used to reduce the number of variables while preserving important information. PCA transforms data into a new set of variables called principal components to reduce complexity and identify patterns. LDA projects data onto a lower dimensional space to maximize separation between classes for classification tasks. Examples of applying PCA and LDA to problems like facial recognition are provided.
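For 2-D data, the PCA transformation mentioned above can be computed by hand: build the covariance matrix and take its dominant eigenvector, whose angle for a 2x2 symmetric matrix has a closed form. A dependency-free sketch; the function name and data are our own:

```python
from math import atan2, cos, sin

# First principal axis of 2-D points via the 2x2 covariance matrix.
def first_principal_axis(points):
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance entries: [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Orientation of the dominant eigenvector: tan(2θ) = 2·sxy / (sxx - syy).
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return cos(theta), sin(theta)
```

Points lying along the line y = x, for instance, yield the axis (√2/2, √2/2): projecting onto it keeps all the variance in one dimension.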
This document discusses various data mining techniques, including artificial neural networks. It provides an overview of the knowledge discovery in databases process and the cross-industry standard process for data mining. It also describes techniques such as classification, clustering, regression, association rules, and neural networks. Specifically, it discusses how neural networks are inspired by biological neural networks and can be used to model complex relationships in data.
This document discusses data generalization and summarization techniques. It describes how attribute-oriented induction generalizes data from low to high conceptual levels by examining attribute values. The number of distinct values for each attribute is considered, and attributes may be removed, generalized up concept hierarchies, or retained in the generalized relation. An algorithm for attribute-oriented induction takes a relational database and data mining query as input and outputs a generalized relation. Generalized data can be presented as crosstabs, bar charts, or pie charts.
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...IOSR Journals
This document discusses using k-means clustering to partition datasets that have been generated through horizontal aggregation of data from multiple database tables. It provides background on horizontal aggregation techniques like pivot tables and describes the k-means clustering algorithm. The algorithm is applied as an example to cluster a sample dataset into two groups. The document concludes that k-means clustering can effectively partition large datasets produced by horizontal aggregations to facilitate further data mining analysis.
1. Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix into three other matrices.
2. SVD is primarily used for dimensionality reduction, information extraction, and noise reduction.
3. Key applications of SVD include matrix approximation, principal component analysis, image compression, recommendation systems, and signal processing.
EDAB Module 5 Singular Value Decomposition (SVD).pptxrajalakshmi5921
1. Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix into three other matrices.
2. SVD is primarily used for dimensionality reduction, information extraction, and noise reduction.
3. Key applications of SVD include matrix approximation, principal component analysis, image compression, recommendation systems, and signal processing.
This document provides an overview of different techniques for clustering categorical data. It discusses various clustering algorithms that have been used for categorical data, including K-modes, ROCK, COBWEB, and EM algorithms. It also reviews more recently developed algorithms for categorical data clustering, such as algorithms based on particle swarm optimization, rough set theory, and feature weighting schemes. The document concludes that clustering categorical data remains an important area of research, with opportunities to develop techniques that initialize cluster centers better.
Data Warehousing and Business Intelligence is one of the hottest skills today, and is the cornerstone for reporting, data science, and analytics. This course teaches the fundamentals with examples plus a project to fully illustrate the concepts.
This document proposes a heuristic algorithm to reduce information overload in database query results by automatically categorizing the results into a hierarchical structure. It first discusses related work in areas like data mining, OLAP, and discretization. It then presents the basics of categorizing query results, including defining a valid categorization structure and modeling how a user may explore the categorized results. The document introduces models to estimate the information overload a user faces during exploration based on the number of items examined. It formulates the categorization problem as an optimization to minimize this cost. The paper then describes a heuristic algorithm to efficiently search the space of possible categorizations to find high-quality solutions based on the cost models.
Data preprocessing is required because real-world data is often incomplete, noisy, inconsistent, and in an aggregate form. The goals of data preprocessing include handling missing data, smoothing out noisy data, resolving inconsistencies, computing aggregate attributes, reducing data volume to improve mining performance, and improving overall data quality. Key techniques for data preprocessing include data cleaning, data integration, data transformation, and data reduction.
This document discusses data mining and related topics. It begins by defining data mining as the process of discovering patterns in large datasets using methods from machine learning, statistics, and database systems. The document then discusses data warehouses, how they work, and their role in data mining. It describes different data mining functionalities and tasks such as classification, prediction, and clustering. The document outlines some common data mining applications and issues related to methodology, performance, and diverse data types. Finally, it discusses some social implications of data mining involving privacy, profiling, and unauthorized use of data.
The document provides an introduction to data mining. It defines data mining as analyzing data to understand the past and predict the future. It discusses how data mining combines statistics, machine learning, artificial intelligence, and databases. It also provides brief histories of data mining and database technology. Finally, it describes common data mining tasks like classification, prediction, regression, clustering, time series analysis, and summarization.
MS SQL SERVER: Introduction To Datamining Suing Sql Serversqlserver content
Data mining involves analyzing large datasets to discover patterns. It can be used to better understand systems by studying trends and patterns in vast amounts of data. Data mining uses classification, clustering, association, and regression algorithms to organize data and discover patterns. The data mining process involves data collection, cleaning, transformation, modeling, and assessment. Examples of data mining applications include customer relationship management, enterprise resource planning, and web log analysis.
MS Sql Server: Introduction To Datamining Suing Sql ServerDataminingTools Inc
Data mining involves analyzing large datasets to discover patterns. It can be used to better understand systems by studying trends and patterns in vast amounts of data. Data mining uses classification, clustering, association, and regression algorithms to organize data and discover patterns. The data mining process involves data collection, cleaning, transformation, modeling, and assessment. Examples of data mining applications include customer relationship management, enterprise resource planning, and analyzing web server logs.
The document provides an overview of different clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods like agglomerative and divisive, and density-based methods like DBSCAN and OPTICS. It discusses the basic concepts of clustering, requirements for effective clustering like scalability and ability to handle different data types and shapes. It also summarizes clustering algorithms like BIRCH that aim to improve scalability for large datasets.
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
How to Add Chatter in the odoo 17 ERP ModuleCeline George
In Odoo, the chatter is like a chat tool that helps you work together on records. You can leave notes and track things, making it easier to talk with your team and partners. Inside chatter, all communication history, activity, and changes will be displayed.
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
This slide is special for master students (MIBS & MIFB) in UUM. Also useful for readers who are interested in the topic of contemporary Islamic banking.
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Characterization and Comparison
1. UNIT - III
Concept Description:
Characterization and Comparison
By Mrs. Chetana
2. UNIT - III
• Concept Description: Characterization and Comparison:
Data Generalization and Summarization-Based Characterization,
Analytical Characterization: Analysis of Attribute Relevance,
Mining Class Comparisons: Discriminating between Different
Classes, Mining Descriptive Statistical Measures in Large
Databases.
• Applications:
Telecommunication Industry, Social Network Analysis, Intrusion
Detection
3. Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between
different classes
Mining descriptive statistical measures in large databases
Discussion
Summary
4. What is Concept Description?
From a data analysis point of view, data mining can be
classified into two categories:
Descriptive mining and predictive mining
◦ Descriptive mining: describes the data set in a concise,
summarized manner and presents interesting general
properties of the data
◦ Predictive mining: analyzes the data in order to construct
one or a set of models, and attempts to predict the behavior
of new data sets
5. What is Concept Description?
Databases usually store large amounts of data in great
detail.
However, users often like to view sets of summarized
data in concise, descriptive terms.
Such data descriptions may provide an overall picture of
a class of data or distinguish it from a set of comparative
classes.
Such descriptive data mining is called
concept description and forms an important
component of data mining
6. What is Concept Description?
The simplest kind of descriptive data mining is called
concept description.
A concept usually refers to a collection of data such as
frequent_buyers, graduate_students and so on.
As a data mining task, concept description is not a simple
enumeration of the data. Instead, concept description
generates descriptions for characterization and
comparison of the data.
It is sometimes called class description, when the concept to be
described refers to a class of objects
◦ Characterization: provides a concise and brief summarization of the
given collection of data
◦ Comparison: provides descriptions comparing two or more
collections of data
7. Concept Description vs. OLAP
OLAP:
◦ Data warehouse and OLAP tools are based on multidimensional data
model that views data in the form of data cube, consisting of
dimensions (or attributes) and measures (aggregate functions)
◦ The current OLAP systems confine dimensions to non-numeric data.
◦ Similarly, measures such as count(), sum(), average() in current OLAP
systems apply only to numeric data.
◦ restricted to a small number of dimension and measure types
◦ user-controlled process (the selection of dimensions and the
application of OLAP operations such as drill-down, roll-up, slicing
and dicing is controlled by the user)
Concept description in large databases :
◦ The database attributes can be of various types, including numeric,
nonnumeric, spatial, text or image
◦ can handle complex data types of the attributes and their aggregations
◦ a more automated process
8. Concept Description vs. OLAP
Concept description:
◦ can handle complex data types of the attributes and
their aggregations
◦ a more automated process
OLAP:
◦ restricted to a small number of dimension and measure
types
◦ user-controlled process
9. Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between
different classes
Mining descriptive statistical measures in large databases
Discussion
Summary
10. Data Generalization and Summarization-
based Characterization
Data and objects in databases contain detailed information at primitive
concept level.
For ex, the item relation in a sales database may contain attributes
describing low level item information such as item_ID, name, brand,
category, supplier, place_made and price.
It is useful to be able to summarize a large set of data and present it at a
high conceptual level.
For ex. Summarizing a large set of items relating to Christmas season
sales provides a general description of such data, which can be very
helpful for sales and marketing managers.
This requires important functionality called data generalization
11. Data Generalization and Summarization-
based Characterization
Data generalization
◦ A process which abstracts a large set of task-relevant data
in a database from low conceptual levels to higher ones.
◦ Approaches:
Data cube approach (OLAP approach)
Attribute-oriented induction approach
(Figure: conceptual levels 1 to 5, from low-level detail to high-level generalization.)
12. Characterization: Data Cube Approach
(without using AO-Induction)
Perform computations and store results in data cubes
Strength
◦ An efficient implementation of data generalization
◦ Computation of various kinds of measures
e.g., count( ), sum( ), average( ), max( )
◦ Generalization and specialization can be performed on a data cube
by roll-up and drill-down
Limitations
◦ handle only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
◦ Lack of intelligent analysis, can’t tell which dimensions should be
used and what levels should the generalization reach
13. Attribute-Oriented Induction (AOI)
The Attribute Oriented Induction (AOI) approach to data
generalization and summarization – based characterization was first
proposed in 1989 (KDD ‘89 workshop) a few years prior to the
introduction of the data cube approach.
The data cube approach can be considered as a data-warehouse-based,
precomputation-oriented, materialized approach.
It performs off-line aggregation before an OLAP or data mining
query is submitted for processing.
The attribute-oriented induction approach, on the other hand, is,
at least in its initial proposal, a relational-database-query-oriented,
generalization-based, on-line data analysis technique.
14. Attribute-Oriented Induction (AOI)
However, there is no inherent barrier distinguishing the two
approaches based on online aggregation versus offline
precomputation.
Some aggregations in the data cube can be computed on-line,
while off-line precomputation of multidimensional space can
speed up attribute-oriented induction as well.
15. Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular measures.
How it is done?
◦ Collect the task-relevant data( initial relation) using a relational
database query
◦ Perform generalization by attribute removal or attribute
generalization.
◦ Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts.
◦ This reduces the size of the generalized data set.
◦ Interactive presentation with users.
16. Basic Principles of
Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result
is the initial relation.
Attribute-removal: remove attribute A if there is a large set of distinct
values for A but
(1) there is no generalization operator on A, or
(2) A’s higher level concepts are expressed in terms of other attributes.
Attribute-generalization: If there is a large set of distinct values for A,
and there exists a set of generalization operators on A, then select an
operator and generalize A.
Attribute-threshold control: typically 2-8 distinct values, user-specified or default.
Generalized relation threshold control (10-30): control the final
relation/rule size.
17. Basic Algorithm for Attribute-Oriented Induction
InitialRel:
Query processing of task-relevant data, deriving the initial relation.
PreGen:
Based on the analysis of the number of distinct values in each attribute,
determine generalization plan for each attribute: removal? or how high to
generalize?
PrimeGen:
Based on the PreGen plan, perform generalization to the right level to
derive a “prime generalized relation”, accumulating the counts.
Presentation:
User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping
into rules, cross tabs, visualization presentations.
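The four steps above can be sketched in code. This is a minimal illustration under assumptions of my own (dictionary-based concept hierarchies, a tiny attribute threshold, invented helper names), not the full algorithm:

```python
from collections import Counter

def pregen_plan(values, hierarchy, attr_threshold=4):
    """PreGen: decide per attribute whether to keep, generalize, or remove it."""
    if len(set(values)) <= attr_threshold:
        return "keep"                # few distinct values: leave as-is
    if hierarchy is not None:
        return "generalize"          # a generalization operator exists
    return "remove"                  # many distinct values, no hierarchy

def primegen(rows, plans, hierarchies):
    """PrimeGen: generalize tuples, then merge identical ones, accumulating counts."""
    generalized = (
        tuple(hierarchies[a].get(v, v) if plans[a] == "generalize" else v
              for a, v in row.items() if plans[a] != "remove")
        for row in rows
    )
    return Counter(generalized)

# InitialRel: toy task-relevant data
rows = [{"name": "Jim", "major": "CS"},
        {"name": "Ann", "major": "Physics"},
        {"name": "Bo", "major": "CS"}]
hierarchies = {"major": {"CS": "Science", "Physics": "Science"}}
plans = {"name": pregen_plan([r["name"] for r in rows], None, attr_threshold=1),
         "major": pregen_plan([r["major"] for r in rows],
                              hierarchies["major"], attr_threshold=1)}
prime = primegen(rows, plans, hierarchies)  # the prime generalized relation
```

Here name has three distinct values and no concept hierarchy, so it is removed; major is generalized to Science, and the three tuples merge into one generalized tuple with count 3.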
18. Example
DMQL: Describe general characteristics of graduate
students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
19. Class Characterization: An Example
Initial relation:
Name           Gender  Major    Birth_Place            Birth_date  Residence                 Phone#    GPA
Jim Woodman    M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee      F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…              …       …        …                      …           …                         …         …
Generalization plan (see Principles and Algorithm): Name is removed; Gender is
retained; Major is generalized to {Sci, Eng, Bus}; Birth_Place is generalized to
Country; Birth_date to Age_range; Residence to City; Phone# is removed; GPA is
generalized to {Excl, VG, …}.
Prime generalized relation:
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …
Cross-tabulation of Birth_Region by Gender:
         Canada  Foreign  Total
M        16      14       30
F        10      22       32
Total    26      36       62
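To make this example concrete, the sketch below generalizes the three initial-relation tuples and merges identical results into the prime generalized relation. The hierarchy lookups, the explicit ages, and the banding cut-offs are illustrative assumptions, not taken from the slides:

```python
from collections import Counter

# Illustrative concept hierarchies (assumed for this sketch).
city_to_region = {"Vancouver, BC, Canada": "Canada",
                  "Montreal, Que, Canada": "Canada",
                  "Seattle, WA, USA": "Foreign"}
major_to_field = {"CS": "Science", "Physics": "Science"}

def generalize(name, gender, major, birth_place, age, residence_city, gpa):
    # Name and phone# are removed; the remaining attributes climb hierarchies.
    return (gender,
            major_to_field[major],
            city_to_region[birth_place],
            "20-25" if age <= 25 else "25-30",            # assumed age bands
            residence_city,
            "Very-good" if gpa < 3.75 else "Excellent")   # assumed GPA bands

rows = [
    ("Jim Woodman",    "M", "CS",      "Vancouver, BC, Canada", 24, "Richmond", 3.67),
    ("Scott Lachance", "M", "CS",      "Montreal, Que, Canada", 25, "Richmond", 3.70),
    ("Laura Lee",      "F", "Physics", "Seattle, WA, USA",      28, "Burnaby",  3.83),
]
# Merging identical generalized tuples accumulates their counts.
prime = Counter(generalize(*r) for r in rows)
```

The two CS students from Canada collapse into a single generalized tuple with count 2, exactly the kind of aggregation the prime generalized relation records.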
20. Presentation of Generalized Results
Generalized relation:
◦ Relations where some or all attributes are generalized, with counts or other
aggregation values accumulated.
Cross tabulation:
◦ Mapping results into cross tabulation form (similar to contingency tables).
Visualization techniques:
◦ Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
◦ Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = “Canada” [t: 53%] ∨ birth_region(x) = “foreign” [t: 47%]
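The t-weights in such a rule come directly from the cross-tabulation counts for male graduate students: each disjunct's weight is its count divided by the class total. A quick check:

```python
# Counts of male graduate students by birth region (from the crosstab):
counts = {"Canada": 16, "foreign": 14}
total = sum(counts.values())  # 30 male graduate students in all
# t-weight of a disjunct = its count / total count of the target class
t_weights = {region: round(100 * n / total) for region, n in counts.items()}
```

16/30 and 14/30 give the 53% and 47% t-weights quoted in the rule.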
21. Implementation by Cube Technology
Construct a data cube on-the-fly for the given data mining query
◦ Facilitate efficient drill-down analysis
◦ May increase the response time
◦ A balanced solution: precomputation of “subprime” relation
Use a predefined & precomputed data cube
◦ Construct a data cube beforehand
◦ Facilitate not only the attribute-oriented induction, but also attribute
relevance analysis, dicing, slicing, roll-up and drill-down
◦ Cost of cube computation and the nontrivial storage overhead
22. Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between
different classes
Mining descriptive statistical measures in large databases
Discussion
Summary
23. Analytical Characterization
Attribute Relevance Analysis
“What if I am not sure which attribute to include for class
characterization and class comparison? I may end up specifying
too many attributes, which could slow down the system
considerably.”
Measures of attribute relevance analysis can be used to help
identify irrelevant or weakly relevant attributes that can be
excluded from the concept description process.
The incorporation of this processing step into class
characterization or comparison is referred to as analytical
characterization or analytical comparison
24. Why Perform Attribute Relevance
Analysis??
The first limitation of OLAP tools is the handling of complex objects.
The second limitation is the lack of an automated generalization process:
the user must explicitly tell the system which dimensions should be included in the class
characterization and how high a level each dimension should be generalized.
Actually, each step of generalization or specialization on any dimension
must be specified by the user.
Usually, it is not difficult for a user to instruct a data mining system
regarding how high a level each dimension should be generalized.
For ex, users can set attribute generalization thresholds for this, or specify
which level a given dimension should reach, such as with the command
“generalize dimension location to the country level”.
25. Why Perform Attribute Relevance
Analysis??
Even without explicit user instruction, a default value such as 2 to 8 can
be set by the data mining system, which would allow each dimension to
be generalized to a level that contains only 2 to 8 distinct values.
On the other hand, a user may include too few attributes in the
analysis, causing incomplete mining results, or may introduce
too many attributes for analysis (e.g., “in relevance to *”).
Methods should be introduced to perform attribute relevance analysis in
order to filter out statistically irrelevant or weakly relevant attributes
Class characterization that includes the analysis of attribute/dimension
relevance is called analytical characterization.
Class comparison that includes such analysis is called analytical
comparison
26. Attribute Relevance Analysis
Why?
◦ Which dimensions should be included?
◦ How high level of generalization?
◦ Automatic vs. interactive
◦ Reduce number of attributes; easy to understand patterns
What?
◦ statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
◦ relevance related to dimensions and levels
◦ analytical characterization, analytical comparison
27. Steps for Attribute relevance analysis
Data Collection :
Collect data for both the target class and the contrasting class by query processing
Preliminary relevance analysis using conservative AOI:
• This step identifies a set of dimensions and attributes on which the selected relevance
measure is to be applied.
• The relation obtained by such an application of AOI is called the candidate relation of
the mining task.
Remove irrelevant and weakly relevant attributes using the selected
relevance analysis:
• We evaluate each attribute in the candidate relation using the selected relevance
analysis measure.
• This step results in an initial target class working relation and initial contrasting class
working relation.
Generate the concept description using AOI:
• Perform AOI using a less conservative set of attribute generalization thresholds.
• If the descriptive mining task is class characterization, only the ITCWR
(Initial Target Class Working Relation) is included; if it is class comparison,
both the ITCWR and the ICCWR (Initial Contrasting Class Working Relation)
are included.
28. Relevance Measures
Quantitative relevance measure determines the
classifying power of an attribute within a set of data.
Methods
◦ information gain (ID3)
◦ gain ratio (C4.5)
◦ gini index
◦ χ² contingency table statistics
◦ uncertainty coefficient
29. Entropy and Information Gain
S contains si tuples of class Ci, for i = 1, …, m.
Information required to classify any arbitrary tuple:
I(s1, s2, ..., sm) = - Σ(i=1..m) (si / s) log2(si / s)
Entropy of attribute A with values {a1, a2, …, av}:
E(A) = Σ(j=1..v) ((s1j + ... + smj) / s) I(s1j, ..., smj)
Information gained by branching on attribute A:
Gain(A) = I(s1, s2, ..., sm) - E(A)
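The three formulas translate directly into code. A minimal sketch (the function names are mine, not from the slides):

```python
from math import log2

def info(counts):
    """I(s1, ..., sm): expected information needed to classify a tuple."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def entropy(partitions):
    """E(A): info of each partition induced by attribute A, weighted by size."""
    s = sum(sum(p) for p in partitions)
    return sum(sum(p) / s * info(p) for p in partitions)

def gain(counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(counts) - entropy(partitions)
```

With the worked example that follows (120 graduate vs. 130 undergraduate students, partitioned by major into Science, Engineering, Business), `gain([120, 130], [[84, 42], [36, 46], [0, 42]])` reproduces the 0.2115 figure.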
30. Example: Analytical Characterization
Task
◦ Mine general characteristics describing graduate students using
analytical characterization
Given
◦ Attributes :
name, gender, major, birth_place, birth_date, phone#, and gpa
◦ Gen(ai) = concept hierarchies on ai
◦ Ui = attribute analytical thresholds for ai
◦ Ti = attribute generalization thresholds for ai
◦ R = attribute relevance threshold
31. Eg: Analytical Characterization (cont’d)
1. Data collection
◦ target class: graduate student
◦ contrasting class: undergraduate student
2. Analytical generalization using Ui
◦ attribute removal
remove name and phone#
◦ attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
◦ candidate relation (obtained with a large attribute generalization
threshold): gender, major, birth_country, age_range and gpa
32. Example: Analytical characterization (2)
Candidate relation for Target class: Graduate students (Σ count = 120)
gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18
Candidate relation for Contrasting class: Undergraduate students (Σ count = 130)
gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24
33. Eg: Analytical characterization (3)
3. Relevance analysis
◦ Calculate the expected information required to classify an arbitrary tuple:
I(s1, s2) = I(120, 130) = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988
◦ Calculate the entropy of each attribute, e.g. major:
For major = “Science”:     s11 = 84, s21 = 42, I(s11, s21) = 0.9183
For major = “Engineering”: s12 = 36, s22 = 46, I(s12, s22) = 0.9892
For major = “Business”:    s13 = 0,  s23 = 42, I(s13, s23) = 0
(s13 is the number of graduate students in “Business”; s23 is the number of
undergraduate students in “Business”.)
34. Example: Analytical Characterization (4)
Calculate expected info required to classify a given sample if S is
partitioned according to the attribute
Calculate information gain for each attribute
◦ Information gain for all attributes
E(major) = (126/250)·I(s11, s21) + (82/250)·I(s12, s22) + (42/250)·I(s13, s23) = 0.7873
Gain(major)= I(s1,s2)− E(major)= 0.2115
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
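The relevance computation above (class entropy, expected information of an attribute, and the resulting gain) can be sketched in Python. The function names are mine; the counts are the (graduate, undergraduate) class counts from the candidate relations on the previous slides.

```python
import math

def info(counts):
    """Entropy I(s1, ..., sm) of a class-count distribution, in bits."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(partition, class_totals):
    """Information gain = I(s1, s2) - E(attribute), for per-value class counts."""
    n = sum(class_totals)
    expected = sum(sum(counts) / n * info(counts) for counts in partition)
    return info(class_totals) - expected

# 120 graduate vs. 130 undergraduate students
class_totals = (120, 130)
# major: (graduate, undergraduate) counts per attribute value
major = [(84, 42),   # Science
         (36, 46),   # Engineering
         (0, 42)]    # Business

print(round(info(class_totals), 4))         # 0.9988
print(round(gain(major, class_totals), 4))  # 0.2116 (the slide truncates to 0.2115)
```

The same `gain` call on the counts of the other attributes reproduces the ranking shown above, with gpa and age_range far more relevant than gender.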
35. Example: Analytical characterization (5)
4. Initial working relation (W0) derivation
◦ R = 0.1 (attribute relevance threshold)
◦ remove irrelevant/weakly relevant attributes from candidate relation =>
drop gender, birth_country
◦ remove contrasting class candidate relation
5. Perform attribute-oriented induction on W0 using Ti
major age_range gpa count
Science 20-25 Very_good 16
Science 25-30 Excellent 47
Science 20-25 Excellent 21
Engineering 20-25 Excellent 18
Engineering 25-30 Excellent 18
Initial target class working relation W0: Graduate students
42. Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different
classes
Mining descriptive statistical measures in large databases
Discussion
Summary
44. Class Comparisons Methods and Implementations
Data Collection: The set of relevant data in the database is collected by query
processing and is partitioned into target class and contrasting class.
Dimension relevance analysis: If there are many dimensions and analytical
comparison is desired, then dimension relevance analysis should be performed on
these classes and only the highly relevant dimensions are included in the further
analysis.
Synchronous Generalization: Generalization is performed on the target class to the
level controlled by a user- or expert-specified dimension threshold, which results in a
prime target class relation/cuboid. The concepts in the contrasting class(es) are
generalized to the same level as those in the prime target class relation/cuboid,
forming the prime contrasting class relation/cuboid.
Presentation of the derived comparison: The resulting class comparison
description can be visualized in the form of tables, graphs and rules. This
presentation usually includes a "contrasting" measure (such as count%) that reflects
the comparison between the target and contrasting classes.
45. Example: Analytical comparison
Task
◦ Compare graduate and undergraduate students using discriminant
rule.
◦ DMQL query
use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
46. Example: Analytical comparison (2)
Given
◦ attributes name, gender, major, birth_place,
birth_date, residence, phone# and gpa
◦ Gen(ai) = concept hierarchies on attributes ai
◦ Ui = attribute analytical thresholds for attributes ai
◦ Ti = attribute generalization thresholds for
attributes ai
◦ R = attribute relevance threshold
47. Example: Analytical comparison (3)
1. Data collection
◦ target and contrasting classes
2. Attribute relevance analysis
◦ remove attributes name, gender, major, phone#
3. Synchronous generalization
◦ controlled by user-specified dimension thresholds
◦ prime target and contrasting class(es) relations/cuboids
48. Example: Analytical comparison (4)
4. Drill down, roll up and other OLAP operations on target and
contrasting classes to adjust levels of abstractions of resulting
description
5. Presentation
◦ as generalized relations, crosstabs, bar charts, pie charts, or
rules
◦ contrasting measures to reflect comparison between target
and contrasting classes
e.g. count%
49.
Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                  | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond    | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond     | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby   | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                          | …        | …
Table 5.7 Initial target class working relation (graduate students)

Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                  | Phone #  | GPA
Bob Schumann   | M      | Chem    | Calgary, Alt, Canada  | 10-1-78    | 2642 Halifax St., Burnaby  | 294-4291 | 2.96
Amy Eau        | F      | Bio     | Golden, BC, Canada    | 30-3-76    | 463 Sunset Cres., Vancouver| 681-5417 | 3.52
…              | …      | …       | …                     | …          | …                          | …        | …
Table 5.8 Initial contrasting class working relation (undergraduate students)
50. Example: Analytical comparison (5)
Major Age_range Gpa Count%
Science 20-25 Good 5.53%
Science 26-30 Good 2.32%
Science Over_30 Very_good 5.86%
… … … …
Business Over_30 Excellent 4.68%
Prime generalized relation for the target class: Graduate students
Major Age_range Gpa Count%
Science 15-20 Fair 5.53%
Science 15-20 Good 4.53%
… … … …
Science 26-30 Good 5.02%
… … … …
Business Over_30 Excellent 0.68%
Prime generalized relation for the contrasting class: Undergraduate students
53. Quantitative Characteristic Rules
Cj = target class
qa = a generalized tuple that covers some tuples of the target class
◦ but may also cover some tuples of the contrasting class
t-weight
◦ range: [0, 1] or [0%, 100%]
◦ t_weight = count(qa) / Σ_{i=1..n} count(qi),
where q1, …, qn are the generalized tuples of the target class
Presentation of Class Characterization Descriptions
A quantitative rule is a logical form associated with quantitative information;
it attaches an interestingness measure, the t-weight, to each tuple.
54. ∀X, grad(X) ∧ male(X) ⇒
birth_region(X) = "Canada" [t: 53%] ∨ birth_region(X) = "foreign" [t: 47%]
55. Quantitative Discriminant Rules
Cj = target class
qa = a generalized tuple covers some tuples of class
◦ but can also cover some tuples of contrasting class
d-weight
◦ range: [0, 1]
◦ d_weight = count(qa ∈ Cj) / Σ_{i=1..m} count(qa ∈ Ci),
where m is the total number of target and contrasting classes
Presentation of Class Comparison Descriptions
The discriminative features of the target and contrasting classes can be
described by a discriminant rule, which associates an interestingness
measure, the d-weight, with each tuple.
56. Example: Quantitative Discriminant Rule
Status        | Birth_country | Age_range | Gpa  | Count
Graduate      | Canada        | 25-30     | Good | 90
Undergraduate | Canada        | 25-30     | Good | 210
Count distribution between graduate and undergraduate students for a generalized tuple

In the above example, suppose that the count distribution for
birth_country = "Canada", age_range = "25-30" and gpa = "Good" is as shown in the table.
The d-weight would be 90/(90+210) = 30% w.r.t. the target class, and
the d-weight would be 210/(90+210) = 70% w.r.t. the contrasting class.
That is, if a student was born in Canada, is 25 to 30 years old and has a good gpa, then
based on the data there is a 30% probability that she/he is a graduate student versus a
70% probability that she/he is an undergraduate student.
Similarly, the d-weights for the other tuples can be derived.
57. Example: Quantitative Discriminant Rule
A quantitative discriminant rule for the target class of a given comparison
is written in the form

∀X, target_class(X) ⇐ condition(X) [d: d_weight]

Based on the above, a discriminant rule for the target class graduate_student
can be written as

∀X, graduate_student(X) ⇐ birth_country(X) = "Canada" ∧
age_range(X) = "25-30" ∧ gpa(X) = "good" [d: 30%]

Note: the discriminant rule provides a sufficient condition, but not a necessary one,
for an object to be in the target class.
For example, the rule implies that if X satisfies the condition, then the probability that X
is a graduate student is 30%.
58. Location / Item TV Computer both_items
Europe 80 240 320
North America 120 560 680
Both_regions 200 800 1000
A crosstab for the total number (count) of TVs and computers sold in thousands in 1999
To calculate the t-weight (typicality weight), the formula is
t_weight = count(qa) / Σ_{i=1..n} count(qi)
1. 80 / (80+240) = 25%
2. 120 / (120+560) = 17.65%
3. 200 / (200+800) = 20%
To calculate the d-weight (discriminability weight), the formula is
d_weight = count(qa ∈ Cj) / Σ_{i=1..m} count(qa ∈ Ci)
1. 80 / (80+120) = 40%
2. 120 / (80+120) = 60%
3. 200 / (80+120) = 100%
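A minimal sketch of the two measures applied to the crosstab counts above; the helper names are mine.

```python
def t_weight(count_qa, class_tuple_counts):
    """Typicality: qa's count over the total count of all tuples in the class."""
    return count_qa / sum(class_tuple_counts)

def d_weight(count_in_target, counts_across_classes):
    """Discriminability: qa's count in the target class over qa's count in all classes."""
    return count_in_target / sum(counts_across_classes)

# Counts (in thousands) from the crosstab: TV and computer sales per region
tv = {"Europe": 80, "North America": 120}
computer = {"Europe": 240, "North America": 560}

# t-weight of TV among Europe's sales: 80 / (80 + 240) = 25%
print(t_weight(tv["Europe"], [tv["Europe"], computer["Europe"]]))  # 0.25
# d-weight of (item = TV) for target class Europe: 80 / (80 + 120) = 40%
print(d_weight(tv["Europe"], tv.values()))  # 0.4
```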
59. Class Description
Quantitative characteristic rule
◦ necessary: ∀X, target_class(X) ⇒ condition(X) [t: t_weight]
Quantitative discriminant rule
◦ sufficient: ∀X, target_class(X) ⇐ condition(X) [d: d_weight]
Quantitative description rule
◦ necessary and sufficient:
∀X, target_class(X) ⇔ condition1(X) [t: w1, d: w′1] ∨ … ∨ conditionn(X) [t: wn, d: w′n]
60. Example: Quantitative Description Rule
• Quantitative description rule for target class Europe
Location/item   TV                       Computer                 Both_items
                Count  t-wt     d-wt     Count  t-wt     d-wt     Count  t-wt   d-wt
Europe          80     25%      40%      240    75%      30%      320    100%   32%
N_Am            120    17.65%   60%      560    82.35%   70%      680    100%   68%
Both_regions    200    20%      100%     800    80%      100%     1000   100%   100%

Crosstab showing associated t-weight, d-weight values and total number (in thousands) of
TVs and computers sold at AllElectronics in 1998

To define a quantitative characteristic rule, we introduce the t-weight as an interestingness
measure that describes the typicality of each disjunct in the rule:
t_weight = count(qa) / Σ_{i=1..n} count(qi)

∀X, Europe(X) ⇒ (item(X) = "TV") [t: 25%, d: 40%] ∨ (item(X) = "computer") [t: 75%, d: 30%]
61. Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different
classes
Mining descriptive statistical measures in large databases
Discussion
Summary
62. Measuring the Central Tendency
Mean
◦ Weighted arithmetic mean
Median: A holistic measure
◦ Middle value if odd number of values, or average of the middle two
values otherwise
◦ estimated by interpolation
Mode
◦ Value that occurs most frequently in the data
◦ Unimodal, bimodal, trimodal
◦ Empirical formula:
Mean: x̄ = (1/n)·Σ_{i=1..n} xi
Weighted arithmetic mean: x̄ = (Σ_{i=1..n} wi·xi) / (Σ_{i=1..n} wi)
Median (by interpolation): median = L1 + ((n/2 − (Σ f)_l) / f_median)·c
Empirical formula: mean − mode = 3 × (mean − median)
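The central-tendency measures above can be sketched in a few lines of Python; the function names and sample data are mine, for illustration only.

```python
from collections import Counter

def mean(xs):
    """Arithmetic mean: sum of values over their count."""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(wi * xi) / sum(wi)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def median(xs):
    """Middle value if n is odd, average of the two middle values otherwise."""
    s, n = sorted(xs), len(xs)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def mode(xs):
    """Most frequently occurring value (first one, if several tie)."""
    return Counter(xs).most_common(1)[0][0]

data = [1, 2, 2, 3, 4, 7, 9]
print(mean(data), median(data), mode(data))  # 4.0 3 2
```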
63. Measuring the Dispersion of Data
Quartiles, outliers and boxplots
◦ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
◦ Inter-quartile range: IQR = Q3 – Q1
◦ Five number summary: min, Q1, M, Q3, max
◦ Boxplot: ends of the box are the quartiles, median is marked, whiskers,
and plot outlier individually
◦ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation
◦ Variance s2: (algebraic, scalable computation)
◦ Standard deviation s is the square root of variance s2
s² = (1/(n−1))·Σ_{i=1..n} (xi − x̄)² = (1/(n−1))·[Σ_{i=1..n} xi² − (1/n)·(Σ_{i=1..n} xi)²]
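The claim that s² is algebraic and scalable rests on the identity above: the sums Σxi and Σxi² can be accumulated in a single scan of the data, so the definitional two-pass form and the one-pass algebraic form agree. A small sketch (names and data mine):

```python
def variance_two_pass(xs):
    """Definitional form: s^2 = (1/(n-1)) * sum((xi - mean)^2)."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def variance_one_pass(xs):
    """Algebraic form: (1/(n-1)) * [sum(xi^2) - (1/n) * (sum xi)^2], one scan."""
    n, s, sq = len(xs), 0.0, 0.0
    for x in xs:          # single pass: only two running sums are kept
        s += x
        sq += x * x
    return (sq - s * s / n) / (n - 1)

data = [4.0, 7.0, 13.0, 16.0]
print(variance_two_pass(data))  # 30.0
print(variance_one_pass(data))  # 30.0
```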
64. Boxplot Analysis
Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
Boxplot
◦ Data is represented with a box
◦ The ends of the box are at the first and third quartiles,
i.e., the height of the box is the IQR
◦ The median is marked by a line within the box
◦ Whiskers: two lines outside the box extend to
Minimum and Maximum
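The five-number summary and the 1.5 × IQR outlier fence can be sketched as below. Quartile conventions vary; this uses the median-of-halves convention, and the names and data are mine.

```python
def _median(s):
    """Median of an already-sorted sequence."""
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

def five_number_summary(xs):
    """(min, Q1, median, Q3, max) using the median-of-halves convention."""
    s = sorted(xs)
    n = len(s)
    q1 = _median(s[:n // 2])        # median of the lower half
    q3 = _median(s[(n + 1) // 2:])  # median of the upper half
    return s[0], q1, _median(s), q3, s[-1]

data = [7, 15, 36, 39, 40, 41]
lo, q1, med, q3, hi = five_number_summary(data)
iqr = q3 - q1
# Values more than 1.5 * IQR beyond the quartiles are flagged as outliers
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print((lo, q1, med, q3, hi), outliers)  # (7, 15, 37.5, 40, 41) []
```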
67. Mining Descriptive Statistical Measures in
Large Databases
Variance
Standard deviation: the square root of the
variance
◦ Measures spread about the mean
◦ It is zero if and only if all the values are equal
◦ Both the deviation and the variance are algebraic
s² = (1/(n−1))·Σ_{i=1..n} (xi − x̄)² = (1/(n−1))·[Σ_{i=1..n} xi² − (1/n)·(Σ_{i=1..n} xi)²]
68. Histogram Analysis
Graph displays of basic statistical class descriptions
◦ Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies of
the classes present in the given data
69. Quantile Plot
Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
Plots quantile information
◦ For data xi sorted in increasing order, fi indicates that
approximately 100·fi% of the data are below or equal to the
value xi
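One common choice of the plotting fraction is fi = (i − 0.5)/n, which spreads the points evenly without pinning the extremes at 0% or 100%. A sketch under that assumption (names mine):

```python
def quantile_points(xs):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, i = 1..n."""
    s = sorted(xs)
    n = len(s)
    return [((i - 0.5) / n, x) for i, x in enumerate(s, start=1)]

# For n = 4 values the fractions are 0.125, 0.375, 0.625, 0.875
print(quantile_points([3, 1, 4, 2]))  # [(0.125, 1), (0.375, 2), (0.625, 3), (0.875, 4)]
```

Plotting the (fi, xi) pairs directly gives the quantile plot described above.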
70. Quantile-Quantile (Q-Q) Plot
Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
Allows the user to view whether there is a shift in going from
one distribution to another
71. Scatter plot
Provides a first look at bivariate data to see clusters of
points, outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
72. Loess Curve
Adds a smooth curve to a scatter plot in order to provide
better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a smoothing
parameter, and the degree of the polynomials that are fitted by
the regression
73. Graphic Displays of Basic Statistical Descriptions
Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value xi is paired with fi indicating that
approximately 100·fi% of the data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of one
univariate distribution against the corresponding quantiles of
another
Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
Loess (local regression) curve: add a smooth curve to a scatter
plot to provide better perception of the pattern of dependence
74. Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between
different classes
Mining descriptive statistical measures in large databases
Discussion
Summary
75. AO Induction vs. Learning-from-example
Paradigm
Difference in philosophies and basic assumptions
◦ Positive and negative samples in learning-from-example:
positive used for generalization, negative - for specialization
◦ Positive samples only in data mining:
hence generalization-based; to drill down, backtrack the
generalization to a previous state
Difference in methods of generalizations
◦ Machine learning generalizes on a tuple by tuple basis
◦ Data mining generalizes on an attribute by attribute basis
77. Incremental and Parallel Mining of
Concept Description
Incremental mining: revision based on newly added data ΔDB
◦ Generalize ΔDB to the same level of abstraction as the generalized
relation R to derive ΔR
◦ Union R ∪ ΔR, i.e., merge counts and other statistical information
to produce a new relation R′
Similar philosophy can be applied to data sampling,
parallel and/or distributed mining, etc.
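The merge step above can be sketched with counters keyed by generalized tuples; the relation contents here are made up for illustration.

```python
from collections import Counter

# Generalized relation R: counts keyed by generalized tuples
R = Counter({("Science", "20-25", "Very_good"): 16,
             ("Science", "25-30", "Excellent"): 47})
# Counts obtained by generalizing the newly added data to the same level
delta_R = Counter({("Science", "25-30", "Excellent"): 3,
                   ("Business", "25-30", "Good"): 5})

# Union: merge counts tuple-by-tuple to produce the revised relation R'
R_new = R + delta_R
print(R_new[("Science", "25-30", "Excellent")])  # 50
print(R_new[("Business", "25-30", "Good")])      # 5
```

Because only the tuple counts are merged, no re-scan of the original database is needed, which is what makes the incremental revision cheap.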
78. Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different
classes
Mining descriptive statistical measures in large databases
Discussion
Summary
79. Summary
Concept description: characterization and discrimination
OLAP-based vs. attribute-oriented induction
Efficient implementation of AOI
Analytical characterization and comparison
Mining descriptive statistical measures in large databases
Discussion
◦ Incremental and parallel mining of description
◦ Descriptive mining of complex types of data