This presentation introduces data preprocessing in the field of data mining. Images, examples, and other material are adapted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber, and Jian Pei.
2. Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
4. Why Data Preprocessing?
Data in the real world is dirty (huge size, multiple heterogeneous sources)
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., occupation=“”
noisy: containing errors or outliers
e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records
5. Why Is Data Dirty?
Incomplete data comes from
n/a data value when collected
different considerations between the time the data was collected and the time it is analyzed
human/hardware/software problems
Noisy data comes from the process of:
data collection
data entry
data transmission
Inconsistent data comes from
Different data sources
Functional dependency violation
6. Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics
A data warehouse needs consistent integration of quality data
Data preprocessing helps improve the efficiency and ease of the mining process
“Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse.” — Bill Inmon
7. Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Believability
Timeliness
Value added
Interpretability
Accessibility
Broad categories:
intrinsic, contextual, representational, and accessibility.
http://www.dataquality.com/998mathieu.htm
8. Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
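As a concrete illustration of the transformation task, here is a minimal sketch of min-max normalization in Python — an assumed choice of method and made-up sample values, since the slides only name normalization without detailing it:

```python
# Min-max normalization: rescale an attribute to the range [0, 1].
# The salary values are invented sample data, not from the slides.
import numpy as np

salaries = np.array([30000, 45000, 60000, 98000, 120000], dtype=float)
normalized = (salaries - salaries.min()) / (salaries.max() - salaries.min())
print(normalized)  # ≈ [0.    0.167 0.333 0.756 1.   ]
```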
9. Forms of data preprocessing
10. Detecting data anomalies, rectifying them early, and reducing the data to be analysed can lead to huge payoffs for decision making.
Improves the efficiency and ease of the mining process
11. Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
12. Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data warehousing” —Ralph Kimball
“Data cleaning is the number one problem in data warehousing” —DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
13. Missing Data
Data is not always available
e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data deleted because it was inconsistent with other recorded data
data not entered due to misunderstanding
certain data not considered important at the time of entry
history or changes of the data not being registered
Missing data may need to be inferred.
14. How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., “unknown” — a new class?!
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or a decision tree
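As a concrete illustration of the automatic fill-in options above, here is a minimal pandas sketch; the column names and values are invented for the example.

```python
import pandas as pd

# Toy data: some tuples are missing the income attribute.
df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# Global constant: mark the gap rather than guess a value.
const_fill = df["income"].fillna(-1)

# Attribute mean over all tuples: (30 + 50 + 70) / 3 = 50.
mean_fill = df["income"].fillna(df["income"].mean())

# Smarter: the attribute mean within each class label.
class_mean_fill = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(mean_fill.tolist())        # [30.0, 50.0, 50.0, 50.0, 70.0]
print(class_mean_fill.tolist())  # [30.0, 30.0, 50.0, 60.0, 70.0]
```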
15. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
16. How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by a human (e.g., deal with possible outliers)
Regression
smooth by fitting the data to regression functions
17. Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N.
The most straightforward, but outliers may dominate the presentation
Skewed data is not handled well.
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky.
18. Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
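The worked example above can be reproduced with a short NumPy sketch (rounding the bin means to integers mirrors the slide):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)  # already sorted; three equi-depth bins of four

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = np.repeat(bins.mean(axis=1).round().astype(int), 4)

# Smoothing by bin boundaries: snap each value to the nearer bin edge.
lo, hi = bins[:, [0]], bins[:, [-1]]
by_bounds = np.where(bins - lo < hi - bins, lo, hi)

print(by_means)           # [ 9  9  9  9 23 23 23 23 29 29 29 29]
print(by_bounds.ravel())  # [ 4  4  4 15 21 21 25 25 26 26 26 34]
```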
20. Regression
[Figure: scatter plot with the fitted line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1′ on the line.]
21. Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
22. Data Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real-world entity, attribute values from different sources differ
possible reasons: different representations, different scales, e.g., metric vs. British units
23. Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple databases
The same attribute may have different names in different databases
One attribute may be a “derived” attribute in another table, e.g., annual revenue
Redundant data may be detected by correlation analysis
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
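A minimal sketch of detecting a redundant (“derived”) attribute by correlation analysis; the synthetic annual/monthly revenue data is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
annual = rng.normal(1000, 200, size=500)            # annual revenue
monthly = annual / 12 + rng.normal(0, 5, size=500)  # derived, plus noise

# A Pearson correlation near +/-1 flags the pair as likely redundant.
r = np.corrcoef(annual, monthly)[0, 1]
print(f"correlation = {r:.3f}")  # ~0.96: keep one attribute, drop the other
```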
24. Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
low-level raw data -> higher-level concepts, e.g., Age -> youth, middle_age, senior
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
25. Data Transformation: Normalization
min-max normalization (v is a value of attribute A):
v' = (v − min_A) / (max_A − min_A) * (new_max_A − new_min_A) + new_min_A
z-score normalization (zero-mean normalization); use when the min and max values are unknown, or when outliers dominate min-max normalization:
v' = (v − mean_A) / stand_dev_A
normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Exercise: normalize {0, 6, 8, 14} and {6, 6, 8, 8}
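The three normalizations, applied to the exercise data, in a short NumPy sketch:

```python
import numpy as np

def min_max(v, new_min=0.0, new_max=1.0):
    # v' = (v - min_A)/(max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score(v):
    # v' = (v - mean_A) / stand_dev_A
    return (v - v.mean()) / v.std()

def decimal_scaling(v):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j

a = np.array([0.0, 6, 8, 14])
b = np.array([6.0, 6, 8, 8])
print(min_max(a))          # approx. [0., 0.43, 0.57, 1.]
print(z_score(b))          # [-1. -1.  1.  1.]
print(decimal_scaling(a))  # [0.   0.06 0.08 0.14]
```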
26. Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
27. Data Reduction Strategies
A data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation (aggregation operations to construct a data cube)
Dimensionality reduction—remove unimportant attributes; encoding mechanisms to reduce data size
Data compression
Numerosity reduction—fit data into models (parametric: store model parameters; non-parametric: clustering, sampling, histograms)
Discretization and concept hierarchy generation—raw data values replaced by ranges or higher conceptual levels
The computational time spent on data reduction should not outweigh the time saved by mining the reduced data set
28. Data Cube Aggregation
The lowest level of a data cube (the base cuboid) holds the aggregated data for an individual entity of interest
e.g., sales, customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes (cuboids, lattice)
Higher levels of abstraction further reduce the size of the data to deal with
Reference appropriate levels
Use the smallest representation which is enough to solve the task
Queries regarding aggregated information should be answered using the data cube when possible
29. Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the probability distribution of the classes given the values of those features is as close as possible to the original distribution given the values of all features
Reducing the number of attributes in the discovered patterns makes the patterns easier to understand
How to find a ‘good’ subset of the original attributes?
Heuristic methods (due to the exponential number of choices: n attributes -> 2^n subsets):
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
30. Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree. A4 is tested at the root; its Y and N branches lead to tests on A1 and A6, whose branches end in leaves labelled Class 1 or Class 2.]
Internal (non-leaf) node: a test on an attribute
Branch: an outcome of the test
Leaf node: a class decision
At each node, choose the best attribute to partition the data into individual classes (from the given data). Attributes not appearing in the tree are irrelevant.
-> Reduced attribute set: {A1, A4, A6}
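A sketch of decision-tree-based attribute subset selection, assuming scikit-learn is available; the synthetic data is constructed so that only A1, A4, and A6 carry signal.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumes scikit-learn

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))                      # attributes A1..A6
y = (X[:, 0] + X[:, 3] + X[:, 5] > 0).astype(int)  # only A1, A4, A6 matter

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Attributes tested at internal nodes (leaves are coded as negative values).
used = sorted({f for f in tree.tree_.feature if f >= 0})
print(["A%d" % (i + 1) for i in used])  # usually drawn from A1, A4, A6
```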
31. Data Compression (compressed representation of original data)
String compression
There are extensive theories and well-tuned algorithms
Typically lossless (the original can be reconstructed without loss of data)
But only limited manipulation is possible without expansion
Audio/video compression
Typically lossy compression, with progressive refinement (approximate reconstruction of the original)
Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
Time sequence data
Typically short and varying slowly with time
32. Data Compression
[Figure: lossless compression maps the original data to compressed data and reconstructs it exactly; lossy compression reconstructs only an approximation of the original.]
33. Wavelet Transformation
Example wavelet families: Haar-2, Daubechies-4
Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis
Compressed approximation: store only a small fraction of the strongest wavelet coefficients
Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
Method:
The length L must be an integer power of 2 (pad with 0s when necessary)
Each transform has 2 functions: smoothing and difference
Apply them to pairs of data points, producing two data sets of length L/2
Apply the two functions recursively until the desired length is reached
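A minimal sketch of the recursive smoothing/difference scheme, using the unnormalized (average/difference) Haar transform rather than the orthonormal variant:

```python
import numpy as np

def haar_dwt(x):
    """Recursively average (smooth) and difference adjacent pairs.
    len(x) must be a power of 2; pad with zeros beforehand if not."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        smooth = (x[0::2] + x[1::2]) / 2
        details.append((x[0::2] - x[1::2]) / 2)
        x = smooth
    return x[0], details[::-1]  # overall average + detail coefficients

avg, details = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
print(avg)  # 2.75; keep only the strongest detail coefficients to compress
```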
34. Principal Component Analysis
Given N data vectors of k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
Each data vector is a linear combination of the c principal component vectors
Works for numeric data only
Used when the number of dimensions is large
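A compact PCA sketch via the singular value decomposition; the random data is illustrative.

```python
import numpy as np

def pca_reduce(X, c):
    """Project N x k numeric data onto its top-c principal components."""
    Xc = X - X.mean(axis=0)                      # center each attribute
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:c].T                         # N x c reduced data

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))    # N = 100 vectors, k = 5 dimensions
print(pca_reduce(X, 2).shape)    # (100, 2)
```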
35. Numerosity Reduction
Can we reduce data volume by choosing alternative, smaller forms of data representation?
Parametric methods
Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
36. Regression and Log-Linear Models
Linear regression: data are modeled to fit a straight line
Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
37. Regression Analysis and Log-Linear Models
Linear regression: Y = a + b X
Two parameters, a and b, specify the line and are estimated using the data at hand,
applying the least-squares criterion to the known values (X1, Y1), (X2, Y2), …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above.
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables.
Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
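The least-squares estimates of a and b in Y = a + bX have a familiar closed form; a small sketch with invented data:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 5])
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8])

# Least squares: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()
print(f"Y = {a:.2f} + {b:.2f} X")  # Y = 1.14 + 0.96 X
```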
38. Histograms (buckets: attribute-value/frequency pairs)
A popular data reduction technique
Divide the data into buckets and store the average (or sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
Related to quantization problems.
[Figure: bar chart of counts per bucket for the price data.]
E.g., list of prices of items sold: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
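For the price list above, an equal-width histogram with three buckets stores just three (range, count) pairs instead of 53 values:

```python
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20,
          20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28,
          30, 30, 30]

counts, edges = np.histogram(prices, bins=3, range=(0, 30))
# Buckets are [lo, hi), with the last bucket closed at 30.
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {n} items")   # 9, 23, and 21 items
```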
39. Clustering
Partition the data set into clusters, and store only the cluster representations
Can be very effective if the data is clustered, but not if the data is “smeared”
Can use hierarchical clustering and store the result in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms
40. Sampling (a large data set represented by a much smaller random sample)
Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
Choose a representative subset of the data
Simple random sampling may perform very poorly in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data
Sampling may not reduce database I/Os (done a page at a time).
41. Sampling
[Figure: the raw data reduced by SRSWOR (simple random sampling without replacement) and by SRSWR (simple random sampling with replacement).]
42. Sampling
[Figure: the raw data reduced by a cluster/stratified sample.]
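A pandas sketch of the three sampling schemes on an invented, skewed two-class table:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"cls": ["a"] * 90 + ["b"] * 10,
                   "x": rng.normal(size=100)})

srswor = df.sample(n=20, replace=False, random_state=3)  # SRSWOR
srswr = df.sample(n=20, replace=True, random_state=3)    # SRSWR

# Stratified: a 20% sample that preserves each class's share (18 a, 2 b).
strat = df.groupby("cls", group_keys=False).sample(frac=0.2, random_state=3)
print(strat["cls"].value_counts().to_dict())  # {'a': 18, 'b': 2}
```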
43. Hierarchical Reduction
Use a multi-resolution structure with different degrees of reduction
Hierarchical clustering is often performed, but tends to define partitions of data sets rather than “clusters”
Parametric methods are usually not amenable to hierarchical representation
Hierarchical aggregation
An index tree hierarchically divides a data set into partitions by the value ranges of some attributes
Each partition can be considered a bucket
Thus an index tree with aggregates stored at each node is a hierarchical histogram
44. Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
45. Discretization (divide range of attributes into intervals)
Three types of attributes:
Nominal — values from an unordered set
Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Prepare for further analysis
46. Discretization and Concept Hierarchy
Discretization
reduce the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then replace actual data values
Concept hierarchies
reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
47. Discretization and Concept Hierarchy Generation for Numeric Data
Binning (see earlier sections)
Histogram analysis (see earlier sections)
Clustering analysis (see earlier sections)
Entropy-based discretization
Segmentation by natural partitioning
48. Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g.,
Ent(S) − E(T, S) > δ
Experiments show that it may reduce data size and improve classification accuracy
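A direct sketch of the boundary search: evaluate E(S, T) at every candidate cut point and keep the minimizer. The values and labels below are invented.

```python
import numpy as np

def ent(labels):
    """Class entropy, Ent(S) = -sum(p * log2(p))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_boundary(values, labels):
    """Boundary T minimizing (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    n, best = len(values), (None, np.inf)
    for i in range(1, n):
        e = (i * ent(labels[:i]) + (n - i) * ent(labels[i:])) / n
        if e < best[1]:
            best = ((values[i - 1] + values[i]) / 2, e)
    return best  # recurse on each side while the entropy gain exceeds delta

v = np.array([1.0, 2, 3, 10, 11, 12])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_boundary(v, y))  # (6.5, 0.0): a perfect binary split
```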
49. Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
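A simplified sketch of the partitioning step, assuming the segment is the closed range [low, high] and ignoring the full rule's outlier handling (which works on the 5th–95th percentile range):

```python
import math

def segments_345(low, high):
    """Partition [low, high] into 3, 4, or 5 equi-width intervals, chosen
    from the number of distinct values at the most significant digit."""
    msd = 10 ** math.floor(math.log10(high - low))
    distinct = round((high - low) / msd)
    n = {3: 3, 6: 3, 7: 3, 9: 3,        # -> 3 intervals
         2: 4, 4: 4, 8: 4,              # -> 4 intervals
         1: 5, 5: 5, 10: 5}[distinct]   # -> 5 intervals
    width = (high - low) / n
    return [(low + i * width, low + (i + 1) * width) for i in range(n)]

print(segments_345(0, 9000))  # 9 at the msd -> three intervals of 3000
```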
51. Concept Hierarchy Generation for Categorical Data
Discrete data: categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among values, e.g., geographic location, job category, item type
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a portion of a hierarchy by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of a set of attributes
The system automatically generates a partial ordering by analyzing the number of distinct values (a higher-level attribute has fewer distinct values than a lower-level attribute)
e.g., street < city < state < country
Specification of only a partial set of attributes
e.g., only street < city, not others
52. Automatic Concept Hierarchy Generation
Some concept hierarchies can be generated automatically by analyzing the number of distinct values per attribute in the given data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Note the exceptions: weekday, month, quarter, year
[Figure: hierarchy from top to bottom: country (15 distinct values), province_or_state (65), city (3,567), street (674,339).]
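The distinct-value heuristic in a few lines, using the counts from the slide:

```python
# Fewer distinct values -> higher level in the hierarchy.
counts = {"country": 15, "province_or_state": 65,
          "city": 3567, "street": 674339}
levels = sorted(counts, key=counts.get)  # top of the hierarchy first
print(" < ".join(reversed(levels)))
# street < city < province_or_state < country
```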
53. Chapter 3: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
54. Summary
Data preparation is a big issue for both warehousing and mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
A lot of methods have been developed, but this is still an active area of research
55. References
E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4).
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
A. Maydanchik. Challenges of Efficient Data Cleansing. (DM Review - Data Quality resource portal)
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
D. Quass. A Framework for Research in Data Cleaning. (Draft, 1999)
V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB 2001.
T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.
http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf