The document discusses data mining functionalities, covering both descriptive and predictive tasks. Descriptive tasks characterize the general properties of the data, while predictive tasks perform induction on current data in order to make predictions. Specifically, it describes concept/class description, which characterizes and discriminates between classes or concepts by summarizing a target class, comparing it with contrasting classes, and presenting the output in forms such as charts and data cubes.
This document discusses data mining and related topics. It begins by defining data mining as the process of discovering patterns in large datasets using methods from machine learning, statistics, and database systems. The document then discusses data warehouses, how they work, and their role in data mining. It describes different data mining functionalities and tasks such as classification, prediction, and clustering. The document outlines some common data mining applications and issues related to methodology, performance, and diverse data types. Finally, it discusses some social implications of data mining involving privacy, profiling, and unauthorized use of data.
This document provides an introduction to data mining and data warehousing. It defines data mining as the process of extracting knowledge from large amounts of data. The evolution of database technologies is discussed, from early file processing systems to today's data warehousing and data mining capabilities. Key aspects of data mining systems and processes are described, including the typical architecture of a data mining system with components like data sources, data mining engines, and knowledge bases. The document also discusses the types of data that data mining can be applied to, such as relational databases and data warehouses.
The document provides an overview of data mining and data warehousing concepts through a series of lectures. It discusses the evolution of database technology and data analysis, defines data mining and knowledge discovery, describes data mining functionalities like classification and clustering, and covers data warehouse concepts like dimensional modeling and OLAP operations. It also presents sample queries in a proposed data mining query language.
Data mining involves classification, cluster analysis, outlier mining, and evolution analysis. Classification models data to distinguish classes using techniques like decision trees or neural networks. Cluster analysis groups similar objects without labels, while outlier mining finds irregular objects. Evolution analysis models changes over time. Data mining performance considers algorithm efficiency, scalability, and handling diverse and complex data types from multiple sources.
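The classification step described above can be illustrated with a toy "decision stump", the one-level special case of a decision tree; the numeric training data and the "low"/"high" labels below are invented for the sketch.

```python
# Toy classification: a one-level "decision stump", the simplest
# form of the decision trees mentioned above. Data are hypothetical.
def train_stump(samples):
    """Pick the threshold on a single numeric feature that best
    separates the two class labels."""
    best = None
    values = sorted(x for x, _ in samples)
    for i in range(len(values) - 1):
        t = (values[i] + values[i + 1]) / 2
        correct = sum((x > t) == (label == "high") for x, label in samples)
        acc = max(correct, len(samples) - correct) / len(samples)
        if best is None or acc > best[1]:
            # Orientation: does "above threshold" mean class "high"?
            above_is_high = correct >= len(samples) - correct
            best = (t, acc, above_is_high)
    return best

training = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
            (7.0, "high"), (8.0, "high"), (9.0, "high")]
threshold, accuracy, above_is_high = train_stump(training)

def predict(x):
    return "high" if (x > threshold) == above_is_high else "low"
```

A real decision tree applies this threshold search recursively over many features; the stump stops after one split, which is enough to show how a model "distinguishes classes".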
Data Mining: Classification and Analysis (Datamining Tools)
Data mining involves classification, cluster analysis, outlier mining, and evolution analysis. Classification models data to distinguish classes using techniques like decision trees or neural networks. Cluster analysis groups similar objects without labels, while outlier mining finds irregular objects. Evolution analysis models changes over time. Data mining performance depends on algorithm efficiency and scalability for large datasets across diverse database types.
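Outlier mining, as mentioned above, can be sketched with a simple z-score rule; the sensor-style readings below are hypothetical.

```python
import statistics

# Toy outlier mining: flag values more than two standard deviations
# from the mean. The readings are invented sensor values.
def find_outliers(values, z_limit=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > z_limit]

readings = [10, 11, 9, 10, 12, 11, 10, 48]
outliers = find_outliers(readings)  # the 48 stands out
```

Real systems use more robust detectors (distance- or density-based), since a single extreme value inflates the mean and standard deviation it is judged against.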
This document provides an overview of key concepts in data mining including data preprocessing, data warehouses, frequent patterns, association rule mining, classification, clustering, outlier analysis and more. It discusses different types of databases that can be mined such as relational, transactional, temporal and spatial databases. The document also covers data characterization, discrimination, interestingness measures and different types of data mining systems.
Data mining involves discovering hidden patterns in data, while data warehousing involves integrating data from multiple sources and storing it in a centralized location to support analysis. Some key differences are:
- Data mining uses techniques like classification, clustering, and association to discover insights from data, while data warehousing focuses on data integration and OLAP tools.
- Data mining looks for unknown relationships and makes predictions, while data warehousing provides a way to extract and analyze historical data.
- Data warehousing involves extracting, cleaning, and transforming data during an ETL process before loading it into a separate database optimized for analysis. Data mining builds on the outputs of data warehousing.
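The ETL process in the last point can be sketched in a few lines; the source records and field names below are made up for illustration.

```python
# Minimal ETL sketch (hypothetical records): extract rows from a
# "source", clean and transform them, then load them into a list
# standing in for the warehouse table.
raw_rows = [
    {"name": " Alice ", "amount": "120.50", "region": "north"},
    {"name": "Bob",     "amount": "bad",    "region": "SOUTH"},
    {"name": "Carol",   "amount": "99.90",  "region": "South"},
]

def clean(row):
    try:
        amount = float(row["amount"])
    except ValueError:
        return None  # drop rows that fail validation
    return {"name": row["name"].strip(),
            "amount": amount,
            "region": row["region"].lower()}  # unify encodings

warehouse = [r for r in (clean(row) for row in raw_rows) if r is not None]
```

The transform step is where source-specific formats are reconciled into one schema, which is why ETL precedes, and enables, any mining over the integrated data.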
This document provides an introduction to data mining. It defines data mining as the process of extracting knowledge from large amounts of data. The document outlines the typical steps in the knowledge discovery process including data cleaning, transformation, mining, and evaluation. It also describes some common challenges in data mining like dealing with large, high-dimensional, heterogeneous and distributed data. Finally, it summarizes several common data mining tasks like classification, association analysis, clustering, and anomaly detection.
This document provides an overview of data warehousing and data mining. It begins by defining a data warehouse as a system that contains historical and cumulative data from single or multiple sources for simplifying reporting, analysis, and decision making. It describes three common data warehouse architectures and the key components of a data warehouse, including the database, ETL tools, metadata, query tools, and data marts. The document then defines data mining as extracting usable data from raw data using software to analyze patterns. It outlines descriptive and predictive data mining tasks and techniques like clustering, associations, summarization, prediction, and classification. Finally, it provides examples of data mining applications and discusses how AWS services like Amazon Redshift can provide scalable data warehousing.
1. The document discusses database management systems and provides an overview of basic database concepts. It defines what a database is and explains that a database stores related data that can be accessed, updated, and retrieved as needed.
2. Key concepts covered include data types, records, fields, and file structure. Characteristics of databases like data independence and abstraction, self-describing nature, and data sharing are explained.
3. Database types and functions of database management systems are summarized, including organizing, integrating, and retrieving data as well as providing security, backup/recovery, and data access languages. Common data structures like stacks, queues, and linked lists are also mentioned.
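The stack and queue structures mentioned above follow two access disciplines that a short sketch makes concrete:

```python
from collections import deque

# Sketch of the two access disciplines: a stack is last-in,
# first-out (LIFO); a queue is first-in, first-out (FIFO).
stack = []
stack.append("a")
stack.append("b")
top = stack.pop()        # removes "b": last in, first out

queue = deque()
queue.append("a")
queue.append("b")
front = queue.popleft()  # removes "a": first in, first out
```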
This document provides an introduction to databases and data mining. It defines what a database is and describes different types of databases, including centralized, distributed, personal, end user, commercial, NoSQL, operational, relational, cloud, and object-oriented databases. It also discusses database management systems and their role in maintaining database security, integrity, and accessibility. The document then introduces concepts related to data warehousing and data mining, including definitions and common uses.
The document discusses different levels of coupling between data mining (DM) systems and database/data warehouse (DB/DW) systems. It defines:
1) No coupling as DM systems operating independently without utilizing any DB/DW functions.
2) Loose coupling as DM systems fetching data from and storing results in DB/DW systems.
3) Semi-tight coupling as DM systems linking to and using efficient implementations of some DM functions within DB/DW systems.
4) Tight coupling as DM systems being fully integrated with and optimized based on the query processing and data structures of DB/DW systems.
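Loose coupling (level 2) can be sketched with an in-memory SQLite database standing in for the DB/DW system; the table and column names below are invented for illustration.

```python
import sqlite3

# Loose coupling sketch: the mining code fetches data from the DB,
# computes a result in its own process, and stores the result back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("north", 150.0), ("south", 80.0)])

# Fetch: the DB handles storage; the miner does the analysis itself
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# Store the mined summary back into the database
conn.execute("CREATE TABLE region_totals (region TEXT, total REAL)")
conn.executemany("INSERT INTO region_totals VALUES (?, ?)", totals.items())
conn.commit()
```

Under semi-tight or tight coupling, the aggregation above would instead be pushed into the DB/DW system (here, a `GROUP BY` query), exploiting its query optimizer rather than re-implementing the computation in the miner.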
This document outlines the learning objectives and resources for a course on data mining and analytics. The course aims to:
1) Familiarize students with key concepts in data mining like association rule mining and classification algorithms.
2) Teach students to apply techniques like association rule mining, classification, cluster analysis, and outlier analysis.
3) Help students understand the importance of applying data mining concepts across different domains.
The primary textbook listed is "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber. Topics that will be covered include introduction to data mining, preprocessing, association rules, classification algorithms, cluster analysis, and applications.
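Association rule mining, one of the listed topics, starts from frequent-itemset counting; a minimal sketch over invented market baskets:

```python
from itertools import combinations
from collections import Counter

# Toy frequent-itemset count, the first step of association rule
# mining. The baskets are hypothetical transactions.
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "bread"}]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 3  # keep pairs appearing in at least 3 baskets
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
```

From a frequent pair one can then derive a rule and its confidence, e.g. bread => milk holds in 3 of the 4 baskets containing bread; algorithms like Apriori make the counting scale by pruning candidate itemsets.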
The document discusses data warehousing concepts and technologies. It defines a data warehouse as a subject-oriented, integrated, time-variant, and non-volatile collection of data used for decision making. Key aspects covered include multidimensional data modeling using facts, dimensions, and cubes; data warehouse architectures; and efficient cube computation methods such as ROLAP-based algorithms.
1) Data mining involves extracting hidden patterns from large datasets to discover useful information. It is an interdisciplinary field drawing from statistics, machine learning, database technology and more.
2) The overall goal is to extract information and transform it into an understandable structure. This includes data cleaning, integration, selection, transformation, mining patterns, and evaluating/presenting the results.
3) Data mining is used for applications like market analysis, risk analysis, fraud detection and more, across domains like business, science, health, and society. It has the potential to provide insights from vast amounts of accumulated data.
This document provides an overview of data warehousing. It defines data warehousing as collecting data from multiple sources into a central repository for analysis and decision making. The document outlines the history of data warehousing and describes its key characteristics like being subject-oriented, integrated, and time-variant. It also discusses the architecture of a data warehouse including sources, transformation, storage, and reporting layers. The document compares data warehousing to traditional DBMS and explains how data warehouses are better suited for analysis versus transaction processing.
The document discusses key concepts in data warehousing including:
1) The distinction between data and information, with data becoming valuable when organized and presented as information for decision making.
2) Characteristics of a data warehouse including being subject-oriented, integrated, non-volatile, time-variant, and accessible to end-users.
3) Differences between operational data and data warehouse data including the data warehouse being subject-oriented, summarized over time, and serving managerial communities rather than transactional needs.
UNIT - 1 Part 2: Data Warehousing and Data Mining (Nandakumar P)
DBMS Schemas for Decision Support: Star Schema, Snowflake Schema, Fact Constellation Schema, Schema Definition; data extraction, clean-up, and transformation tools.
Unit IV: Introduction to Data Warehousing (Harsha Patel)
Data warehousing combines data from multiple sources to ensure data quality and accuracy. It separates analytics processing from transactional databases. A data warehouse stores historical data and allows fast querying of all data, using OLAP, while a database stores current transactions for online processing using OLTP. A multidimensional data model organizes data into cubes with dimensions and facts to allow analyzing data from different perspectives. Key components of a data warehouse architecture include external data sources, a staging area using ETL, the data warehouse, and data marts containing subsets of warehouse data.
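The multidimensional model described above can be sketched as a tiny two-dimensional cube with roll-ups; the fact rows below are invented.

```python
from collections import defaultdict

# Sketch of a two-dimensional data cube: facts (sales amounts)
# aggregated over the dimensions region and quarter, with "*"
# marking a roll-up over that dimension.
facts = [("north", "Q1", 100), ("north", "Q2", 150),
         ("south", "Q1", 80),  ("south", "Q2", 120)]

cube = defaultdict(float)
for region, quarter, amount in facts:
    cube[(region, quarter)] += amount   # finest-grained cell
    cube[(region, "*")] += amount       # roll-up over quarter
    cube[("*", quarter)] += amount      # roll-up over region
    cube[("*", "*")] += amount          # grand total
```

OLAP operations map directly onto this structure: drill-down moves from a `"*"` cell to its finer cells, and slicing fixes one dimension's value while the other varies.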
Detailed slides on data resource management. The relationships among the many individual data elements stored in databases are based on one of several logical data structures, or models.
A data warehouse consists of several key components:
- Current detail data from operational systems of record which is stored for analysis.
- Integration and transformation programs that convert operational data into a common format for the data warehouse.
- Summarized and archived data used for reporting and analysis over time.
- Metadata that describes the structure and meaning of the data.
Data warehouses are used for standard reporting, queries on summarized data, and data mining of patterns in large datasets to gain business insights.
The document provides an introduction to data mining and knowledge discovery. It discusses how large amounts of data are extracted and transformed into useful information for applications like market analysis and fraud detection. The key steps in the knowledge discovery process are described as data cleaning, integration, selection, transformation, mining, pattern evaluation, and knowledge presentation. Common data sources, database architectures, and types of coupling between data mining systems and databases are also outlined.
Data science involves analyzing structured, semi-structured, and unstructured data to extract knowledge and insights. It employs techniques from fields like statistics, computer science, and information science. Data scientists possess strong skills in programming, statistics, data modeling, and machine learning. The data processing lifecycle involves data acquisition, analysis, curation, storage, and exploration. Big data is characterized by its volume, velocity, and variety. Technologies like Hadoop use clustered computing and distributed storage like HDFS to efficiently process and store large amounts of structured and unstructured data.
The document defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data to support management decision making. A data warehouse is maintained separately from operational databases and provides a platform for consolidated historical data analysis. Key features of a data warehouse include dimensional modeling using facts, dimensions, and star or snowflake schemas.
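The star schema mentioned above can be sketched with one fact table referencing two dimension tables; the table names and sample rows are illustrative, not taken from the slides.

```python
import sqlite3

# Star schema sketch: a central fact table whose foreign keys
# point at the surrounding dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_time  (time_id INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    time_id  INTEGER REFERENCES dim_time,
    store_id INTEGER REFERENCES dim_store,
    amount   REAL);
INSERT INTO dim_time  VALUES (1, 'Q1'), (2, 'Q2');
INSERT INTO dim_store VALUES (1, 'Oslo'), (2, 'Bergen');
INSERT INTO fact_sales VALUES (1, 1, 100), (1, 2, 80), (2, 1, 150);
""")

# A typical star-join query: total sales per quarter
rows = db.execute("""
    SELECT t.quarter, SUM(f.amount)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.quarter ORDER BY t.quarter
""").fetchall()
```

A snowflake schema would further normalize the dimension tables (e.g. splitting city out into its own table), trading simpler storage for extra joins at query time.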
This document provides an overview of data science and key concepts related to emerging technologies. It describes what data science is and its role, differentiates between data and information, describes the data processing life cycle and common data types. It also discusses the basics of big data, including characteristics like volume, velocity and variety. Finally, it introduces clustered computing and components of the Hadoop ecosystem.
This document discusses various data mining techniques, including artificial neural networks. It provides an overview of the knowledge discovery in databases process and the cross-industry standard process for data mining. It also describes techniques such as classification, clustering, regression, association rules, and neural networks. Specifically, it discusses how neural networks are inspired by biological neural networks and can be used to model complex relationships in data.
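The neural-network idea above can be illustrated with a single perceptron, the simplest artificial neuron, trained here on an invented AND-gate dataset.

```python
# Toy perceptron: one weighted-threshold unit that adjusts its
# weights toward each labelled example it misclassifies.
def train_perceptron(samples, epochs=10, lr=0.1):
    w0, w1, bias = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x0, x1), target in samples:
            out = 1 if w0 * x0 + w1 * x1 + bias > 0 else 0
            err = target - out           # 0 when correct
            w0 += lr * err * x0
            w1 += lr * err * x1
            bias += lr * err
    return w0, w1, bias

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w0, w1, bias = train_perceptron(and_gate)

def predict(x0, x1):
    return 1 if w0 * x0 + w1 * x1 + bias > 0 else 0
```

Layering many such units and training them with backpropagation is what lets neural networks model the complex, non-linear relationships the document refers to; a single perceptron can only learn linearly separable patterns.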
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
1. 20IT501 – Data Warehousing and
Data Mining
III Year / V Semester
2. UNIT II- DATA MINING
Introduction – Data – Types of Data – Data
Mining Functionalities – Interestingness of
Patterns – Classification of Data Mining Systems
– Data Mining Task Primitives – Integration of a
Data Mining System with a Data Warehouse –
Issues –Data Preprocessing
3. Data Mining
Data mining, also known as knowledge discovery in data (KDD), is
the process of uncovering patterns and other valuable information
from large data sets.
Data mining has improved organizational decision-making through
insightful data analyses.
In addition, many other terms have a similar meaning to data
mining—for example, knowledge mining from data, knowledge
extraction, data/pattern analysis, data archaeology, and data
dredging
5. Data Mining
The knowledge discovery process:
Data cleaning (to remove noise and inconsistent data)
Data integration (where multiple data sources may be
combined)
Data selection (where data relevant to the analysis task
are retrieved from the database)
6. Data Mining
The knowledge discovery process:
Data transformation (where data are transformed
and consolidated into forms appropriate for mining
by performing summary or aggregation operations)
Data mining (an essential process where intelligent
methods are applied to extract data patterns)
7. Data Mining
The knowledge discovery process:
Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on
interestingness measures)
Knowledge presentation (where visualization and
knowledge representation techniques are used to
present mined knowledge to users)
8. Data Mining
The knowledge discovery process:
Steps 1 through 4 are different forms of data
preprocessing, where data are prepared for mining.
Data Mining uncovers hidden patterns for evaluation.
Data mining is the process of discovering interesting
patterns and knowledge from large amounts of data.
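The knowledge discovery steps above can be sketched as a toy pipeline in Python; the records, field names, and the "best-selling item" pattern are illustrative inventions, not part of the lecture material.

```python
# A minimal sketch of the KDD process steps on a toy record set.
raw = [
    {"cust": "A", "item": "milk",  "qty": 2},
    {"cust": "B", "item": "bread", "qty": None},   # noisy/incomplete record
    {"cust": "A", "item": "bread", "qty": 1},
    {"cust": "C", "item": "milk",  "qty": 5},
]

# 1. Data cleaning: remove records with missing values
clean = [r for r in raw if r["qty"] is not None]

# 2-3. Data integration and selection: keep only task-relevant attributes
selected = [(r["item"], r["qty"]) for r in clean]

# 4. Data transformation: aggregate (summarize) quantity per item
totals = {}
for item, qty in selected:
    totals[item] = totals.get(item, 0) + qty

# 5. Data mining: extract a simple pattern (the best-selling item)
pattern = max(totals, key=totals.get)

# 6-7. Pattern evaluation and knowledge presentation
print(f"Best-selling item: {pattern} ({totals[pattern]} units)")
```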
9. Data Mining
The knowledge discovery process - The data
sources can include
Databases,
Data warehouses,
The Web,
Other information repositories, or
data that are streamed into the system dynamically.
11. Data Mining Architecture
Components - Database, data warehouse, WWW,
or other information repository:
This is one or a set of databases, data warehouses,
spreadsheets, or other kinds of information
repositories.
Data cleaning and data integration techniques may be
performed on the data.
12. Data Mining Architecture
Components - Database or data warehouse
server:
The database or data warehouse server is
responsible for fetching the relevant data, based on
the user’s data mining request.
13. Data Mining Architecture
Components - Knowledge base:
This is the domain knowledge that is used to guide the
search or evaluate the interestingness of resulting
patterns.
Such knowledge can include concept hierarchies, used
to organize attributes or attribute values into different
levels of abstraction.
14. Data Mining Architecture
Components - Data mining engine:
This is essential to the data mining system and ideally
consists of a set of functional modules for tasks such
as characterization, association and correlation
analysis, classification, prediction, cluster analysis,
outlier analysis, and evolution analysis.
15. Data Mining Architecture
Components - Pattern evaluation module:
This component typically employs interestingness
measures and interacts with the data mining modules
so as to focus the search toward interesting patterns.
The pattern evaluation module may be integrated with
the mining module, depending on the implementation
of the data mining method used.
16. Data Mining Architecture
Components - User interface:
This module communicates between users and the data
mining system.
It allows the user to interact with the system by
specifying a data mining query or task, providing
information to help focus the search, and performing
exploratory data mining based on intermediate data
mining results.
17. Types of Data
Database Data (or) Relational Databases:
A database system, also called a database management system
(DBMS), consists of a collection of interrelated data, known as a
database, and a set of software programs to manage and access the data.
The software programs involve mechanisms for the definition of
database structures; for data storage; for concurrent, shared, or
distributed data access; and for ensuring the consistency and security of
the information stored, despite system crashes or attempts at
unauthorized access.
18. Types of Data
Database Data (or) Relational Databases:
A relational database is a collection of tables, each of
which is assigned a unique name.
Each table consists of a set of attributes (columns or fields)
and usually stores a large set of tuples (records or rows).
Each tuple in a relational table represents an object
identified by a unique key and described by a set of
attribute values.
19. Types of Data
Database Data (or) Relational Databases:
A semantic data model, such as an entity-relationship (ER)
data model, is often constructed for relational databases.
An ER data model represents the database as a set of
entities and their relationships.
Relational data can be accessed by database queries
written in a relational query language, such as SQL, or with
the assistance of graphical user interfaces.
20. Types of Data
Database Data (or) Relational Databases:
Example: Data mining systems can analyze customer data
to predict the credit risk of new customers based on their
income, age, and previous credit information.
Data mining systems may also detect deviations—that is,
items with sales that are far from those expected in
comparison with the previous year. Such deviations can
then be further investigated.
21. Types of Data
Database Data (or) Relational Databases:
Relational databases are one of the most
commonly available and rich information
repositories, and thus they are a major data form in
our study of data mining.
22. Types of Data
Data Warehouses:
A data warehouse is a repository of information
collected from multiple sources, stored under a unified
schema, and that usually resides at a single site.
Data warehouses are constructed via a process of data
cleaning, data integration, data transformation, data
loading, and periodic data refreshing.
23. Types of Data
Data Warehouses:
To facilitate decision making, the data in a data
warehouse are organized around major subjects, such
as customer, item, supplier, and activity.
The data are stored to provide information from a
historical perspective (such as from the past 5–10
years) and are typically summarized.
24. Types of Data
Data Warehouses:
A data warehouse is usually modeled by a multidimensional
database structure, where each dimension corresponds to an
attribute or a set of attributes in the schema, and each cell stores
the value of some aggregate measure, such as count or sales
amount.
A data cube provides a multidimensional view of data and
allows the precomputation and fast accessing of summarized
data.
25. Types of Data
Transactional Databases
In general, a transactional database consists of a
file where each record represents a transaction.
A transaction typically includes a unique
transaction identity number (trans ID) and a list of
the items making up the transaction
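A transactional database of this form can be sketched in Python as a mapping from transaction IDs to item lists; the IDs and items below are illustrative.

```python
# Illustrative sketch: a transactional database as (trans ID, item list) records.
transactions = [
    ("T100", ["milk", "bread", "butter"]),
    ("T200", ["milk", "bread"]),
    ("T300", ["beer", "diapers"]),
]

# Retrieve the items of one transaction by its unique trans ID
db = dict(transactions)
print(db["T200"])  # ['milk', 'bread']
```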
26. Types of Data
Other Kinds of Data
Time-related or sequence data (e.g., historical records, stock
exchange data, and time-series and biological sequence data),
Data streams (e.g., video surveillance and sensor data, which are
continuously transmitted), spatial data (e.g., maps),
Engineering design data (e.g., the Design of buildings, system
components, or integrated circuits),
27. Types of Data
Other Kinds of Data
Hypertext and multimedia data (including text, image,
video, and audio data),
Graph and networked data (e.g., social and information
networks), and
The Web (a huge, widely distributed information
repository made available by the Internet).
28. Data Mining Functionalities
Data mining functionalities are used to specify the
kinds of patterns to be found in data mining tasks.
Descriptive mining tasks characterize properties of the
data in a target data set.
Predictive mining tasks perform induction on the
current data in order to make predictions.
29. Data Mining Functionalities
Concept/Class Description: Characterization and
Discrimination:
Data can be associated with classes or concepts.
Example: Student Reg. No. and Student Name belong to the
Student class.
It can be useful to describe individual classes and concepts
in summarized, concise, and yet precise terms.
30. Data Mining Functionalities
Concept/Class Description: Characterization and
Discrimination:
Data characterization, by summarizing the data of the class
under study (often called the target class) in general terms.
Data discrimination, by comparison of the target class with one
or a set of comparative classes (often called the contrasting
classes), or
Both data characterization and discrimination.
31. Data Mining Functionalities
Concept/Class Description: Characterization and Discrimination:
The data cube–based OLAP roll-up operation can be used to perform
user-controlled data summarization along a specified dimension.
The output of data characterization can be presented in various forms.
Examples: Pie and bar charts, curves and multidimensional data cubes
32. Data Mining Functionalities
Concept/Class Description: Characterization and Discrimination:
In data discrimination, the target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved
through database queries.
For example, the user may like to compare the general features of
software products whose sales increased by 10% in the last year with
those whose sales decreased by at least 30% during the same period.
33. Data Mining Functionalities
Mining Frequent Patterns, Associations, and
Correlations:
Frequent patterns, as the name suggests, are patterns
that occur frequently in data.
A frequent itemset typically refers to a set of items
that frequently appear together in a transactional data
set, such as milk and bread.
34. Data Mining Functionalities
Mining Frequent Patterns, Associations, and
Correlations:
A frequently occurring subsequence, such as the
pattern that customers tend to purchase first a PC,
followed by a digital camera, and then a memory
card, is a (frequent) sequential pattern.
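Finding frequent itemsets can be sketched as a single Apriori-style counting pass; the transactions and the minimum support threshold below are made up for illustration.

```python
from itertools import combinations
from collections import Counter

# A minimal sketch of frequent-itemset counting for pairs of items.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "beer"},
    {"bread", "butter"},
]
min_support = 2  # a pair is "frequent" if it occurs in at least 2 transactions

counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        counts[pair] += 1

frequent = {pair for pair, c in counts.items() if c >= min_support}
print(frequent)  # {('bread', 'milk'), ('bread', 'butter')}
```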
35. Data Mining Functionalities
Classification and Prediction
Classification is the process of finding a model (or
function) that describes and distinguishes data
classes or concepts, for the purpose of being able
to use the model to predict the class of objects
whose class label is unknown.
36. Data Mining Functionalities
Classification and Prediction
The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision
trees, mathematical formulae, or neural networks.
Classification predicts categorical (discrete,
unordered) labels, whereas regression models predict
continuous-valued functions.
37. Data Mining Functionalities
Classification and Prediction
Regression is used to predict missing or
unavailable numerical data values rather than
(discrete) class labels. The term prediction refers to
both numeric prediction and class label prediction.
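A classification model represented as IF-THEN rules can be sketched as follows; the attributes, thresholds, and class labels are invented for illustration, not a real credit model.

```python
# A minimal sketch of a derived model expressed as IF-THEN classification rules.
def credit_risk(income, has_defaulted):
    """Predict a categorical class label from customer attributes."""
    if has_defaulted:          # IF defaulted THEN risk = high
        return "high"
    if income >= 50000:        # IF income >= 50000 THEN risk = low
        return "low"
    return "medium"            # otherwise risk = medium

print(credit_risk(60000, False))  # low
print(credit_risk(30000, True))   # high
```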
38. Data Mining Functionalities
Cluster Analysis
Clustering analyzes data objects without consulting class
labels.
Clustering can be used to generate class labels for a group
of data.
The objects are clustered or grouped based on the principle
of maximizing the intraclass similarity and minimizing the
interclass similarity.
39. Data Mining Functionalities
Cluster Analysis
Clusters of objects are formed so that objects
within a cluster have high similarity in comparison
to one another, but are rather dissimilar to objects
in other clusters.
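The grouping principle above can be sketched with a tiny one-dimensional k-means loop (k = 2); the data points and starting centroids are illustrative, and no class labels are consulted.

```python
# A tiny 1-D k-means sketch: group points by similarity to the nearest centroid.
points = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
centroids = [1.0, 9.0]  # illustrative starting centroids

for _ in range(5):  # a few refinement iterations
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest centroid
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # recompute each centroid as the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(clusters)  # [[1.0, 1.5, 2.0], [8.0, 9.0, 9.5]]
```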
40. Data Mining Functionalities
Outlier Analysis:
A data set may contain objects that do not comply
with the general behavior or model of the data.
These data objects are outliers.
Many data mining methods discard outliers as
noise or exceptions.
41. Data Mining Functionalities
Outlier Analysis:
However, in some applications (e.g., fraud
detection) the rare events can be more interesting
than the more regularly occurring ones.
The analysis of outlier data is referred to as outlier
analysis or anomaly mining.
42. Data Mining Functionalities
Evolution Analysis:
Data evolution analysis describes and models
regularities or trends for objects whose behavior
changes over time.
Example: Stock market (time-series) data
43. Interestingness of Patterns
A pattern is interesting if it is
Easily understood by humans,
Valid on new or test data with some degree of certainty,
Potentially useful and
Novel
An interesting pattern represents knowledge
44. Interestingness of Patterns
Several objective measures of pattern
interestingness exist.
An objective measure for association rules of the
form X ⇒ Y is rule support.
Another objective measure for association rules is
confidence.
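The two measures can be computed directly for a rule X ⇒ Y, with support as the fraction of transactions containing both X and Y, and confidence as the fraction of transactions containing X that also contain Y; the transactions below are illustrative.

```python
# A sketch of the objective interestingness measures for the rule X => Y.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
]
X, Y = {"milk"}, {"bread"}

n = len(transactions)
both = sum(1 for t in transactions if X | Y <= t)   # contain X and Y
x_only = sum(1 for t in transactions if X <= t)     # contain X

support = both / n          # support(X => Y)
confidence = both / x_only  # confidence(X => Y)
print(support, confidence)
```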
46. Classification of Data Mining
Systems
Classification according to the kinds of databases
mined:
Database systems can be classified according to different
criteria (such as data models, or the types of data or
applications involved), each of which may require its own
data mining technique.
Data mining systems can therefore be classified
accordingly.
47. Classification of Data Mining
Systems
Classification according to the kinds of databases mined:
For instance, if classifying according to data models, we may
have a relational, transactional, object-relational, or data
warehouse mining system.
If classifying according to the special types of data handled, we
may have a spatial, time-series, text, stream data, multimedia
data mining system, or a World Wide Web mining system
48. Classification of Data Mining
Systems
Classification according to the kinds of
knowledge mined:
It is based on data mining functionalities, such as
characterization, discrimination, association and
correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
49. Classification of Data Mining
Systems
Classification according to the kinds of techniques utilized:
The techniques can be described according to the degree of user
interaction involved (e.g., autonomous systems, interactive
exploratory systems, query-driven systems) or
The methods of data analysis employed (e.g., database-oriented
or data warehouse– oriented techniques, machine learning,
statistics, visualization, pattern recognition, neural networks, and
so on).
50. Classification of Data Mining
Systems
Classification according to the applications
adapted:
Data mining systems may be tailored specifically for
finance, telecommunications, DNA, stock markets, e-
mail, and so on.
An all-purpose data mining system may not fit domain-
specific mining tasks.
51. Data Mining Task Primitives
A data mining task can be specified in the form of a data
mining query, which is input to the data mining system.
A data mining query is defined in terms of data mining task
primitives.
These primitives allow the user to interactively
communicate with the data mining system during discovery
in order to direct the mining process, or examine the
findings from different angles or depths
52. Data Mining Task Primitives
The set of task-relevant data to be mined:
This specifies the portions of the database or the
set of data in which the user is interested.
This includes the database attributes or data
warehouse dimensions of interest.
53. Data Mining Task Primitives
The kind of knowledge to be mined:
This specifies the data mining functions to be
performed, such as characterization,
discrimination, association or correlation analysis,
classification, prediction, clustering, outlier
analysis, or evolution analysis
54. Data Mining Task Primitives
The background knowledge to be used in the
discovery process:
This knowledge about the domain to be mined is
useful for guiding the knowledge discovery
process and for evaluating the patterns found.
55. Data Mining Task Primitives
The interestingness measures and thresholds for
pattern evaluation:
They may be used to guide the mining process or, after
discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different
interestingness measures.
Example: Support and Confidence
56. Data Mining Task Primitives
The expected representation for visualizing the
discovered patterns:
This refers to the form in which discovered
patterns are to be displayed, which may include
rules, tables, charts, graphs, decision trees, and
cubes.
57. Integration of a Data Mining System with a
Database or Data Warehouse System
Good system architecture will facilitate the data mining
system:
to make best use of the software environment
accomplish data mining tasks in an efficient and timely manner
interoperate and exchange information with other information
systems
be adaptable to users’ diverse requirements, and evolve with
time
58. Integration of a Data Mining System with a
Database or Data Warehouse System
A key design issue for a data mining (DM) system is how
to integrate or couple it with a database (DB) system
and/or a data warehouse (DW) system.
If a DM system works as a stand-alone system or is
embedded in an application program, there are no DB
or DW systems with which it has to communicate.
59. Integration of a Data Mining System with a
Database or Data Warehouse System
No coupling:
A DM system will not utilize any function of a DB or
DW system.
It may fetch data from a particular source (such as a
file system), process data using some data mining
algorithms, and then store the mining results in another
file.
60. Integration of a Data Mining System with a
Database or Data Warehouse System
No coupling – Drawbacks:
First, a DB system provides a great deal of flexibility
and efficiency at storing, organizing, accessing, and
processing data.
Without using a DB/DW system, a DM system may
spend a substantial amount of time finding, collecting,
cleaning, and transforming data.
61. Integration of a Data Mining System with a
Database or Data Warehouse System
No coupling – Drawbacks:
Second, there are many tested, scalable algorithms and
data structures implemented in DB and DW systems.
It is feasible to realize efficient, scalable implementations
using such systems.
Moreover, most data have been or will be stored in
DB/DW systems
62. Integration of a Data Mining System with a
Database or Data Warehouse System
Loose coupling:
A DM system will use some facilities of a DB or DW
system, fetching data from a data repository managed
by these systems, performing data mining, and then
storing the mining results either in a file or in a
designated place in a database or data warehouse.
63. Integration of a Data Mining System with a
Database or Data Warehouse System
Loose coupling:
Loose coupling is better than no coupling because it
can fetch any portion of data stored in databases or
data warehouses by using query processing, indexing,
and other system facilities.
It gains some of the flexibility, efficiency,
and other features provided by such systems.
64. Integration of a Data Mining System with a
Database or Data Warehouse System
Loose coupling:
Many loosely coupled mining systems are main-memory
based, because mining does not exploit the data structures
and query optimization methods provided by DB or DW
systems.
It is difficult for loose coupling to achieve high scalability
and good performance with large data sets.
65. Integration of a Data Mining System with a
Database or Data Warehouse System
Semitight coupling:
Semitight coupling means that besides linking a DM
system to a DB/DW system, efficient implementations of a
few essential data mining primitives can be provided in the
DB/DW system.
Some frequently used intermediate mining results can be
precomputed and stored in the DB/DW system.
This design will enhance the performance of a DM system
66. Integration of a Data Mining System with a
Database or Data Warehouse System
Tight coupling:
A DM system is smoothly integrated into the DB/DW system.
Data mining queries and functions are optimized based on
mining query analysis, data structures, indexing schemes, and
query processing methods of a DB or DW system.
With further technology advances, DM, DB, and DW systems
will evolve and integrate together as one information system
with multiple functionalities.
67. Major Issues in Data Mining
Mining methodology and user interaction issues:
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of
abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
68. Major Issues in Data Mining
Mining methodology and user interaction
issues:
Presentation and visualization of data mining
results
Handling noisy or incomplete data
Pattern evaluation—the interestingness problem
69. Major Issues in Data Mining
Performance issues:
Efficiency and scalability of data mining
algorithms
Parallel, distributed, and incremental mining
algorithms
70. Major Issues in Data Mining
Issues relating to the diversity of database
types:
Handling of relational and complex types of data
Mining information from heterogeneous databases
and global information systems
71. Data Preprocessing
Why Preprocess the Data?
Incomplete, noisy, and inconsistent data are
common place properties of large real world
databases and data warehouses.
72. Data Preprocessing
Why Preprocess the Data?
Incomplete data can occur for a number of
reasons.
Attributes of interest may not always be available
Other data may not be included simply because they were
not considered important at the time of entry
73. Data Preprocessing
Why Preprocess the Data?
Incomplete data can occur for a number of reasons.
Relevant data may not be recorded due to a
misunderstanding, or because of equipment malfunctions.
Data that were inconsistent with other recorded data may
have been deleted.
Missing data, particularly for tuples with missing values for
some attributes, may need to be inferred.
74. Data Preprocessing
Why Preprocess the Data?
There are many possible reasons for noisy data.
The data collection instruments used may be faulty.
There may have been human or computer errors occurring at data
entry.
Errors in data transmission can also occur.
Incorrect data may also result from inconsistencies in naming
conventions or data codes used, or inconsistent formats for input
fields, such as date.
75. Data Preprocessing
Descriptive Data Summarization
Descriptive data summarization techniques can be used to
identify the typical properties of your data and highlight
which data values should be treated as noise or outliers.
For many data preprocessing tasks, users would like to
learn about data characteristics regarding both central
tendency and dispersion of the data.
76. Data Preprocessing
Descriptive Data Summarization
Measures of central tendency include mean,
median, mode, and midrange, while measures of
data dispersion include quartiles, interquartile
range (IQR), and variance.
77. Data Preprocessing
Descriptive Data Summarization - Measures
of central tendency:
The most common and most effective numerical
measure of the “center” of a set of data is the
(arithmetic) mean.
78. Data Preprocessing
Descriptive Data Summarization - Measures of central
tendency:
A distributive measure is a measure (i.e., function) that can
be computed for a given data set by partitioning the data
into smaller subsets, computing the measure for each
subset, and then merging the results in order to arrive at the
measure’s value for the original (entire) data set.
Example: sum(), count(), max() and min()
79. Data Preprocessing
Descriptive Data Summarization - Measures of
central tendency:
An algebraic measure is a measure that can be
computed by applying an algebraic function to one or
more distributive measures.
Example: average() is an algebraic measure because it
can be computed by sum()/count()
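The distributive and algebraic measures above can be illustrated by partitioning a data set, computing sum() and count() per subset, and merging; the data values and the two-way split are illustrative.

```python
# A sketch of merging distributive measures (sum, count) across partitions
# to obtain the algebraic measure average() = sum() / count().
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]  # split the data set into two subsets

# compute sum() and count() on each subset, then merge the results
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)

average = total / count
print(average)  # 18.0
```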
80. Data Preprocessing
Descriptive Data Summarization - Measures of
central tendency:
Although the mean is the single most useful quantity for
describing a data set, it is not always the best way of
measuring the center of the data.
A major problem with the mean is its sensitivity to
extreme (e.g., outlier) values.
81. Data Preprocessing
Descriptive Data Summarization - Measures of central
tendency:
For skewed (asymmetric) data, a better measure of the
center of data is the median.
Suppose that a given data set of N distinct values is sorted
in numerical order. If N is odd, then the median is the
middle value of the ordered set; otherwise (i.e., if N is
even), the median is the average of the middle two values
82. Data Preprocessing
Descriptive Data Summarization - Measures of central
tendency:
A holistic measure is a measure that must be computed on the
entire data set as a whole.
It cannot be computed by partitioning the given data into subsets
and merging the values obtained for the measure in each subset.
Example: Median
83. Data Preprocessing
Descriptive Data Summarization - Measures of central
tendency:
Another measure of central tendency is the mode. The
mode for a set of data is the value that occurs most
frequently in the set.
It is possible for the greatest frequency to correspond to
several different values, which results in more than one
mode.
84. Data Preprocessing
Descriptive Data Summarization - Measures of
central tendency:
Data sets with one, two, or three modes are
respectively called unimodal, bimodal, and trimodal.
In general, a data set with two or more modes is
multimodal. At the other extreme, if each data value
occurs only once, then there is no mode.
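The three central-tendency measures can be computed with Python's standard statistics module; the (sorted, odd-length) data set below is illustrative.

```python
import statistics

# A sketch of mean, median, and mode on an illustrative data set.
data = [30, 36, 47, 50, 52, 52, 56]   # N = 7 distinct-position values, sorted

print(statistics.mean(data))    # arithmetic mean
print(statistics.median(data))  # N is odd, so the middle value: 50
print(statistics.mode(data))    # most frequent value (unimodal here): 52
```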
85. Data Preprocessing
Descriptive Data Summarization - Measuring the Dispersion of
Data:
The degree to which numerical data tend to spread is called the
dispersion, or variance of the data.
The most common measures of data dispersion are range, the five-
number summary (based on quartiles), the interquartile range, and the
standard deviation.
Boxplots can be plotted based on the five-number summary and are a
useful tool for identifying outliers.
86. Data Preprocessing
Descriptive Data Summarization - Measuring the
Dispersion of Data:
The range of the set is the difference between the
largest (max()) and smallest (min()) values.
The kth percentile of a set of data in numerical order is
the value xi having the property that k percent of the
data entries lie at or below xi.
87. Data Preprocessing
Descriptive Data Summarization - Measuring the
Dispersion of Data:
The most commonly used percentiles other than the
median are quartiles.
The first quartile, denoted by Q1, is the 25th
percentile; the third quartile, denoted by Q3, is the
75th percentile.
88. Data Preprocessing
Descriptive Data Summarization - Measuring the
Dispersion of Data:
The distance between the first and third quartiles is a
simple measure of spread that gives the range covered
by the middle half of the data. This distance is called
the interquartile range (IQR) and is defined as
IQR = Q3 – Q1
89. Data Preprocessing
Descriptive Data Summarization - Measuring the
Dispersion of Data:
The five-number summary of a distribution consists of
the median, the quartiles Q1 and Q3, and the smallest
and largest individual observations, written in the
order
Minimum; Q1; Median; Q3; Maximum
90. Data Preprocessing
Descriptive Data Summarization - Measuring the
Dispersion of Data:
Boxplots are a popular way of visualizing a distribution. A
boxplot incorporates the five-number summary as follows:
Typically, the ends of the box are at the quartiles, so that the box
length is the interquartile range, IQR.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest
(Minimum) and largest (Maximum) observations
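The five-number summary and IQR can be computed with the standard statistics module; note that `statistics.quantiles` uses one common quartile convention (the "exclusive" method), and other conventions may give slightly different quartiles. The data set is illustrative.

```python
import statistics

# A sketch of the five-number summary and interquartile range (IQR).
data = sorted([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36])

q1, median, q3 = statistics.quantiles(data, n=4)  # quartiles Q1, Q2, Q3
iqr = q3 - q1                                     # IQR = Q3 - Q1

five_number = (min(data), q1, median, q3, max(data))
print(five_number)  # Minimum, Q1, Median, Q3, Maximum
print(iqr)
```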
92. Data Preprocessing
Graphic Displays of Basic Descriptive Data
Summaries:
Plotting histograms, or frequency histograms, is a
graphical method for summarizing the distribution of a
given attribute.
A histogram for an attribute A partitions the data
distribution of A into disjoint subsets, or buckets
93. Data Preprocessing
Graphic Displays of Basic Descriptive Data
Summaries:
Typically, the width of each bucket is uniform.
Each bucket is represented by a rectangle whose
height is equal to the count or relative frequency of
the values at the bucket.
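Bucketing values for a frequency histogram with uniform bucket width can be sketched as follows; the attribute values and the width of 10 are illustrative.

```python
from collections import Counter

# A sketch of histogram bucketing: partition values into disjoint,
# uniform-width buckets and count the frequency in each.
values = [3, 7, 12, 14, 18, 22, 25, 27, 41]
width = 10

buckets = Counter((v // width) * width for v in values)
for start in sorted(buckets):
    # a crude text rendering of each bucket's rectangle
    print(f"[{start}, {start + width}): {'*' * buckets[start]}")
```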
95. Data Preprocessing
Graphic Displays of Basic Descriptive Data
Summaries - Quantile plot:
A quantile plot is a simple and effective way to have a
first look at a univariate data distribution.
First, it displays all of the data for the given attribute.
Second, it plots quantile information.
96. Data Preprocessing
Graphic Displays of Basic Descriptive Data
Summaries – Scatter plot:
One of the most effective graphical methods for determining
whether a relationship, pattern, or trend exists between two
numerical attributes.
To construct a scatter plot, each pair of values is treated as
a pair of coordinates in an algebraic sense and plotted as
points in the plane.
98. Data Cleaning
Real-world data tend to be incomplete, noisy, and
inconsistent.
Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise
while identifying outliers, and correct
inconsistencies in the data.
99. Data Cleaning
Missing Values:
Ignore the tuple: This is usually done when the class label
is missing. This method is not very effective, unless the
tuple contains several attributes with missing values.
Fill in the missing value manually: In general, this
approach is time-consuming and may not be feasible given
a large data set with many missing values.
100. Data Cleaning
Missing Values:
Use a global constant to fill in the missing value: Replace
all missing attribute values by the same constant, such as a
label like “Unknown”.
Use the attribute mean to fill in the missing value: For
example, suppose that the average income of customers is
$56,000. Use this value to replace the missing value for
income.
101. Data Cleaning
Missing Values:
Use the attribute mean for all samples belonging to
the same class as the given tuple: For example, if
classifying customers according to credit risk, replace
the missing value with the average income value for
customers in the same credit risk category as that of
the given tuple.
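The class-conditional mean strategy above can be sketched as follows, on hypothetical customer records (the values here are assumed for illustration, with `None` marking a missing income):

```python
# Hypothetical records: (credit_risk, income); None marks a missing value.
records = [
    ("low",  52000), ("low",  60000), ("low",  None),
    ("high", 30000), ("high", None),  ("high", 34000),
]

# Mean income per credit-risk class, computed over the known values only.
by_class = {}
for risk, income in records:
    if income is not None:
        by_class.setdefault(risk, []).append(income)
class_means = {k: sum(v) / len(v) for k, v in by_class.items()}

# Replace each missing income with the mean of its own class.
filled = [(r, i if i is not None else class_means[r]) for r, i in records]
```

Each missing value is thus filled with the average of tuples in the same credit-risk category, rather than a global average.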
102. Data Cleaning
Missing Values:
Use the most probable value to fill in the missing
value: For example, using the other customer
attributes in your data set, you may construct a
decision tree to predict the missing values for
income.
103. Data Cleaning
Noisy Data:
Noise is a random error or variance in a measured
variable.
Methods: binning, regression, and clustering.
104. Data Cleaning
Noisy Data – Binning:
Binning methods smooth a sorted data value by consulting
its “neighborhood,” that is, the values around it.
The sorted values are distributed into a number of
“buckets,” or bins.
Because binning methods consult the neighborhood of
values, they perform local smoothing
105. Data Cleaning
Noisy Data – Binning:
Example: Data: 4,8,15,21,21,24,25,28,34
Partition into (equal-frequency) bins:
Bin 1: 4,8,15
Bin 2: 21,21,24
Bin 3: 25,28,34
106. Data Cleaning
Noisy Data – Binning:
Example: Data: 4,8,15,21,21,24,25,28,34
Smoothing by bin means: each value in a bin is
replaced by the mean value of the bin
Bin 1: 9,9,9
Bin 2: 22,22,22
Bin 3: 29,29,29
107. Data Cleaning
Noisy Data – Binning:
Example: Data: 4,8,15,21,21,24,25,28,34
Smoothing by bin boundaries: The minimum and maximum values
in a given bin are identified as the bin boundaries. Each bin value is
then replaced by the closest boundary value
Bin 1: 4,4,15
Bin 2: 21,21,24
Bin 3: 25,25,34
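Both smoothing techniques from the examples above can be sketched in a few lines; running this on the same data reproduces the bins shown on the slides:

```python
def equal_frequency_bins(values, num_bins):
    """Split sorted values into bins of (roughly) equal size."""
    xs = sorted(values)
    size = len(xs) // num_bins
    return [xs[i * size:(i + 1) * size] for i in range(num_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with whichever bin boundary (min or max) is closer."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(data, 3)   # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
means = smooth_by_means(bins)          # bins of 9.0, 22.0, and 29.0
bounds = smooth_by_boundaries(bins)    # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

Because each value is adjusted only relative to its own bin, this is local smoothing.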
108. Data Cleaning
Noisy Data – Regression:
Data can be smoothed by fitting the data to a
function, such as with regression.
Linear regression involves finding the “best” line
to fit two attributes (or variables), so that one
attribute can be used to predict the other.
109. Data Cleaning
Noisy Data – Clustering:
Outliers may be detected by clustering, where
similar values are organized into groups, or
“clusters.”
Intuitively, values that fall outside of the set of
clusters may be considered outliers
110. Data Cleaning
Data Cleaning as a Process:
Missing values, noise, and inconsistencies
contribute to inaccurate data.
The first step in data cleaning as a process is
discrepancy detection.
111. Data Cleaning
Data Cleaning as a Process - Discrepancies can be caused by
several factors,
poorly designed data entry forms that have many optional fields
human error in data entry
deliberate errors (e.g., respondents not wanting to disclose information
about themselves)
data decay (e.g., outdated addresses)
Errors in instrumentation devices that record data, and system errors
113. Data Cleaning
Data Cleaning as a Process – Discrepancies Detection
Tools:
Data scrubbing tools use simple domain knowledge (e.g.,
knowledge of postal addresses, and spell-checking) to
detect errors and make corrections in the data
Data auditing tools find discrepancies by analyzing the
data to discover rules and relationships, and detecting data
that violate such conditions.
114. Data Cleaning
Data Cleaning as a Process – Discrepancies Detection
Tools:
Data migration tools allow simple transformations to be
specified, such as to replace the string “gender” by “sex”
ETL (extraction/transformation/loading) tools allow users
to specify transforms through a graphical user interface
(GUI).
115. Data Integration and
Transformation
Data Integration:
It combines data from multiple sources into a
coherent data store, as in data warehousing. These
sources may include multiple databases, data
cubes, or flat files.
116. Data Integration and
Transformation
Data Integration – Issues:
Schema integration and object matching can be tricky.
For example, how can the data analyst or the
computer be sure that customer_id in one database and
cust_number in another refer to the same attribute?
Metadata can be used to help avoid errors in schema
integration
117. Data Integration and
Transformation
Data Integration – Issues:
Metadata for each attribute include the name,
meaning, data type, and range of values permitted
for the attribute, and null rules for handling blank,
zero, or null values.
118. Data Integration and
Transformation
Data Integration – Issues:
Redundancy is another important issue.
Inconsistencies in attribute or dimension naming can
also cause redundancies in the resulting data set.
Some redundancies can be detected by correlation
analysis.
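One common form of correlation analysis for numeric attributes is the Pearson correlation coefficient; a minimal sketch on toy data (values assumed for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]            # b is a perfect linear copy of a
print(round(pearson(a, b), 3))  # 1.0 → b is redundant given a
```

A coefficient near +1 or -1 suggests one attribute is redundant given the other and could be dropped during integration.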
119. Data Integration and
Transformation
Data Integration – Issues:
A third important issue in data integration is the
detection and resolution of data value conflicts.
For a hotel chain, the price of rooms in different cities
may involve not only different currencies but also
different services (such as free breakfast) and taxes.
120. Data Integration and
Transformation
Data Integration:
The semantic heterogeneity and structure of data pose
great challenges in data integration.
Careful integration of the data from multiple sources can
help reduce and avoid redundancies and inconsistencies in
the resulting data set.
This can help improve the accuracy and speed of the
subsequent mining process.
122. Data Integration and
Transformation
Data Transformation:
Smoothing, which works to remove noise from the data. Such
techniques include binning, regression, and clustering.
Aggregation, where summary or aggregation operations are
applied to the data.
Generalization of the data, where low-level or “primitive” (raw)
data are replaced by higher-level concepts through the use of
concept hierarchies
123. Data Integration and
Transformation
Data Transformation:
Normalization, where the attribute data are scaled so
as to fall within a small specified range, such as -1.0 to
1.0, or 0.0 to 1.0.
Attribute construction (or feature construction),where
new attributes are constructed and added from the
given set of attributes to help the mining process.
124. Data Integration and
Transformation
Data Transformation - There are many
methods for data normalization.
Min-max normalization,
Z-score normalization,
Normalization by decimal scaling.
125. Data Integration and
Transformation
Min-max normalization:
Min-max normalization preserves the
relationships among the original data values.
It will encounter an “out-of-bounds” error if a
future input case for normalization falls outside of
the original data range for A.
126. Data Integration and
Transformation
Min-max normalization:
Suppose that the minimum and maximum values for
the attribute income are $12,000 and $98,000,
respectively, and we want to map income to the range
[0.0, 1.0]. By min-max normalization, a value of $73,600
for income is transformed to
(73,600 - 12,000) / (98,000 - 12,000) = 0.716
127. Data Integration and
Transformation
Z-score normalization:
The values for an attribute, A, are normalized
based on the mean and standard deviation of A. A
value, v, of A is normalized to v' by computing
v' = (v - Ā) / σA
where Ā and σA are the mean and standard deviation
of A.
128. Data Integration and
Transformation
Z-score normalization:
The mean and standard deviation of the values for
the attribute income are $54,000 and $16,000,
respectively. A value of $73,600 for income is
transformed to (73,600 - 54,000) / 16,000 = 1.225.
129. Data Integration and
Transformation
Normalization by decimal scaling
Normalizes by moving the decimal point of the values
of attribute A.
The number of decimal places moved depends on the
maximum absolute value of A: a value v is normalized
to v' = v / 10^j, where j is the smallest integer such
that max(|v'|) < 1.
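The three normalization methods can be sketched together and checked against the income example from the preceding slides:

```python
import math

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max normalization: linearly map [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    """Z-score normalization based on the attribute mean and std deviation."""
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1.
    Assumes max_abs > 0."""
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

income = 73600
print(round(min_max(income, 12000, 98000), 3))   # 0.716
print(round(z_score(income, 54000, 16000), 3))   # 1.225
print(round(decimal_scaling(income, 98000), 3))  # 0.736
```

Note that min-max normalization needs the original min and max, and z-score needs the mean and standard deviation, which is why those parameters must be saved for normalizing future data uniformly.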
130. Data Integration and
Transformation
Normalization can change the original data quite a bit.
It is also necessary to save the normalization
parameters (such as the mean and standard deviation if
using z-score normalization) so that future data can be
normalized in a uniform manner.
131. Data Integration and
Transformation
Attribute Construction:
New attributes are constructed from the given
attributes and added in order to help improve the
accuracy and understanding of structure in high-
dimensional data.
For example, the attribute area can be constructed
from the attributes height and width.
132. Data Reduction
To obtain a reduced representation of the data
set that is much smaller in volume, yet closely
maintains the integrity of the original data.
133. Data Reduction
Strategies
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
134. Data Reduction
Strategies - Data cube aggregation
The data can be aggregated so that the resulting
data summarize the total sales per year instead of
per quarter.
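Rolling quarterly sales up to yearly totals is a one-pass aggregation; a minimal sketch on hypothetical sales figures (the amounts are assumed for illustration):

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, amount).
sales = [(2022, 1, 224), (2022, 2, 408), (2022, 3, 350), (2022, 4, 586),
         (2023, 1, 300), (2023, 2, 410), (2023, 3, 380), (2023, 4, 600)]

# Aggregate away the quarter dimension, keeping one total per year.
yearly = defaultdict(int)
for year, quarter, amount in sales:
    yearly[year] += amount
```

The reduced representation has one tuple per year instead of four, yet still answers any per-year query exactly.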
135. Data Reduction
Strategies - Data cube aggregation
The cube created at the lowest level of abstraction is
referred to as the base cuboid.
The base cuboid should correspond to an individual
entity of interest, such as sales or customer.
In other words, the lowest level should be usable, or
useful for the analysis.
136. Data Reduction
Strategies - Data cube aggregation
A cube at the highest level of abstraction is the
apex cuboid.
The apex cuboid gives a single total that summarizes
all the data (e.g., total sales across all years).
137. Data Reduction
Strategies - Attribute Subset Selection
Attribute subset selection reduces the data set size by
removing irrelevant or redundant attributes (or
dimensions).
To find a minimum set of attributes such that the resulting
probability distribution of the data classes is as close as
possible to the original distribution obtained using all
attributes.
138. Data Reduction
Strategies - Attribute Subset Selection
Stepwise forward selection: The procedure starts with
an empty set of attributes as the reduced set.
The best of the original attributes is determined and
added to the reduced set.
At each subsequent iteration or step, the best of the
remaining original attributes is added to the set.
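The greedy procedure above can be sketched generically; the scoring function here is a hypothetical relevance measure supplied by the caller (in practice it might be a statistical significance test or a classifier's accuracy):

```python
def forward_selection(attributes, score, k):
    """Stepwise forward selection: start from the empty set and, at each
    step, add the attribute that most improves `score` on the current set."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: assumed, pre-assigned relevance weight per attribute.
weights = {"income": 3.0, "age": 2.0, "zip": 0.5, "id": 0.0}
score = lambda attrs: sum(weights[a] for a in attrs)
chosen = forward_selection(list(weights), score, 2)   # ["income", "age"]
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.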
139. Data Reduction
Strategies - Attribute Subset Selection
Stepwise backward elimination: The procedure
starts with the full set of attributes. At each step, it
removes the worst attribute remaining in the set.
140. Data Reduction
Strategies - Attribute Subset Selection
Combination of forward selection and backward
elimination: The stepwise forward selection and
backward elimination methods can be combined so
that, at each step, the procedure selects the best
attribute and removes the worst from among the
remaining attributes.
141. Data Reduction
Strategies - Attribute Subset Selection
Decision tree induction: Decision tree algorithms, such as ID3,
C4.5, and CART, were originally intended for classification.
Decision tree induction constructs a flowchart-like structure
where each internal (nonleaf) node denotes a test on an attribute,
each branch corresponds to an outcome of the test, and each
external (leaf) node denotes a class prediction.
142. Data Reduction
Dimensionality Reduction:
In dimensionality reduction, data encoding or
transformations are applied so as to obtain a reduced or
“compressed” representation of the original data.
Methods: wavelet transforms and principal
components analysis.
143. Data Reduction
Dimensionality Reduction - Wavelet
Transforms:
The discrete wavelet transform (DWT) is a linear
signal processing technique that, when applied to a
data vector X, transforms it to a numerically
different vector, X', of wavelet coefficients.
144. Data Reduction
Dimensionality Reduction - Wavelet Transforms:
A compressed approximation of the data can be retained by storing
only a small fraction of the strongest of the wavelet coefficients.
For example, all wavelet coefficients larger than some user-specified
threshold can be retained. All other coefficients are set to 0.
The technique also works to remove noise without smoothing out the
main features of the data, making it effective for data cleaning as well.
145. Data Reduction
Dimensionality Reduction - Principal
Components Analysis:
It searches for k n-dimensional orthogonal vectors,
where k ≤ n, that can best be used to represent the data.
The original data are thus projected onto a much
smaller space, resulting in dimensionality reduction.
146. Data Reduction
Dimensionality Reduction - Numerosity Reduction:
For parametric methods, a model is used to estimate the
data, so that typically only the data parameters need to be
stored, instead of the actual data.
Nonparametric methods for storing reduced
representations of the data include histograms, clustering,
and sampling.
147. Data Reduction
Dimensionality Reduction - Numerosity Reduction –
Histograms:
A histogram for an attribute, A, partitions the data distribution of
A into disjoint subsets, or buckets.
If each bucket represents only a single attribute-value/frequency
pair, the buckets are called singleton buckets.
Often, buckets instead represent continuous ranges for the given
attribute.
148. Data Reduction
Histograms - Partitioning rules:
Equal-width: In an equal-width histogram, the width
of each bucket range is uniform
Equal-frequency (or equidepth): In an equal-
frequency histogram, the buckets are created so that,
roughly, the frequency of each bucket is constant
149. Data Reduction
Histograms - Partitioning rules:
V-Optimal: Of all possible histograms for a given number of
buckets, the V-Optimal histogram is the one with the least
variance. Histogram variance is a weighted sum of the original
values that each bucket represents, where bucket weight is
equal to the number of values in the bucket.
MaxDiff: Consider the difference between each pair of adjacent
values. A bucket boundary is established between each of the
β - 1 pairs having the largest differences, where β is the
user-specified number of buckets.
150. Data Reduction
Dimensionality Reduction - Numerosity
Reduction – Sampling:
Sampling can be used as a data reduction
technique because it allows a large data set to be
represented by a much smaller random sample (or
subset) of the data.
151. Data Reduction
Sampling:
Simple random sample without replacement
(SRSWOR) of size s:
drawn by selecting s of the N tuples in D (s < N),
where every tuple is equally likely to be sampled.
Cluster sample: If the tuples in D are grouped into M
mutually disjoint “clusters,” then an SRS of s clusters
can be obtained, where s < M.
152. Data Reduction
Sampling:
Stratified sample: If D is divided into mutually
disjoint parts called strata, a stratified sample of D
is generated by obtaining an SRS at each stratum.
This helps ensure a representative sample,
especially when the data are skewed.
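Stratified sampling can be sketched as an SRSWOR per stratum; the customer data here is a toy example with deliberately skewed strata:

```python
import random

def stratified_sample(records, strata_key, frac, seed=0):
    """Draw an SRSWOR of the given fraction from each stratum,
    so that rare strata stay represented in the sample."""
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(strata_key(r), []).append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))   # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Skewed toy data: 90 "adult" customers and only 10 "senior" customers.
data = [("adult", i) for i in range(90)] + [("senior", i) for i in range(10)]
picked = stratified_sample(data, lambda r: r[0], 0.1)
# A 10% stratified sample keeps both groups: 9 adults and 1 senior.
```

A plain 10% SRS of the same data could, by chance, miss the senior stratum entirely; the stratified version cannot.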
153. Data Reduction
Sampling:
An advantage of sampling for data reduction is that the cost of
obtaining a sample is proportional to the size of the sample, s, as
opposed to N, the data set size.
When applied to data reduction, sampling is most commonly
used to estimate the answer to an aggregate query. It is possible
(using the central limit theorem) to determine a sufficient sample
size for estimating a given function within a specified degree of
error.