This document defines key concepts related to data in data mining. It discusses that data consists of objects and their attributes. Attributes can take on different values and can be nominal, ordinal, interval, or ratio. Attributes can also be discrete or continuous. Different types of data sets are discussed including records, graphs, ordered, and unstructured data. Important characteristics of structured data like dimensionality and sparsity are also covered.
This document discusses informed search strategies and local search algorithms for optimization problems. It covers best-first search, greedy search, A* search, heuristic functions, hill-climbing search, and escaping local optima. Specifically, it provides examples of applying greedy search, A* search, and hill-climbing to solve the 8-puzzle problem and discusses the drawbacks of hill-climbing getting stuck at local maxima.
Forward chaining is a data-driven reasoning method that applies rules to existing facts to deduce new facts, adding them to the knowledge base. It starts with known facts and uses inference rules to reach a goal or conclusion. Backward chaining is a goal-driven method that starts with a desired goal and works backwards to see if existing facts and rules can support reaching that goal. Both methods have tradeoffs in efficiency depending on whether the starting point is facts or a specific goal.
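The data-driven loop described above can be sketched in a few lines. This is a minimal illustration, not the document's own example; the rules and facts ("croaks", "frog", etc.) are invented for demonstration.

```python
# Minimal forward-chaining sketch: rules are (premises, conclusion) pairs.
# Facts and rule contents below are illustrative, not from the document.
def forward_chain(facts, rules):
    """Repeatedly fire rules whose premises are all known until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if set(premises) <= facts and conclusion not in facts:
                facts.add(conclusion)       # deduce a new fact, add to the KB
                changed = True
    return facts

rules = [
    ({"croaks", "eats flies"}, "frog"),
    ({"frog"}, "green"),
]
derived = forward_chain({"croaks", "eats flies"}, rules)
print(sorted(derived))  # ['croaks', 'eats flies', 'frog', 'green']
```

Backward chaining would instead start from the goal ("green") and recurse on the premises of any rule concluding it.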
This document discusses runtime environments and storage allocation strategies. It covers:
- How procedure activations are represented at runtime using activation records, control stacks, and activation trees. Activation records store local variables, parameters, return values, and more.
- Different strategies for allocating storage at runtime, including static allocation where sizes are known at compile time, stack allocation for procedure activations and recursion, and heap allocation for dynamic memory.
- How names are bound to values at compile time through environments and at runtime through states. The scope and lifetime of bindings are also discussed.
- Issues related to mapping names to storage locations and values at runtime, including how assignments change the state but not the environment.
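Selection sort, named above as a brute-force sorting algorithm, can be sketched directly; this is a generic illustration, not code from the document.

```python
# Selection sort: repeatedly select the minimum of the unsorted suffix
# and swap it into place. O(n^2) comparisons -- the brute-force approach.
def selection_sort(a):
    a = list(a)                      # work on a copy
    for i in range(len(a) - 1):
        m = i
        for j in range(i + 1, len(a)):
            if a[j] < a[m]:
                m = j
        a[i], a[m] = a[m], a[i]
    return a

print(selection_sort([29, 10, 14, 37, 13]))  # [10, 13, 14, 29, 37]
```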
The document discusses knowledge representation in cognitive science and artificial intelligence. It describes several ways of representing knowledge, including predicate logic, semantic networks, frames, and conceptual dependency networks. Semantic networks represent knowledge through interconnected nodes and labeled arcs, allowing for inheritance of properties up hierarchical structures. They provide an intuitive way to represent taxonomically structured knowledge but have limitations representing logical statements.
I. Hill climbing algorithm, II. Steepest hill climbing algorithm, by vikas dhakane
Artificial Intelligence: Introduction, Typical Applications. State Space Search: Depth Bounded DFS, Depth First Iterative Deepening. Heuristic Search: Heuristic Functions, Best First Search, Hill Climbing, Variable Neighborhood Descent, Beam Search, Tabu Search. Optimal Search: A* algorithm, Iterative Deepening A*, Recursive Best First Search, Pruning the CLOSED and OPEN Lists.
The document discusses brute force and exhaustive search approaches to solving problems. It provides examples of how brute force can be applied to sorting, searching, and string matching problems. Specifically, it describes selection sort and bubble sort as brute force sorting algorithms. For searching, it explains sequential search and brute force string matching. It also discusses using brute force to solve the closest pair, convex hull, traveling salesman, knapsack, and assignment problems, noting that brute force leads to inefficient exponential time algorithms for TSP and knapsack.
In this presentation, Dmitry Khlebnikov sets forth six broad principles for designing secure IT systems. He also provides a comprehensive overview of "Host-based Security".
Random Oracle Model & Hashing - Cryptography & Network Security, by Mahbubur Rahman
This document discusses hashing and the random oracle model. It defines cryptographic hash functions as deterministic functions that map arbitrary strings to fixed-length outputs in a way that appears random. The random oracle model assumes an ideal hash function that behaves like a random function. The document discusses collision resistance, preimage resistance, and birthday attacks as they relate to finding collisions or preimages with a given hash function. It provides examples of calculating the number of messages an attacker would need to find collisions or preimages with different probabilities. The document concludes by listing some applications of cryptographic hash functions like password storage, file authenticity, and digital signatures.
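The "number of messages an attacker needs" calculation mentioned above is the birthday bound. A sketch using the standard approximation (the function name and parameters are ours, for illustration):

```python
import math

# Birthday-bound estimate: number of random messages needed so that the
# probability of at least one collision in an n-bit hash reaches p.
# Standard approximation: k = sqrt(2 * 2^n * ln(1/(1-p))).
def birthday_messages(n_bits, p):
    return math.sqrt(2 * (2 ** n_bits) * math.log(1 / (1 - p)))

# For a 64-bit hash and a 50% collision probability, roughly 2^32.2 messages:
k = birthday_messages(64, 0.5)
print(f"{k:.3e}")                 # about 5.06e9 messages
print(round(math.log2(k), 1))     # 32.2
```

This is why an n-bit hash offers only about n/2 bits of collision resistance.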
Means-Ends Analysis
Ways to play
Game trees
Game Tree and Heuristic Evaluation
Minimax Evaluation of Game Trees
Minimax with Alpha-Beta Pruning
Game tree numericals
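The minimax-with-alpha-beta topic above can be sketched over an explicit game tree. The list-of-lists tree representation is ours, chosen for brevity; it is not from the original notes.

```python
# Minimax with alpha-beta pruning over an explicit game tree.
# A tree is either a number (leaf utility) or a list of child trees.
def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, (int, float)):      # leaf: return its utility
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:               # beta cutoff: MIN will avoid this branch
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:               # alpha cutoff
                break
        return value

# Classic two-ply numerical example: MAX to move over three MIN nodes.
tree = [[3, 5], [2, 9], [0, 7]]
print(alphabeta(tree, True))  # 3
```

The pruning never changes the minimax value; it only skips branches that cannot affect it.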
This document discusses handling uncertainty through probabilistic reasoning and machine learning techniques. It covers sources of uncertainty like incomplete data, probabilistic effects, and uncertain outputs from inference. Approaches covered include Bayesian networks, Bayes' theorem, conditional probability, joint probability distributions, and Dempster-Shafer theory. It provides examples of calculating conditional probabilities and using Bayes' theorem. Bayesian networks are defined as directed acyclic graphs representing probabilistic dependencies between variables, and examples show how to represent domains of uncertainty and perform probabilistic reasoning using a Bayesian network.
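A worked Bayes' theorem computation of the kind the summary mentions; the prior and test accuracies below are illustrative numbers, not taken from the document.

```python
# Bayes' theorem: P(disease | positive) from the prior, the sensitivity
# P(positive | disease), and the false-positive rate P(positive | no disease).
def bayes(prior, p_pos_given_d, p_pos_given_not_d):
    evidence = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
    return p_pos_given_d * prior / evidence

# 1% prevalence, 99% sensitivity, 5% false positives:
posterior = bayes(prior=0.01, p_pos_given_d=0.99, p_pos_given_not_d=0.05)
print(round(posterior, 4))  # 0.1667
```

Even with an accurate test, the low prior keeps the posterior modest, which is the usual lesson of such examples.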
This lecture covers the General Problem Solver (GPS), a universal problem-solving machine that applies the same base algorithm to different problems. It is intended for BS Computer Science students. It is for learning purposes and not fully polished; there may be errors or mistakes, so corrections and suggestions are welcome.
The document discusses different knowledge representation schemes used in artificial intelligence systems. It describes semantic networks, frames, propositional logic, first-order predicate logic, and rule-based systems. For each technique, it provides facts about how knowledge is represented and examples to illustrate their use. The goal of knowledge representation is to encode knowledge in a way that allows inferencing and learning of new knowledge from the facts stored in the knowledge base.
This document discusses modal logics and formalisms. It defines modal logics as logics that add new logical constants like necessity (□) and possibility (◇) to classical logic. It describes how modal logics can be classified based on whether they are extended logics that add new well-formed formulas or deviant logics that interpret the usual logical constants differently. The document then focuses on modal logics, defining their language and providing details on their model theory using possible world semantics. It discusses truth in possible worlds and models. It also describes several axiomatic modal systems and the relationships between them, and examines the classes of models validated by different axioms.
Hill climbing is a heuristic search algorithm used to find optimal solutions to mathematical problems. It works by starting with an initial solution and iteratively moving to a neighboring solution that improves the value of an objective function until a local optimum is reached. However, hill climbing may not find the global optimum solution and can get stuck in local optima. Variants include simple hill climbing, steepest ascent hill climbing, and stochastic hill climbing.
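The iterative improve-until-local-optimum loop can be sketched on a one-dimensional objective; the function and step size here are illustrative.

```python
# Steepest-ascent hill climbing on a 1-D objective: move to the best
# neighbor while it improves; stop at a local optimum.
def hill_climb(f, x, step=0.1, max_iters=1000):
    for _ in range(max_iters):
        neighbors = [x - step, x + step]
        best = max(neighbors, key=f)
        if f(best) <= f(x):       # no improving neighbor: local optimum
            return x
        x = best
    return x

# f has its maximum at x = 2, where f(2) = 4.
f = lambda x: 4 - (x - 2) ** 2
x_opt = hill_climb(f, x=0.0)
print(round(x_opt, 6))  # 2.0
```

On a multimodal objective the same loop would stop at whichever local peak is nearest the start, which is exactly the drawback described above.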
The document discusses recursion and lists in Prolog. It covers recursive definitions, clause ordering and termination, lists, members, and recursing down lists. Recursive definitions allow a predicate to refer to itself in its own definition. Clause ordering can impact termination and procedural meaning. Lists are a fundamental recursive data structure in Prolog, consisting of elements separated by commas and enclosed in brackets. The member predicate checks if an element is in a list by recursively working down the list. Recursing down lists is common for tasks like comparing lists or copying elements between lists.
The DENCLUE algorithm employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Observations going to the same local maximum are put into the same cluster. Clearly, DENCLUE doesn't work on data with uniform distribution.
The document discusses local search algorithms for optimization problems, including hill climbing, simulated annealing, and Tabu search. Hill climbing performs a local search by iteratively moving to neighbor states with improved cost until a local optimum is reached. Simulated annealing allows some "bad" moves with decreasing probability to help escape local optima. Tabu search uses a tabu list to avoid getting stuck in cycles and encourages exploring new regions of the search space. These local search methods are suitable for problems where the solution is the goal state itself rather than the path to get there.
Genetic algorithms are a type of artificial intelligence search technique inspired by natural selection. They work by randomly generating an initial population of solutions, evaluating their fitness, then breeding new solutions through selection, crossover and mutation over many generations until an optimal solution is found. Some key steps include randomly initializing a population, determining fitness, selecting parents, performing crossover on parents to create new solutions, mutating new solutions, determining fitness of new population, and repeating until a stopping criteria is met such as a good enough solution being found. Genetic algorithms have been applied to many optimization and search problems across various domains.
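The initialize/evaluate/select/crossover/mutate loop can be sketched on the OneMax toy problem (maximize the number of 1-bits). The population size, rates, and selection scheme below are illustrative choices, not prescribed by the document.

```python
import random

# Tiny genetic algorithm for OneMax: evolve 20-bit strings toward all ones.
random.seed(0)                                   # deterministic demo
N, LEN, GENS = 30, 20, 60

def fitness(bits):
    return sum(bits)                             # number of 1s

def crossover(a, b):
    cut = random.randrange(1, LEN)               # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.01):
    return [1 - b if random.random() < rate else b for b in bits]

pop = [[random.randint(0, 1) for _ in range(LEN)] for _ in range(N)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[: N // 2]                      # truncation selection (elitist)
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(N - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
print(fitness(best))
```

Keeping the parents each generation makes the best fitness non-decreasing, a common simple form of elitism.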
A Primality test is an algorithm for determining whether an input number is Prime. Among other fields of mathematics, it is used for Cryptography. Factorization is thought to be a computationally difficult problem, whereas primality testing is comparatively easy (its running time is polynomial in the size of the input).
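A concrete polynomial-time primality test is Miller-Rabin; with a fixed set of witness bases it is deterministic for all 64-bit inputs. This is a standard construction, offered here as an illustration.

```python
# Deterministic Miller-Rabin for 64-bit integers. Its cost is polynomial in
# the number of digits of n, which is why primality testing is "easy"
# compared to factoring.
def is_prime(n):
    if n < 2:
        return False
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small:
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    # The first 12 primes as witnesses are known to be sufficient
    # for all n < 3.3 * 10^24, far beyond 2^64.
    for a in small:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False        # a is a witness that n is composite
    return True

print(is_prime(2_147_483_647))  # True (the Mersenne prime 2^31 - 1)
print(is_prime(2_147_483_649))  # False
```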
The Rabin-Karp algorithm with its hash function and hash-collision handling: analysis, the algorithm itself, and implementation code. It also covers applications of the Rabin-Karp algorithm.
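A compact Rabin-Karp implementation showing the rolling hash and the collision check; the base and modulus are the usual textbook demo values.

```python
# Rabin-Karp: roll a polynomial hash across the text and verify
# character-by-character on hash matches (handles hash collisions).
def rabin_karp(text, pattern, base=256, mod=101):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(base, m - 1, mod)           # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)              # verified match, not just a hash hit
        if i < n - m:                   # roll: drop text[i], add text[i+m]
            t_hash = ((t_hash - ord(text[i]) * h) * base + ord(text[i + m])) % mod
    return hits

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

The explicit substring comparison on a hash match is what keeps collisions from producing false positives.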
The document discusses Diffie-Hellman key exchange, which is the first public key algorithm published in 1976. It allows two parties that have no prior knowledge of each other to jointly establish a shared secret key over an insecure communications channel. This key can then be used to encrypt subsequent communications using a symmetric key cipher. The security of the algorithm relies on the difficulty of solving the discrete logarithm problem in finite fields.
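The exchange can be traced with deliberately tiny numbers; real deployments use primes of 2048 bits or more, so these values are purely illustrative.

```python
# Toy Diffie-Hellman: both parties derive the same shared secret from
# public values, without ever transmitting their private keys.
p, g = 23, 5                 # public prime modulus and generator

a = 6                        # Alice's private key
b = 15                       # Bob's private key

A = pow(g, a, p)             # Alice sends g^a mod p
B = pow(g, b, p)             # Bob sends g^b mod p

secret_alice = pow(B, a, p)  # (g^b)^a mod p
secret_bob   = pow(A, b, p)  # (g^a)^b mod p

print(A, B, secret_alice, secret_bob)  # 8 19 2 2
```

An eavesdropper sees p, g, A, and B, but recovering a or b from them is the discrete logarithm problem the summary mentions.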
Semantic nets are a knowledge representation scheme that uses nodes and labeled directed arcs to encode knowledge. Nodes represent objects, concepts, and events, while arcs represent relationships between nodes. Frames are a similar representation that uses slots and fillers to represent entities and their attributes. Both semantic nets and frames allow for inheritance of properties along relationships. More expressive description logics were later developed that combine frame-like representations with formal semantics and classification capabilities. Large knowledge bases like CYC have been created using these representations to encode common-sense knowledge.
This document discusses message authentication techniques including message encryption, message authentication codes (MACs), and hash functions. It describes how each technique can be used to authenticate messages and protect against various security threats. It also covers how symmetric and asymmetric encryption can provide authentication when used with MACs or digital signatures. Specific MAC and hash functions are examined like HMAC, SHA-1, and SHA-2. X.509 is introduced as a standard for digital certificates.
This document discusses and compares several algorithms for string matching:
1. The naive algorithm compares characters one by one and has O(mn) runtime, where m and n are the lengths of the pattern and text.
2. Rabin-Karp uses hashing to compare substrings, running in O(m+n) expected time (O(mn) in the worst case, when many hash values collide). It calculates hash values for the pattern and for each text substring.
3. Knuth-Morris-Pratt improves on the naive algorithm by precomputing a prefix (failure) function from the pattern, effectively a state machine that tells the search how far it can safely skip, avoiding re-checking characters and running in O(m+n) time.
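The prefix function at the heart of Knuth-Morris-Pratt can be shown compactly; this is the standard construction, sketched here for illustration.

```python
# KMP prefix (failure) function: pi[i] is the length of the longest proper
# prefix of pattern[:i+1] that is also a suffix of it.
def prefix_function(pattern):
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]           # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi

def kmp_search(text, pattern):
    pi, k, hits = prefix_function(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = pi[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):       # full match ending at position i
            hits.append(i - k + 1)
            k = pi[k - 1]
    return hits

print(prefix_function("ababaca"))         # [0, 0, 1, 2, 3, 0, 1]
print(kmp_search("ababcababa", "ababa"))  # [5]
```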
The document summarizes the Advanced Encryption Standard (AES). It describes how AES was selected by NIST as a replacement for DES. AES (Rijndael cipher) uses a block size of 128 bits, with key sizes of 128, 192, or 256 bits. It operates on data in rounds that include byte substitution, shifting rows, mixing columns, and adding the round key. The key is expanded into an array of words used for each round.
This document discusses the concept of data in data mining. It defines data as a collection of objects and their attributes. Attributes describe objects and can take on attribute values. Attributes can be nominal, ordinal, interval or ratio depending on their properties. Datasets can consist of records, documents, transactions or graphs. The document also discusses data quality issues like noise, outliers, missing values and duplicates. Finally, it covers preprocessing techniques like aggregation, sampling, dimensionality reduction and discretization.
This document defines key concepts related to data in data mining. It discusses what data is, the different types of attributes and attribute values, different data types including nominal, ordinal, interval and ratio, and examples of different data sets such as records, documents, transactions and graphs. It also covers topics such as data quality issues including noise, outliers, missing values and duplicates.
This document provides an overview of key concepts related to data and data preprocessing. It defines data as a collection of objects and their attributes. Attributes can be nominal, ordinal, interval, or ratio. Data can take the form of records, graphs, ordered sequences, or other types. The document discusses attribute values, data quality issues like noise, outliers, and missing values. It also covers common preprocessing techniques like aggregation, sampling, dimensionality reduction, feature selection and creation, and discretization. Finally, it introduces concepts of similarity and dissimilarity measures between data objects.
Data Mining Basics and Complete Description, by Sulman Ahmed
This course is all about data mining techniques: how we mine data and obtain optimized results.
The document discusses classification, which involves using a training dataset containing records with attributes and classes to build a model that can predict the class of previously unseen records. It describes dividing the dataset into training and test sets, with the training set used to build the model and test set to validate it. Decision trees are presented as a classification method that splits records recursively into partitions based on attribute values to optimize a metric like GINI impurity, allowing predictions to be made by following the tree branches. The key steps of building a decision tree classifier are outlined.
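The GINI metric mentioned above can be computed directly. The class counts below are invented for illustration.

```python
# GINI impurity of a node: 1 - sum(p_i^2) over class proportions p_i.
# A split's quality is the size-weighted impurity of its children.
def gini(counts):
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_gini(children):
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

print(gini([10, 0]))   # 0.0  (pure node)
print(gini([5, 5]))    # 0.5  (maximum impurity for two classes)
# Splitting a [7, 5] parent into [6, 1] and [1, 4] lowers weighted impurity:
print(round(split_gini([[6, 1], [1, 4]]), 4))  # 0.2762
```

A decision-tree builder would evaluate this weighted impurity for each candidate split and pick the lowest.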
This document provides an overview of getting to know data through data mining and data warehousing. It defines key concepts like data objects, attributes, attribute types, data sets, and data quality issues. Data objects are described by a set of attributes, which can be qualitative like nominal or ordinal, or quantitative like interval or ratio scaled. Different types of data sets are discussed including data matrices, documents, transactions, graphs, and ordered data. Common data quality problems addressed are noise, outliers, missing values, and duplicate data. Methods for measuring similarity and dissimilarity between data objects are also introduced.
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining
by
Tan, Steinbach, Kumar
What is Data? A collection of data objects and their attributes.
- An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc.
- An attribute is also known as a variable, field, characteristic, or feature.
- A collection of attributes describes an object.
- An object is also known as a record, point, case, sample, entity, or instance.
(Slide figure: an example data table, with attributes as columns and objects as rows.)
Attribute Values. Attribute values are numbers or symbols assigned to an attribute.

Distinction between attributes and attribute values:
- The same attribute can be mapped to different attribute values. Example: height can be measured in feet or meters.
- Different attributes can be mapped to the same set of values. Example: attribute values for ID and age are integers, but the properties of the attribute values can differ: ID has no limit, while age has a maximum and minimum value.
Measurement of Length. The way you measure an attribute may not match the attribute's properties.
Types of Attributes. There are different types of attributes:
- Nominal. Examples: ID numbers, eye color, zip codes.
- Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}.
- Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
- Ratio. Examples: temperature in Kelvin, length, time, counts.
Properties of Attribute Values. The type of an attribute depends on which of the following properties it possesses:
- Distinctness: =, ≠
- Order: <, >
- Addition: +, -
- Multiplication: *, /

Nominal attribute: distinctness. Ordinal attribute: distinctness & order. Interval attribute: distinctness, order & addition. Ratio attribute: all 4 properties.
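This property hierarchy determines which summary statistics are meaningful for each attribute type. A small sketch with illustrative data (the ordinal coding of shirt sizes is ours):

```python
from statistics import mode, median, mean

# nominal -> mode; ordinal -> median; interval/ratio -> mean (and more).
eye_color = ["brown", "blue", "brown", "green"]   # nominal
shirt_size = [1, 2, 2, 3]    # ordinal, coded small=1, medium=2, large=3
temp_c = [21.0, 23.5, 19.5]  # interval: differences meaningful, ratios not

print(mode(eye_color))             # brown
print(median(shirt_size))          # 2.0
print(round(mean(temp_c), 2))      # 21.33

# Note: "23.5 C is 20% hotter than 19.5 C" is NOT meaningful, because
# Celsius is an interval scale; convert to Kelvin (ratio scale) first.
```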
Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
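The "Operations" column can be illustrated with Python's standard `statistics` module. A sketch under assumed sample data; the point is which statistic is meaningful for each type, per the table above:

```python
import statistics

# Nominal: only the mode is meaningful (no order, no arithmetic).
eye_color = ["brown", "blue", "brown", "green"]
print(statistics.mode(eye_color))          # 'brown'

# Ordinal: the median is meaningful because values can be ordered
# (grades coded as integers here, for illustration).
grades = [2, 3, 1, 3, 2]
print(statistics.median(grades))           # 2

# Interval: the arithmetic mean is meaningful because differences are.
temps_c = [20.0, 22.5, 19.5]
print(round(statistics.mean(temps_c), 2))  # 20.67
```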
Attribute Level: Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?

Attribute Level: Ordinal
  Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Attribute Level: Interval
  Transformation: new_value = a * old_value + b, where a and b are constants.
  Comments: The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).

Attribute Level: Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.
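The permissible transformations can be checked concretely. A minimal sketch, assuming two hypothetical ordinal codings and the standard Celsius/Fahrenheit and meter/foot conversions:

```python
# Ordinal: any monotonic (order-preserving) recoding is permissible.
# {good, better, best} -> {1, 2, 3} and -> {0.5, 1, 10} both preserve order.
coding_a = {"good": 1, "better": 2, "best": 3}
coding_b = {"good": 0.5, "better": 1, "best": 10}
order = ["good", "better", "best"]
assert [coding_a[x] for x in order] == sorted(coding_a[x] for x in order)
assert [coding_b[x] for x in order] == sorted(coding_b[x] for x in order)

# Interval: a linear transformation new = a * old + b is permissible.
# Celsius -> Fahrenheit uses a = 9/5, b = 32.
def c_to_f(c: float) -> float:
    return 9 / 5 * c + 32

print(c_to_f(100.0))  # 212.0

# Ratio: only a pure rescaling new = a * old is permissible,
# e.g. meters -> feet with a = 3.28084 (zero stays fixed).
def m_to_ft(m: float) -> float:
    return m * 3.28084

print(round(m_to_ft(1.0), 2))  # 3.28
```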
This document provides an overview of different types of data that can be analyzed using data mining and machine learning techniques. It discusses record data, data matrices, document data, transaction data, graph data, ordered data, and more. It also covers important data quality issues like noise, outliers, missing values, and duplicate data. Common data preprocessing techniques are explained such as aggregation, sampling, dimensionality reduction, feature selection and creation, and attribute transformation. Finally, measures of similarity and dissimilarity between data objects are introduced, including Euclidean distance and Minkowski distance.
The document provides an overview of data mining, including:
1) It defines data mining as the process of extracting patterns from large datasets that are valid, novel, useful and understandable.
2) It discusses some of the challenges of data mining like dealing with noise and missing data and not overfitting models.
3) It outlines several common data mining tasks like classification, clustering, association rule mining and sequential pattern mining.
This document provides an introduction to data mining. It discusses why organizations mine data from both a commercial and scientific viewpoint. Large amounts of data are being collected but not fully analyzed. Data mining can help discover useful patterns and information that is hidden within large datasets. The document defines data mining and differentiates it from simple queries. It outlines some common data mining tasks like classification, clustering, association rule mining, and their applications. Overall, the document serves as a high-level overview of the key concepts and motivations behind data mining.
This document discusses data objects, attributes, and data types. It begins by defining a data object as an entity with attributes that describe its characteristics. Attributes can be nominal, ordinal, interval, ratio, discrete, or continuous. The document then discusses different types of data structures like records, graphs, ordered data, and more. It also covers measuring similarity and dissimilarity between data objects using distances and properties of good distance measures. In summary, the document provides an overview of fundamental concepts in data including objects, attributes, data types, structures, and measuring similarity.
Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Attribute Level: Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?

Attribute Level: Ordinal
  Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Attribute Level: Interval
  Transformation: new_value = a * old_value + b, where a and b are constants.
  Comments: The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).

Attribute Level: Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.