Data for Data Mining
What is Data?
- Data is a collection of data objects and their attributes.
- An attribute is a property or characteristic of an object. Examples: a person's eye color, temperature.
- A collection of attributes describes an object.
- An object is also known as a record, point, case, sample, entity, or instance.
Types of Attributes
There are four types of attributes:
- Nominal. Examples: ID numbers, eye color, zip codes.
- Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}.
- Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit.
- Ratio. Examples: temperature in Kelvin, length, time, counts.
Attribute Types: Description, Operations, Examples, and Applicable Statistics
Nominal
  Description: The values of a nominal attribute provide only enough information to distinguish one object from another.
  Operations: =, ≠
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Statistics: mode, entropy, contingency correlation, χ² test
Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects.
  Operations: <, >
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Statistics: median, percentiles, rank correlation, run tests, sign tests
Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists.
  Operations: +, -
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Statistics: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
  Description: For ratio attributes, both differences and ratios are meaningful.
  Operations: *, /
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Statistics: geometric mean, harmonic mean, percent variation
Discrete Attribute
- Has only a finite or countably infinite set of values.
- Examples: zip codes, counts, or the set of words in a collection of documents.
- Often represented as integer variables.
- Note: binary attributes are a special case of discrete attributes.
Continuous Attribute
- Has real numbers as attribute values.
- Examples: temperature, height, or weight.
- In practice, real values can only be measured and represented using a finite number of digits.
- Continuous attributes are typically represented as floating-point variables.
Types of Data Sets
- Record: data matrix, document data, transaction data
- Graph: World Wide Web, molecular structures
- Ordered: spatial data, temporal data, sequential data, genetic sequence data
Important Characteristics of Structured Data
- Dimensionality: the curse of dimensionality
- Sparsity: only presence counts
- Resolution: patterns depend on the scale
Data Quality
- What kinds of data quality problems are there?
- How can we detect problems with the data?
- What can we do about these problems?
Examples of data quality problems:
- Noise and outliers
- Missing values
- Duplicate data
Data Preprocessing
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object).
Purpose:
- Data reduction: reduce the number of attributes or objects.
- Change of scale: cities aggregated into regions, states, countries, etc.
- More "stable" data: aggregated data tends to have less variability.
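The change-of-scale case above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the `sales` records, the region names, and the helper `aggregate_by_region` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical city-level records: (city, region, sales amount).
sales = [
    ("Austin", "South", 120.0),
    ("Dallas", "South", 80.0),
    ("Boston", "Northeast", 200.0),
    ("Albany", "Northeast", 100.0),
]


def aggregate_by_region(records):
    """Combine city-level objects into one object per region by summing amounts."""
    totals = defaultdict(float)
    for _city, region, amount in records:
        totals[region] += amount
    return dict(totals)


region_totals = aggregate_by_region(sales)
print(region_totals)
```

Four city-level objects become two region-level objects, which is both a data reduction and a change of scale.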
Sampling
- Sampling is the main technique employed for data selection.
- It is often used for both the preliminary investigation of the data and the final data analysis.
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming.
- Data miners sample because processing the entire set of data of interest is too expensive or time-consuming.
Types of Sampling
- Simple random sampling: there is an equal probability of selecting any particular item.
  - Sampling without replacement: as each item is selected, it is removed from the population.
  - Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.
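The sampling schemes listed above map directly onto Python's standard `random` module. The following is a sketch under assumptions: the function names and the `key` argument used to define strata are choices made for this example, not something the slides prescribe.

```python
import random
from collections import defaultdict


def sample_without_replacement(items, n, rng):
    """Each selected item is removed from the pool, so no item repeats."""
    return rng.sample(items, n)


def sample_with_replacement(items, n, rng):
    """Items stay in the pool, so the same item can be drawn more than once."""
    return rng.choices(items, k=n)


def stratified_sample(items, key, n_per_stratum, rng):
    """Partition items by key(item), then draw a random sample from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # Never ask for more items than the stratum holds.
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(20))
print(sample_without_replacement(population, 5, rng))
print(sample_with_replacement(population, 5, rng))
```

Stratified sampling is useful when some partitions are small: a simple random sample could miss them entirely, while drawing from each partition guarantees that every stratum is represented.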
Dimensionality Reduction
Purpose:
- Avoid the curse of dimensionality.
- Reduce the amount of time and memory required by data mining algorithms.
- Allow data to be more easily visualized.
- May help to eliminate irrelevant features or reduce noise.
Techniques:
- Principal Component Analysis
- Singular Value Decomposition
- Others: supervised and non-linear techniques
Feature Subset Selection
Another way to reduce the dimensionality of data.
- Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid.
- Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.
Techniques
- Brute-force approach: try all possible feature subsets as input to the data mining algorithm.
- Embedded approaches: feature selection occurs naturally as part of the data mining algorithm.
- Filter approaches: features are selected before the data mining algorithm is run.
- Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.
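As a rough sketch of the brute-force approach, the code below enumerates every non-empty feature subset and keeps the one with the highest score. The scoring function stands in for "run the data mining algorithm and measure its quality"; the feature names, the `usefulness` values, and the per-feature penalty are invented for illustration only.

```python
from itertools import combinations


def best_feature_subset(features, score):
    """Try every non-empty subset of features and return the best-scoring one."""
    best, best_score = (), float("-inf")
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


# Hypothetical usefulness of each feature for predicting GPA.
usefulness = {"study_hours": 0.6, "attendance": 0.3, "student_id": 0.0}


def toy_score(subset):
    """Stand-in for model quality: reward useful features, penalize subset size."""
    return sum(usefulness[f] for f in subset) - 0.05 * len(subset)


features = ["study_hours", "attendance", "student_id"]
best, best_score = best_feature_subset(features, toy_score)
print(best)
```

With n features there are 2^n - 1 non-empty subsets, so this exhaustive search is only practical for a small number of features, which is why the filter, embedded, and wrapper approaches exist.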
Feature Creation
Create new attributes that can capture the important information in a data set much more efficiently than the original attributes.
Three general methodologies:
- Feature extraction (domain-specific)
- Mapping data to a new space
- Feature construction (combining features)
Discretization and Binarization
- Transforming a continuous attribute into a categorical attribute is known as discretization.
- Transforming both continuous and discrete attributes into one or more binary attributes is known as binarization.
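One common way to binarize a categorical attribute is to create one binary attribute per distinct value. The slides do not name a specific encoding, so the following is an assumed illustration; the function name `binarize` and the sample values are hypothetical.

```python
def binarize(values):
    """Map each categorical value to a row of 0/1 flags, one flag per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


eye_colors = ["brown", "blue", "brown", "green"]
categories, rows = binarize(eye_colors)
for color, row in zip(eye_colors, rows):
    print(color, row)
```

Each row contains exactly one 1, in the column of the object's original value.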
Discretization Techniques
[Figure: the original data compared with equal-width discretization, equal-frequency discretization, and K-means discretization]
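The first two techniques named on this slide can be sketched as follows. This is an illustrative implementation under assumptions (bins are numbered from 0, and the largest value is placed in the last bin); K-means discretization is omitted to keep the sketch short.

```python
def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins so that each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


data = [1, 2, 3, 4, 100, 101]
print(equal_width_bins(data, 2))
print(equal_frequency_bins(data, 2))
```

On skewed data the two methods disagree: equal-width bins follow the value range and can be very unevenly filled, while equal-frequency bins follow the rank order of the values.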
Attribute Transformation
- A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.
- Simple functions: $x^k$, $\log(x)$, $e^x$, $|x|$
- Standardization and normalization
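Standardization and normalization are not defined further on the slide. A common reading, assumed here, is z-score standardization (subtract the mean, divide by the standard deviation) and min-max normalization (rescale to [0, 1]). Both helpers below assume the attribute is not constant, since a constant attribute would cause a division by zero.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: the result has mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes sigma > 0, i.e. values are not all equal
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150.0, 160.0, 170.0]
print(standardize(heights_cm))
print(min_max_normalize(heights_cm))
```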
Similarity and Dissimilarity
Similarity:
- A numerical measure of how alike two data objects are.
- Higher when objects are more alike.
- Often falls in the range [0, 1].
Dissimilarity:
- A numerical measure of how different two data objects are.
- Lower when objects are more alike.
- The minimum dissimilarity is often 0; the upper limit varies.
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Dissimilarities between Data Objects
Distances:
- Euclidean distance
- Minkowski distance
Euclidean Distance
$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$
where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.
Standardization is necessary if scales differ.
Minkowski Distance
$\mathrm{dist}(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$
where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.
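A minimal sketch of the two distances follows. The Minkowski distance with parameter r takes the r-th root of the summed r-th powers of the absolute attribute differences, and the Euclidean distance is the special case r = 2. The function names are chosen for this example.

```python
def minkowski_distance(p, q, r):
    """Minkowski distance between equal-length vectors p and q with parameter r."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def euclidean_distance(p, q):
    """Euclidean distance: the Minkowski distance with r = 2."""
    return minkowski_distance(p, q, 2)


p = (0.0, 0.0)
q = (3.0, 4.0)
print(minkowski_distance(p, q, 1))  # r = 1: sum of absolute differences
print(euclidean_distance(p, q))     # r = 2: straight-line distance
```

Because attributes with larger scales dominate the sum, it is usually worth standardizing the attributes before computing either distance.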
Similarities between Data Objects
Similarities also have some well-known properties:
- s(p, q) = 1 (or maximum similarity) only if p = q.
- s(p, q) = s(q, p) for all p and q (symmetry).
Here s(p, q) is the similarity between points (data objects) p and q.
Similarity Between Binary Vectors
A common situation is that objects p and q have only binary attributes.
Compute similarities using the following quantities:
- M01 = the number of attributes where p was 0 and q was 1
- M10 = the number of attributes where p was 1 and q was 0
- M00 = the number of attributes where p was 0 and q was 0
- M11 = the number of attributes where p was 1 and q was 1
Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)
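The two coefficients can be computed directly from the four counts. The vectors `p` and `q` below are illustrative values chosen for this sketch, and returning 0.0 for the Jaccard coefficient when both vectors are all zeros is an assumption of this sketch, since the coefficient is undefined in that case.

```python
def binary_similarities(p, q):
    """Return (SMC, Jaccard) for two equal-length 0/1 vectors."""
    pairs = list(zip(p, q))
    m01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    m10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    m00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    m11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    not_both_zero = m01 + m10 + m11
    # Jaccard is undefined when every attribute is 0 in both vectors;
    # this sketch returns 0.0 in that case.
    jaccard = m11 / not_both_zero if not_both_zero else 0.0
    return smc, jaccard


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, jaccard = binary_similarities(p, q)
print(smc, jaccard)
```

Here M01 = 2, M10 = 1, M00 = 7, and M11 = 0, so SMC is 0.7 while Jaccard is 0. The gap shows why Jaccard is preferred for sparse data: SMC counts the many shared zeros as matches, while Jaccard ignores them.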
Cosine Similarity
If $d_1$ and $d_2$ are two document vectors, then
$\cos(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
where $\cdot$ indicates the vector dot product and $\|d\|$ is the length of vector d.
Example
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
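The worked example can be checked with a few lines of Python. This is a small sketch using only the standard library; the function name `cosine_similarity` is chosen for this example.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
cos_d1_d2 = cosine_similarity(d1, d2)
print(round(cos_d1_d2, 4))
```

The dot product is 5 and the two lengths are the square roots of 42 and 6, so the result rounds to 0.3150.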
Extended Jaccard Coefficient (Tanimoto)
- A variation of Jaccard for continuous or count attributes.
- Reduces to Jaccard for binary attributes.
$T(p, q) = \dfrac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$
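The following sketch uses the usual definition of the Tanimoto coefficient, the dot product divided by the sum of the squared lengths minus the dot product, and checks on a small binary example that it agrees with the Jaccard coefficient. The vectors are illustrative, and the function assumes at least one vector is non-zero.

```python
def tanimoto(p, q):
    """Extended Jaccard: p.q / (||p||^2 + ||q||^2 - p.q)."""
    dot = sum(a * b for a, b in zip(p, q))
    pp = sum(a * a for a in p)
    qq = sum(b * b for b in q)
    return dot / (pp + qq - dot)  # assumes p and q are not both all-zero


# Binary case: M11 = 1, M01 = 1, M10 = 1, so Jaccard = 1 / 3.
binary_result = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])

# Count-valued case, e.g. term counts in two documents.
count_result = tanimoto([3, 2, 0, 5], [1, 0, 0, 2])
print(binary_result, count_result)
```

For 0/1 vectors the dot product equals M11 and each squared length equals the number of ones in that vector, which is why the formula collapses to M11 / (M01 + M10 + M11).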
Correlation
- Correlation measures the linear relationship between objects.
- To compute correlation, we standardize data objects p and q and then take their dot product, scaled by the number of attributes (n, or n - 1 when the sample standard deviation is used).
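A sketch of that recipe: standardize each object's attribute values, take the dot product, and divide by the number of attributes n. Dividing by n pairs with the population standard deviation used here (with the sample standard deviation the divisor would be n - 1); that pairing is a choice made for this example. The function assumes neither object is constant.

```python
from statistics import mean, pstdev


def correlation(p, q):
    """Pearson correlation via standardization followed by a dot product."""
    n = len(p)
    mp, mq = mean(p), mean(q)
    sp, sq = pstdev(p), pstdev(q)  # assumes sp > 0 and sq > 0
    p_std = [(a - mp) / sp for a in p]
    q_std = [(b - mq) / sq for b in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / n


print(correlation([1, 2, 3], [2, 4, 6]))  # perfectly linear, increasing
print(correlation([1, 2, 3], [3, 2, 1]))  # perfectly linear, decreasing
```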
General Approach for Combining Similarities
Sometimes attributes are of many different types, but an overall similarity is needed.
Using Weights to Combine Similarities
- We may not want to treat all attributes the same.
- Use weights $w_k$ that are between 0 and 1 and sum to 1.
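The slide's combining formula did not survive extraction, so the following is only one plausible reading: a weighted sum of per-attribute similarities, with the weights constrained as stated above. The function name and the sample numbers are hypothetical.

```python
def weighted_similarity(similarities, weights):
    """Combine per-attribute similarities s_k using weights w_k (assumed form)."""
    if len(similarities) != len(weights):
        raise ValueError("need one weight per attribute similarity")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("each weight must lie between 0 and 1")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, similarities))


# Hypothetical per-attribute similarities, each already scaled to [0, 1].
per_attribute = [1.0, 0.5, 0.0]
weights = [0.5, 0.3, 0.2]
overall = weighted_similarity(per_attribute, weights)
print(overall)
```

Because the weights sum to 1 and each per-attribute similarity lies in [0, 1], the overall similarity also stays in [0, 1].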
Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net

Data For Datamining

  • 1. Data for Data Mining
  • 2. What is Data? Data is a collection of data objects and their attributes. An attribute is a property or characteristic of an object, e.g., a person's eye color or a temperature. A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance.
  • 3. Types of Attributes. There are four types of attributes. Nominal: ID numbers, eye color, zip codes. Ordinal: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}. Interval: calendar dates, temperatures in Celsius or Fahrenheit. Ratio: temperature in Kelvin, length, time, counts.
  • 4. Attribute types: description, examples, and operations.
      • Nominal: the values provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
      • Ordinal: the values provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run and sign tests.
      • Interval: the differences between values are meaningful, i.e., a unit of measurement exists (+, -). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
      • Ratio: both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.
  • 5. Discrete Attribute. Has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes.
  • 6. Continuous Attribute. Has real numbers as attribute values. Examples: temperature, height, or weight. In practice, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.
  • 7. Types of Data Sets. Record: data matrix, document data, transaction data. Graph: World Wide Web, molecular structures.
  • 8. Types of Data Sets (contd.). Ordered: spatial data, temporal data, sequential data, genetic sequence data.
  • 9. Important Characteristics of Structured Data. Dimensionality: the curse of dimensionality. Sparsity: only presence counts. Resolution: patterns depend on the scale.
  • 10. Data Quality. What kinds of data quality problems are there? How can we detect problems with the data? What can we do about these problems?
  • 11. Examples of data quality problems: noise and outliers, missing values, duplicate data.
  • 12. Data Preprocessing: aggregation, sampling, dimensionality reduction, feature subset selection, feature creation, discretization and binarization, attribute transformation.
  • 13. Aggregation. Combining two or more attributes (or objects) into a single attribute (or object). Purpose: data reduction (reduce the number of attributes or objects); change of scale (cities aggregated into regions, states, countries, etc.); more “stable” data (aggregated data tends to have less variability).
  • 14. Sampling. Sampling is the main technique employed for data selection. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
  • 15. Types of Sampling. Simple random sampling: there is an equal probability of selecting any particular item. Sampling without replacement: as each item is selected, it is removed from the population. Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once. Stratified sampling: split the data into several partitions, then draw random samples from each partition.
  • 16. Dimensionality Reduction. Purpose: avoid the curse of dimensionality; reduce the amount of time and memory required by data mining algorithms; allow data to be more easily visualized; possibly help to eliminate irrelevant features or reduce noise.
  • 17. Dimensionality Reduction Techniques. Principal Component Analysis, Singular Value Decomposition, and others: supervised and non-linear techniques.
  • 18. Feature Subset Selection. Another way to reduce the dimensionality of data. Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the purchase price of a product and the amount of sales tax paid. Irrelevant features contain no information that is useful for the data mining task at hand. Example: a student's ID is often irrelevant to the task of predicting the student's GPA.
  • 19. Feature Subset Selection Techniques. Brute-force approach: try all possible feature subsets as input to the data mining algorithm. Embedded approaches: feature selection occurs naturally as part of the data mining algorithm. Filter approaches: features are selected before the data mining algorithm is run. Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.
  • 20. Feature Creation. Create new attributes that can capture the important information in a data set much more efficiently than the original attributes. Three general methodologies: feature extraction (domain-specific), mapping data to a new space, and feature construction (combining features).
  • 21. Discretization and Binarization. Transforming a continuous attribute into a categorical attribute is known as discretization. Transforming both continuous and discrete attributes into one or more binary attributes is known as binarization.
  • 22. Discretization Techniques. Starting from the original data: equal-width discretization, equal-frequency discretization, K-means discretization.
  • 23. Attribute Transformation. A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values. Simple functions: x^k, log(x), e^x, |x|. Standardization and normalization.
  • 24. Similarity and Dissimilarity. Similarity: a numerical measure of how alike two data objects are; it is higher when objects are more alike and often falls in the range [0,1]. Dissimilarity: a numerical measure of how different two data objects are; it is lower when objects are more alike, its minimum is often 0, and its upper limit varies.
  • 25. Similarity/Dissimilarity for Simple Attributes. p and q are the attribute values for two data objects.
  • 26. Dissimilarities between Data Objects. Distances: Euclidean distance, Minkowski distance.
  • 27. Euclidean Distance. $d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q. Standardization is necessary if scales differ.
  • 28. Minkowski Distance. $d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$, where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.
  • 29. Similarities between Data Objects. Similarities also have some well-known properties: s(p, q) = 1 (or maximum similarity) only if p = q; s(p, q) = s(q, p) for all p and q (symmetry). Here s(p, q) is the similarity between points (data objects) p and q.
  • 30. Similarity Between Binary Vectors. A common situation is that objects p and q have only binary attributes. Compute similarities using the following quantities: M01 = the number of attributes where p was 0 and q was 1; M10 = the number of attributes where p was 1 and q was 0; M00 = the number of attributes where p was 0 and q was 0; M11 = the number of attributes where p was 1 and q was 1.
  • 31. Simple Matching and Jaccard Coefficients (contd.). SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00). J = number of 11 matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11).
  • 32. Cosine Similarity. If d1 and d2 are two document vectors, then cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||), where • indicates the vector dot product and ||d|| is the length of vector d.
  • 33. Example. d1 = 3 2 0 5 0 0 0 2 0 0; d2 = 1 0 0 0 0 0 0 1 0 2. d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5. ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481. ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449. cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150.
  • 34. Extended Jaccard Coefficient (Tanimoto). A variation of Jaccard for continuous or count attributes: $T(p, q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$. It reduces to Jaccard for binary attributes.
  • 35. Correlation. Correlation measures the linear relationship between objects. To compute correlation, we standardize the data objects p and q and then take their dot product.
  • 36. General Approach for Combining Similarities. Sometimes attributes are of many different types, but an overall similarity is needed.
  • 38. Using Weights to Combine Similarities. We may not want to treat all attributes the same. Use weights wk that are between 0 and 1 and sum to 1.
  • 39. Visit more self help tutorials Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and will not involve any additional support. Visit us at www.dataminingtools.net
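
Slides 21 and 22 name equal-width and equal-frequency discretization without showing how bins are assigned. The following is a minimal Python sketch of both, not the deck's own implementation. The function names, the sample values, and the choice of three bins are assumptions made for illustration; the equal-frequency variant assigns bins by rank, so tied values may land in different bins.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it to k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bin indices so that each bin holds roughly the same number of values."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


# Hypothetical sample with one outlier (100).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = equal_width_bins(sample, 3)
freq_bins = equal_frequency_bins(sample, 3)
print(width_bins)
print(freq_bins)
```

With an outlier in the data, equal-width binning puts almost every value into the first bin, while equal-frequency binning still spreads the values evenly. This is one reason to compare the two before choosing.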
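
Slides 27 and 28 describe the Euclidean and Minkowski distances in terms of n, r, pk and qk. As a hedged sketch, assuming the standard definition in which the Minkowski distance with r = 2 is the Euclidean distance and with r = 1 is the city-block distance, the computation could look like this in Python. The function name and the two sample points are illustrative.

```python
def minkowski_distance(p, q, r):
    """Minkowski distance between equal-length numeric sequences p and q."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1.0 / r)


# Hypothetical two-attribute objects.
p = [0, 2]
q = [3, 6]
city_block = minkowski_distance(p, q, 1)  # r = 1
euclidean = minkowski_distance(p, q, 2)   # r = 2
print(city_block, euclidean)
```

As slide 27 notes, attributes on different scales should be standardized before a distance like this is computed. Otherwise the attribute with the largest range dominates the result.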
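
The M01, M10, M00 and M11 counts from slide 30 and the SMC and Jaccard formulas from slide 31 can be sketched in Python as follows. The function names and the two sample vectors are assumptions made for illustration. Returning 0.0 for Jaccard when both vectors are all zeros is a convention chosen here, not something the slides specify.

```python
def match_counts(p, q):
    """Count attribute pairs by value: returns a dict with keys M00, M01, M10, M11."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    counts = {"M00": 0, "M01": 0, "M10": 0, "M11": 0}
    for a, b in zip(p, q):
        if a not in (0, 1) or b not in (0, 1):
            raise ValueError("attributes must be binary (0 or 1)")
        counts[f"M{a}{b}"] += 1
    return counts


def simple_matching(p, q):
    """SMC = (M11 + M00) / (M01 + M10 + M11 + M00)."""
    m = match_counts(p, q)
    return (m["M11"] + m["M00"]) / len(p)


def jaccard(p, q):
    """J = M11 / (M01 + M10 + M11); 0.0 by convention if that denominator is 0."""
    m = match_counts(p, q)
    denom = m["M01"] + m["M10"] + m["M11"]
    return m["M11"] / denom if denom else 0.0


# Hypothetical sparse binary vectors.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
counts = match_counts(p, q)
smc = simple_matching(p, q)
jac = jaccard(p, q)
print(counts, smc, jac)
```

The two measures disagree sharply on sparse data: the many shared zeros push SMC up, while Jaccard ignores them and reports no overlap.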
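
The worked cosine similarity example on slide 33 can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The function name is an assumption, and raising an error for a zero-length vector is a design choice made here rather than part of the slides.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for equal-length vectors."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# Document vectors from slide 33.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
cos_d1_d2 = cosine_similarity(d1, d2)
print(f"{cos_d1_d2:.4f}")  # prints 0.3150, matching the slide's result
```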