INTRODUCTION
1
 Extraction or ‘mining’ of knowledge from large amounts of data
 Also known as knowledge mining from data / knowledge extraction / data or
pattern analysis / data archaeology / data dredging
 Most popular – Knowledge Discovery from Data (KDD)
 Data available in huge amount -> Imminent need for turning into useful info
 Application – market analysis, fraud detection, customer retention, production
control, science exploration
2
 Data cleaning (remove noise and inconsistent data)
 Data integration (combine multiple data sources)
 Data selection (relevant data is retrieved from database)
 Data transformation (data is transformed or consolidated into forms appropriate for mining, e.g. by summary or aggregation operations)
 Data mining (extraction of data patterns)
 Pattern evaluation (identifying interesting patterns representing knowledge using
interestingness measures)
 Knowledge presentation (visualization and presentation of mined knowledge)
3
4
 Database, Data Warehouses, WWW, Information Repositories – one or a set of
databases, data warehouses or other information repositories. Data cleaning and
data integration are performed on this data.
 Database / Data Warehouse servers – responsible for fetching relevant data based
on user’s request
 Knowledge base – the domain knowledge that guides the search. Includes
concept hierarchies used to organize attributes, and user beliefs
 Data mining engine – consists of functional modules for tasks such as
characterization, association and correlation analysis, classification, prediction,
cluster analysis, outlier analysis.
 Pattern evaluation module – employs interestingness measures and interacts
with data mining modules to focus the search towards interesting patterns
 User interface – user specifies a data mining query or task, providing information
to help focus search and perform exploratory data mining based on intermediate
data mining results.
5
 Relational Databases
 Data Warehouses
 Transactional Databases
 Advanced Data and Information Systems and Advanced Applications
 Object-Relational Database
 Temporal Database/Sequence Database and Time-Series Database
 Spatial Databases and Spatiotemporal Databases
 Text Databases and Multimedia Databases
 Heterogeneous Databases and Legacy Databases
 Data Streams
 World Wide Web
6
 No coupling
 DM system does not utilize any function of DB/DW.
 Fetches data from source and stores result in different file
 Drawbacks
 Without a DB system, a DM system spends much of its time searching for, collecting and transforming data.
 The DM system cannot reuse the tested, scalable algorithms and data structures implemented in DB/DW systems
 The DM system needs other tools to extract data
 Loose coupling
 DM system uses some facilities of the DB system: it fetches data from the database, performs data
mining and stores the results in a file or in a designated place in the database
 Advantage
 Fetches data from the database using query processing, indexing
 Gains the flexibility and efficiency provided by the DB system.
 Disadvantage – mining does not exploit the data structures and query optimization methods of the DB system
7
 Semi-tight coupling
 DM system is linked to the DB system, and the DB system provides efficient implementations of a few essential data
mining primitives
 Includes sorting, indexing, aggregation, histogram analysis, pre-computation of
statistical measures like sum, count, max, min, standard deviation
 Enhances performance of DM system since some frequently used intermediate results are pre-computed
 Tight coupling
 DM system is smoothly integrated into DB system.
 Data mining queries and functions are optimized based on mining query analysis,
data structures, indexing schemes and query processing methods.
8
 Why preprocess the data?
 Incomplete (lacking attribute values)
 Noisy (containing errors or outliers)
 Inconsistent (containing discrepancies, e.g. in the department codes used to categorize items)
 Redundancy (repetition of the same data)
 Descriptive data summarization helps in the study of general characteristics of
the data and identifies the presence of noise or outliers, which is useful for
successful data cleaning and data integration.
 Measures of central tendency – mean, median, weighted arithmetic mean, mode
 Measure of data dispersion – quartiles, interquartile range, variance
9
 A distributive measure is a measure that can be computed for a given data set by
partitioning the data into smaller subsets, computing the measure for each subset
and then merging the result in order to arrive at the measure’s value for the
original dataset.
 An algebraic measure is a measure that can be computed by applying an algebraic
function to one or more distributive measures.
 A holistic measure is a measure that must be computed on the entire data set as a
whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset.
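The three kinds of measure can be contrasted on a small data set. The sketch below is only illustrative: the partitions and their values are made up, and the median stands in for a holistic measure.

```python
from statistics import median

# Hypothetical data set, already split into partitions
# (for example, chunks stored on different machines).
partitions = [[1, 2, 3], [4, 5, 100], [6, 7]]

# Distributive: compute the measure per partition, then merge the partial results.
total = sum(sum(p) for p in partitions)   # sum of partial sums
count = sum(len(p) for p in partitions)   # sum of partial counts

# Algebraic: an algebraic function of distributive measures.
mean = total / count

# Holistic: must look at the whole data set. Merging per-partition medians
# does not, in general, give the median of the full data.
all_values = [v for p in partitions for v in p]
true_median = median(all_values)
median_of_medians = median([median(p) for p in partitions])

print(total, count, mean)               # 128 8 16.0
print(true_median, median_of_medians)   # 4.5 5
```

Here the sum, count and mean are recovered exactly from the partitions, while the median of the partition medians (5) differs from the true median (4.5).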
10
 The degree to which the numerical data tend to spread is called dispersion or
variance of the data.
 The most common measures of dispersion are range, five-number summary, interquartile
range, standard deviation.
 For displaying the data summary and dispersion popular graphs include –
histograms, quantile plots, q-q plots, scatter plots, loess curves.
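As a rough illustration of these dispersion measures, the sketch below computes a five-number summary, the interquartile range and the standard deviation with Python's standard library. The data values are made up, and the "inclusive" quartile convention is an assumption; other conventions give slightly different quartiles.

```python
from statistics import quantiles, stdev

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

# Hypothetical sorted prices.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]

minimum, q1, med, q3, maximum = five_number_summary(data)
value_range = maximum - minimum   # range
iqr = q3 - q1                     # interquartile range
spread = stdev(data)              # sample standard deviation

print(minimum, q1, med, q3, maximum)   # 4 15.0 21.0 25.0 34
print(value_range, iqr)                # 30 10.0
```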
11
12
 Data cleaning attempts to fill in missing values, smooth out noise, identify outliers,
correct inconsistencies
 Missing values
 Ignore the tuple
 Fill the missing value manually
 Use a global constant to fill the missing value
 Use the attribute mean to fill the missing value
 Use the attribute mean for samples belonging to the same class as the given tuple
 Use the most probable value to fill the missing value
 Determined by regression, decision-tree induction or inference-based tools using a Bayesian formalism
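Two of the strategies above, filling with the overall attribute mean and filling with the mean of the tuple's own class, can be sketched as follows. The tuples, the class labels and the function names are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical tuples: (customer class, income in thousands); None = missing.
rows = [("low", 20), ("low", None), ("low", 30),
        ("high", 80), ("high", 100), ("high", None)]

def fill_with_attribute_mean(rows):
    """Replace missing values by the mean of all known values of the attribute."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace missing values by the mean of known values within the same class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

print(fill_with_attribute_mean(rows)[1])   # ('low', 57.5)
print(fill_with_class_mean(rows)[1])       # ('low', 25)
```

The class-based variant keeps the filled value closer to similar tuples, but it assumes every class has at least one known value.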
13
 Noisy data
 Binning
 Smooths a sorted data value by consulting its neighborhood (the values around it)
 Performs local smoothing
 Smoothing by bin means – each value in a bin is replaced by the mean value of the bin
 Smoothing by bin medians – each value in a bin is replaced by the bin median
 Smoothing by bin boundaries – the min and max values of a bin are the bin boundaries and each value in the
bin is replaced by the closest boundary
 Regression
 Smooths the data by fitting it to a function
 Linear regression finds the best line to fit two attributes, so that one can be used to predict the other
 Multiple linear regression involves more than two attributes
 Clustering
 Outliers are detected through clustering, where similar values are organized into clusters
 Values falling outside the set of clusters may be considered outliers
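The binning techniques listed above can be sketched on a small sorted list. This is a minimal illustration that assumes equal-frequency bins of three values; the prices and the function names are made up for the example, and ties in boundary smoothing go to the lower boundary.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]   # bins are sorted, so the ends are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```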
14
 Data integration
 Entity identification problem is the matching of equivalent real-world entities from multiple
data sources
 Correlation analysis measures how strongly one attribute implies the other
 Data transformation
 Smoothing – binning, regression, clustering
 Aggregation
 Generalization – low level data is replaced by higher level concept through the use of
concept hierarchy
 Normalization – data is scaled to fall within a small specified range
 Min-Max method
 Z-score normalization
 Decimal scaling
 Attribute construction
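The three normalization methods named above can be written as short functions. This is a sketch under simple assumptions: the income figures and the function names are illustrative, and the mean and standard deviation for the z-score are passed in rather than estimated.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Hypothetical income attribute: min 12,000, max 98,000, mean 54,000, std 16,000.
print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # ([-0.986, 0.917], 3)
```

Min-max scaling needs the attribute's range in advance and breaks if later values fall outside it, which is one reason z-score normalization is often preferred when the true minimum and maximum are unknown.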
15
 Data reduction is applied to obtain a reduced representation of the data set
 Data cube aggregation
 Attribute subset selection reduces the data size by removing irrelevant or redundant
attribute.
 Dimensionality reduction involves data encoding or transformation to obtain
compressed data. Lossy dimensionality reduction – wavelet transform, principal
component analysis
 Numerosity reduction
 Parametric methods use a model to estimate the data, e.g. log-linear models
 Nonparametric methods include histograms, clustering and sampling for storing a reduced
representation
 Discretization and concept hierarchy generation reduce the number of values for a given attribute
by dividing the range of the attribute into intervals.
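Discretization by intervals can be illustrated with equal-width binning, which is only one of several possible schemes and is chosen here as an assumption. The ages and the function name are made up.

```python
def equal_width_discretize(values, n_intervals):
    """Map each value to the index of the equal-width interval it falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    if width == 0:                 # all values identical: a single interval
        return [0] * len(values)
    # The maximum value would land on index n_intervals, so clamp it into the last one.
    return [min(int((v - lo) / width), n_intervals - 1) for v in values]

ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]
print(equal_width_discretize(ages, 3))   # [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

With three intervals of width 19, nine distinct ages are reduced to three interval labels, which could then be named (for example youth, middle-aged, senior) to form one level of a concept hierarchy.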
16
Data mining (DM) tasks are divided into two categories: descriptive and predictive
Descriptive mining tasks characterize the general properties of the data
Predictive mining tasks perform inference on the current data in order to make predictions
17
 Data characterization is summarization of the general characteristics or
features of the target class of data.
 Data corresponding to the user-specified class are typically collected by a database query
 Example: to study the characteristics of software products whose sales increased by 10%,
data related to the product is collected
 Data cube OLAP roll-up operation is used for data summarization
 Output is presented in the form of pie charts, histogram
 Data discrimination is comparison of the general features of target class data
objects with the general features of objects from one or a set of contrasting classes.
 Example: comparison of a product whose sales increased by 10% with that of a product
whose sales decreased by 30%
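The roll-up operation used for summarization can be sketched as an aggregation along a concept hierarchy. The cities, countries and sales figures below are hypothetical, and a plain dictionary stands in for a real data cube.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: city -> country.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

# Hypothetical facts: (city, sales amount).
sales = [("Vancouver", 100), ("Toronto", 150), ("Chicago", 200),
         ("New York", 250), ("Toronto", 50)]

def roll_up(facts, hierarchy):
    """Aggregate amounts from the lower level (city) to the higher level (country)."""
    totals = defaultdict(int)
    for key, amount in facts:
        totals[hierarchy[key]] += amount
    return dict(totals)

print(roll_up(sales, city_to_country))   # {'Canada': 300, 'USA': 450}
```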
18
 Classification is the process of finding a model that describes or distinguishes data
classes or concepts for the purpose of being able to use the model to predict the
class of object whose class label is unknown.
 Classifying loan as ‘safe’ or ‘risky’
 Given a customer profile, guess whether he will buy a new computer
 Decision tree induction
 Bayesian classification
 Rule-based classification
 Classification by backpropagation
 Support vector machines
 Classification by association rule analysis
19
 Prediction models continuous-valued functions. Numeric prediction is the task of
predicting continuous values for a given input.
 Regression analysis is a statistical methodology that is often used for numerical
prediction
 Linear/straight-line regression involves a response variable, y and a single
predictor variable, x. It models y as a function of x. [y=b+wx]
 Multiple linear regression extends straight-line regression to model more than
one predictor variable
 Nonlinear regression, e.g. polynomial regression, adds polynomial terms to the basic linear model
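For the straight-line model y = b + wx, the coefficients can be estimated by the method of least squares: w = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b = ȳ − w·x̄. The sketch below applies these formulas to made-up data; the function name is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the model y = b + w * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w = s_xy / s_xx            # slope
    b = y_bar - w * x_bar      # intercept
    return w, b

# Hypothetical training data lying exactly on y = 1 + 2x.
w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(w, b)          # 2.0 1.0
print(b + w * 5)     # prediction for x = 5: 11.0
```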
20
 The process of grouping a set of physical or abstract objects into classes of similar
objects is called clustering.
 A cluster is a collection of data objects that are similar to one another within the
same cluster and are dissimilar to the objects in other clusters.
 Class labels are not present in training data because they are not known to begin
with. Clustering is used to generate such labels
 Applications: taxonomy (organization of observations into hierarchy of classes that
group similar events together)
21
 Partitioning method
 Partitioning method creates k partitions of a database of n objects or data tuples (k ≤ n)
 Requirements
 Each group must contain at least one object
 Each object must belong to exactly one group
 Objects in the same cluster are close or related to each other whereas objects of different
clusters are far apart or very different
 k-means algorithm where each cluster is represented by the mean value of the objects
 k-medoids algorithm where each cluster is represented by one of the objects located near
the center of the cluster.
 works well for small to medium databases
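A minimal k-means sketch is shown below. It assumes two-dimensional points, Euclidean distance, and a naive initialization that takes the first k points as starting centroids; practical implementations usually choose the initial centroids more carefully.

```python
from math import dist

def k_means(points, k, iterations=100):
    """Return (centroids, clusters) after iterating assignment and update steps."""
    centroids = list(points[:k])   # naive deterministic initialization (assumption)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(clusters[0])   # [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
print(clusters[1])   # [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
```

Every object ends up in exactly one group. An empty cluster keeps its previous centroid, so this sketch does not strictly enforce the requirement that each group contains at least one object.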
22
 Hierarchical method
 Creates a hierarchical decomposition of the given set of data objects.
 Classified according to how the hierarchical decomposition is formed
 Agglomerative/Bottom-up approach starts with each object in its own group and merges objects or groups that are close to one another, until
all the groups are merged into one or a termination condition holds
 Divisive/Top-down approach starts with all of the objects in the same cluster. It splits them into
smaller clusters until eventually each object is in its own cluster or a termination condition holds
 Density-based method
 Can easily determine clusters of arbitrary shape
 Used to filter out noise
 Grid based method
 Quantize the object space into a finite number of cells that form a grid structure.
 Faster processing
 Model based clustering
 Hypothesizes a model for each cluster and finds the best fit of the data to the given
model
 Locates cluster by constructing a density function that reflects spatial distribution of
data
 Automatically determines the number of clusters based on standard statistics
 Example: self organizing maps
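Of the methods above, the agglomerative (bottom-up) hierarchical approach is the easiest to sketch in a few lines. The version below assumes one-dimensional values and single-linkage distance (the distance between the closest members of two clusters), and it stops at a requested number of clusters as its termination condition; all of these are illustrative choices.

```python
def agglomerative(points, target_clusters):
    """Repeatedly merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]   # start with every object in its own cluster
    while len(clusters) > target_clusters:
        best = None                    # (distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

values = [1, 2, 9, 10, 25]
print(agglomerative(values, 3))   # [[1, 2], [9, 10], [25]]
print(agglomerative(values, 2))   # [[1, 2, 9, 10], [25]]
```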
23
 Clustering high dimensional data
 Examines objects having a large number of features (dimensions)
 Subspace clustering methods search for clusters in subspaces of the data
 Frequent pattern based clustering extracts distinct frequent patterns among subsets of
dimensions that occur frequently
 Constraint based clustering
 Performs clustering by incorporating user-specified constraints
 A constraint expresses a user’s expectations or desired results
 Example: spatial clustering with the existence of obstacles and clustering under
user-specified constraints
24
 Outliers are data that do not comply with the general behavior or model of the data
 They are discarded by most data mining applications. However, in applications like
fraud detection, they are worth noting. Example: detecting fraudulent usage of credit cards by
finding purchases of extremely large amounts on a given day
 Outliers may be detected by using a statistical test that assumes a probability model for the data, or by using
distance measures, where objects that are a substantial distance from any
cluster are considered outliers.
 Evolution analysis describes and models regularities or trends for objects whose
behavior changes over time.
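A statistical test for outliers can be sketched with z-scores, which assumes the data are roughly normally distributed. The purchase amounts, the threshold of 3 standard deviations and the function name are all illustrative assumptions.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    m, s = mean(values), pstdev(values)
    if s == 0:           # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - m) / s > threshold]

# Hypothetical daily credit card purchases, with one extremely large amount.
purchases = [20, 25, 22, 30, 27, 24, 21, 26, 23, 28, 29, 25, 5000]
print(z_score_outliers(purchases))   # [5000]
```

Because a single extreme value also inflates the mean and standard deviation, this test can miss outliers in very small samples; distance-based or cluster-based checks are alternatives in that case.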
25
 Stream data is massive, temporally ordered, fast changing and potentially
infinite.
 Stream data flow in and out of a computer system continuously and with varying
update rates.
 Examples – real-time surveillance system, communication network, internet
traffic, on-line transactions in financial markets or retail industry, electric power
grids, industry production process and other dynamic environments.
 It may be impossible to store an entire data stream. Moreover, it tends to be of rather
low level of abstraction.
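Since the stream cannot be stored, summaries have to be maintained in a single pass. As one illustration (the slides do not name a specific technique), the sketch below uses Welford's online algorithm to keep a running mean and variance in constant memory; the class name and the sample values are assumptions.

```python
class RunningStats:
    """Single-pass mean and population variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new stream element into the summary, then forget it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:   # stands in for an unbounded stream
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept no matter how many elements have arrived, which is the property that matters for potentially infinite streams.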
26
 Mining time-series data
 A time-series database consists of sequences of values or events obtained over repeated measurements
of time.
 Time-series databases are popular in stock-market analysis, economic and sales
forecasting, budgetary analysis, utility studies, yield studies, work-load projections,
observation of natural phenomena
 Mining sequence patterns
 A sequence database consists of sequences of ordered elements or events, recorded with or
without a concrete notion of time. Sequential pattern mining is the discovery of
frequently occurring ordered events or subsequences as patterns.
 Applications include customer shopping sequence, web clickstream, biological sequences,
sequences of events in science and engineering.
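The core step of sequential pattern mining is counting how many sequences contain a candidate pattern as an ordered (not necessarily contiguous) subsequence. The sketch below simplifies each element to a single event; the clickstream data and the function names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's events occur in the sequence in the same order."""
    remaining = iter(sequence)
    # `event in remaining` consumes the iterator up to the match,
    # so later events must appear after earlier ones.
    return all(event in remaining for event in pattern)

def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

# Hypothetical web clickstreams, one sequence per visitor.
clicks = [
    ["home", "search", "cart", "checkout"],
    ["home", "cart"],
    ["search", "home", "cart", "checkout"],
    ["home", "search"],
]

print(support(["home", "cart"], clicks))         # 0.75
print(support(["search", "checkout"], clicks))   # 0.5
```

A pattern is reported as frequent when its support reaches a user-chosen minimum support threshold.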
27

Introduction to data mining

  • 2.
      • Data mining is the extraction, or "mining", of knowledge from large amounts of data
      • Also known as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging
      • Most popular synonym – Knowledge Discovery from Data (KDD)
      • Data is available in huge amounts, so there is an imminent need to turn it into useful information
      • Applications – market analysis, fraud detection, customer retention, production control, science exploration
  • 3.
      1. Data cleaning (remove noise and inconsistent data)
      2. Data integration (combine multiple data sources)
      3. Data selection (data relevant to the analysis task is retrieved from the database)
      4. Data transformation (data is transformed or consolidated into forms appropriate for mining, for example by summary or aggregation operations)
      5. Data mining (extraction of data patterns)
      6. Pattern evaluation (identifying the interesting patterns that represent knowledge, using interestingness measures)
      7. Knowledge presentation (visualization and presentation of the mined knowledge)
  • 5.
      • Database, data warehouse, WWW, or other information repository – one or a set of databases, data warehouses, or other information repositories. Data cleaning and data integration are performed on the data.
      • Database / data warehouse server – responsible for fetching the relevant data based on the user's request
      • Knowledge base – the domain knowledge that guides the search. Includes concept hierarchies used to organize attributes, and user beliefs
      • Data mining engine – consists of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, and outlier analysis
      • Pattern evaluation module – employs interestingness measures and interacts with the data mining modules to focus the search towards interesting patterns
      • User interface – the user specifies a data mining query or task, provides information to help focus the search, and performs exploratory data mining based on intermediate data mining results
  • 6.
      • Relational Databases
      • Data Warehouses
      • Transactional Databases
      • Advanced Data and Information Systems and Advanced Applications
          • Object-Relational Databases
          • Temporal Databases, Sequence Databases, and Time-Series Databases
          • Spatial Databases and Spatiotemporal Databases
          • Text Databases and Multimedia Databases
          • Heterogeneous Databases and Legacy Databases
          • Data Streams
          • World Wide Web
  • 7.
      • No coupling
          • The DM system does not use any function of the DB/DW system.
          • It fetches data from a source and stores the results in a different file.
          • Drawbacks
              • Without a DB system, the DM system spends considerable time searching for, collecting, and transforming data.
              • The DM system cannot reuse the tested, scalable algorithms and data structures that DB/DW systems already implement.
              • The DM system needs another tool to extract data.
      • Loose coupling
          • The DM system uses some facilities of the DB system: fetching data, performing data mining, and storing the results in a file or in a designated place in the database.
          • Advantages
              • Data is fetched from the database using query processing and indexing.
              • Inherits the flexibility and efficiency of the DB system.
          • Disadvantage – mining does not exploit the data structures and query optimization methods of the DB system.
  • 8.
      • Semi-tight coupling
          • The DM system is linked to the DB system, and the DB system provides efficient implementations of a few essential data mining primitives.
          • These include sorting, indexing, aggregation, histogram analysis, and pre-computation of statistical measures such as sum, count, min, max, and standard deviation.
          • Enhances the performance of the DM system, since some frequently used results are pre-computed.
      • Tight coupling
          • The DM system is smoothly integrated into the DB system.
          • Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods.
  • 9.
      • Why preprocess the data? Real-world data tends to be:
          • Incomplete (lacking attribute values)
          • Noisy (containing errors or outliers)
          • Inconsistent (for example, containing discrepancies in the department codes used to categorize items)
          • Redundant (repeating the same data)
      • Descriptive data summarization helps in studying the general characteristics of the data and identifies the presence of noise or outliers, which is useful for successful data cleaning and data integration.
      • Measures of central tendency – mean, weighted arithmetic mean, median, mode
      • Measures of data dispersion – quartiles, interquartile range, variance
  • 10.
      • A distributive measure is a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original data set.
      • An algebraic measure is a measure that can be computed by applying an algebraic function to one or more distributive measures.
      • A holistic measure is a measure that must be computed on the entire data set as a whole. It cannot be computed by partitioning the given data into subsets and merging the values obtained for the measure in each subset.
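The three kinds of measure can be illustrated with a small sketch. The data values and the `partition` helper are made up for illustration; sum and count stand in for distributive measures, the mean for an algebraic one, and the median for a holistic one.

```python
from statistics import median


def partition(data, k):
    """Split data into k roughly equal contiguous chunks (hypothetical helper)."""
    size = -(-len(data) // k)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


data = [4, 8, 15, 16, 23, 42, 7, 1, 99]
chunks = partition(data, 3)

# Distributive: per-chunk sums and counts merge into the global values.
total = sum(sum(c) for c in chunks)
count = sum(len(c) for c in chunks)

# Algebraic: the mean is an algebraic function of two distributive measures.
mean = total / count

# Holistic: merging per-chunk medians does not, in general, give the median.
median_of_medians = median(median(c) for c in chunks)
true_median = median(data)
```

On this data the merged per-chunk medians and the true median disagree, which is why a holistic measure needs the whole data set.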
  • 11.
      • The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
      • The most common measures of dispersion are the range, the five-number summary, the interquartile range, and the standard deviation.
      • Popular graphs for displaying data summaries and dispersion include histograms, quantile plots, q-q plots, scatter plots, and loess curves.
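A minimal sketch of the dispersion measures listed above, using only the Python standard library. The sample values are invented, the "inclusive" quartile method is one of several conventions, and the 1.5 × IQR fence is a common rule of thumb that the slides do not state.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles use the 'inclusive' method."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


values = [2, 4, 4, 5, 7, 9, 10, 12, 30]
summary = five_number_summary(values)
low, q1, med, q3, high = summary

value_range = high - low
iqr = q3 - q1

# Assumption: flag values more than 1.5 * IQR outside the quartiles.
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```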
  • 13.
      • Data cleaning attempts to fill in missing values, smooth out noise, identify outliers, and correct inconsistencies.
      • Missing values
          • Ignore the tuple
          • Fill in the missing value manually
          • Use a global constant to fill in the missing value
          • Use the attribute mean to fill in the missing value
          • Use the attribute mean for all samples belonging to the same class as the given tuple
          • Use the most probable value to fill in the missing value
              • Determined with regression, decision-tree induction, or a Bayesian formalism
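Two of the strategies above, filling with the attribute mean and filling with the class-conditional attribute mean, can be sketched as follows. The records, the attribute names `cls` and `income`, and both helper functions are hypothetical; the sketch assumes every class has at least one known value.

```python
from statistics import mean

rows = [
    {"cls": "safe", "income": 50.0},
    {"cls": "safe", "income": None},
    {"cls": "risky", "income": 20.0},
    {"cls": "risky", "income": 30.0},
    {"cls": "risky", "income": None},
]


def fill_with_attribute_mean(rows, attr):
    """Replace missing values of attr with the mean of all known values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean within the same class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    means = {c: mean(v) for c, v in by_class.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]


filled_attr = fill_with_attribute_mean(rows, "income")
filled_class = fill_with_class_mean(rows, "income", "cls")
```

The class-conditional version fills the two gaps with different values (50.0 for "safe", 25.0 for "risky"), whereas the plain attribute mean uses one value for both.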
  • 14.
      • Noisy data
          • Binning
              • Consults the neighboring values of each sorted data value
              • Performs local smoothing
              • Smoothing by bin means – each value in a bin is replaced by the mean value of the bin
              • Smoothing by bin medians – each value in a bin is replaced by the bin median
              • Smoothing by bin boundaries – the minimum and maximum values of a bin are the bin boundaries, and each value in the bin is replaced by the closest boundary
          • Regression
              • Smooths the data by fitting it to a function
              • Linear regression finds the best line to fit two attributes
              • Multiple regression involves more than two attributes
          • Clustering
              • Outliers are detected through clustering, where similar values are organized into clusters
              • Values that fall outside the set of clusters are considered outliers
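The binning variants above can be sketched as follows, assuming equal-frequency bins over the sorted values. The data, the function names, and the rule that ties go to the lower boundary are illustrative choices, not something the slides specify.

```python
def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```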
  • 15.
      • Data integration
          • The entity identification problem is the matching of equivalent real-world entities from multiple data sources
          • Correlation analysis measures how strongly one attribute implies the other
      • Data transformation
          • Smoothing – binning, regression, clustering
          • Aggregation
          • Generalization – low-level data is replaced by higher-level concepts through the use of concept hierarchies
          • Normalization – data is scaled to fall within a small, specified range
              • Min-max normalization
              • Z-score normalization
              • Normalization by decimal scaling
          • Attribute construction
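The three normalization methods can be sketched with their usual textbook definitions: min-max maps the observed range linearly onto a target range, z-score subtracts the mean and divides by the standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The function names and sample values are illustrative, and the sketch assumes the values are not all identical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j for the smallest j that makes every |v| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [200.0, 300.0, 400.0, 600.0, 1000.0]
scaled = min_max(incomes)
standardized = z_score(incomes)
shifted = decimal_scaling(incomes)
```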
  • 16.
      • Data reduction is applied to obtain a reduced representation of the data set
      • Data cube aggregation
      • Attribute subset selection reduces the data size by removing irrelevant or redundant attributes.
      • Dimensionality reduction involves data encoding or transformation to obtain compressed data. Lossy dimensionality reduction methods include wavelet transforms and principal component analysis.
      • Numerosity reduction
          • Parametric methods use a model to estimate the data, e.g. log-linear models
          • Nonparametric methods for storing a reduced representation include histograms, clustering, and sampling
      • Discretization and concept hierarchy generation reduce the number of values for a given attribute by dividing the range of the attribute into intervals.
  • 17.
      • DM tasks are divided into two categories: descriptive and predictive
      • Descriptive mining tasks characterize the general properties of the data
      • Predictive mining tasks perform inference on the current data in order to make predictions
  • 18.
      • Data characterization is a summarization of the general characteristics or features of a target class of data.
          • Data corresponding to the user-specified class are typically collected by a database query
          • Example: to study the characteristics of software products whose sales increased by 10%, data related to those products is collected
          • The data cube OLAP roll-up operation is used for data summarization
          • Output is presented in forms such as pie charts and histograms
      • Data discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
          • Example: comparison of products whose sales increased by 10% with products whose sales decreased by 30%
  • 19.
      • Classification is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.
      • Examples
          • Classifying a loan as 'safe' or 'risky'
          • Given a customer profile, predicting whether the customer will buy a new computer
      • Methods
          • Decision tree induction
          • Bayesian classification
          • Rule-based classification
          • Classification by backpropagation
          • Support vector machines
          • Classification by association rule analysis
  • 20.
      • Prediction models continuous-valued functions. Numeric prediction is the task of predicting continuous values for a given input.
      • Regression analysis is a statistical methodology that is often used for numeric prediction
      • Linear (straight-line) regression involves a response variable, y, and a single predictor variable, x. It models y as a linear function of x: y = b + wx
      • Multiple linear regression extends straight-line regression to model more than one predictor variable
      • Nonlinear relationships can be modeled by adding polynomial terms to the basic linear model
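A minimal sketch of fitting the straight-line model y = b + wx by ordinary least squares, a standard way to estimate b and w that the slides do not spell out. The function name `fit_line` and the data points are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b, w) for the model y = b + w * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    w = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    # Intercept: the fitted line passes through the point of means.
    b = y_bar - w * x_bar
    return b, w


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
b, w = fit_line(xs, ys)

# Numeric prediction for an unseen input.
predicted = b + w * 6.0
```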
  • 21.
      • The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
      • A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
      • Class labels are not present in the training data because they are not known to begin with. Clustering can be used to generate such labels.
      • Applications: taxonomy formation (organization of observations into a hierarchy of classes that group similar events together)
  • 22.
      • Partitioning method
          • A partitioning method creates k partitions of a database of n objects or data tuples
          • Requirements
              • Each group must contain at least one object
              • Each object must belong to exactly one group
          • Objects in the same cluster are close or related to each other, whereas objects in different clusters are far apart or very different
          • k-means algorithm – each cluster is represented by the mean value of the objects in the cluster
          • k-medoids algorithm – each cluster is represented by one of the objects located near the center of the cluster
          • Works well for small to medium-sized databases
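A minimal k-means sketch under stated assumptions: squared Euclidean distance, the first k points as initial centers (real implementations usually pick seeds randomly), and an empty cluster simply keeping its previous center. The function names and points are illustrative.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100):
    """Return (centers, clusters) after alternating assignment and update steps."""
    centers = [points[i] for i in range(k)]  # assumption: first k points as seeds
    clusters = []
    for _ in range(iterations):
        # Assignment step: each object joins exactly one cluster, the nearest.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each cluster is represented by the mean of its objects.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return centers, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, k=2)
```

A k-medoids variant would replace the update step with a choice of the cluster member that minimizes the total distance to the other members.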
  • 23.
      • Hierarchical method
          • Creates a hierarchical decomposition of the given set of data objects.
          • Classified by how the hierarchical decomposition is formed:
              • The agglomerative (bottom-up) approach starts with each object in its own group and merges the objects or groups that are close to one another, until all the groups are merged into one
              • The divisive (top-down) approach starts with all of the objects in the same cluster and splits it into smaller clusters until eventually each object is in its own cluster
      • Density-based method
          • Can discover clusters of arbitrary shape
          • Used to filter out noise
      • Grid-based method
          • Quantizes the object space into a finite number of cells that form a grid structure
          • Fast processing time
      • Model-based clustering
          • Hypothesizes a model for each cluster and finds the best fit of the data to the given model
          • Locates clusters by constructing a density function that reflects the spatial distribution of the data
          • Automatically determines the number of clusters based on standard statistics
          • Example: self-organizing maps
  • 24.
      • Clustering high-dimensional data
          • Examines objects that have a large number of features
          • Subspace clustering methods search for clusters in subspaces of the data
          • Frequent-pattern-based clustering extracts distinct frequent patterns among subsets of dimensions that occur frequently
      • Constraint-based clustering
          • Performs clustering by incorporating user-specified constraints
          • A constraint expresses a user's expectations or desired results
          • Examples: spatial clustering in the presence of obstacles, and clustering under user-specified constraints
  • 25.
      • Outliers are data objects that do not comply with the general behavior or model of the data
      • They are discarded by most data mining applications. However, in applications such as fraud detection they are worth noting. Example: uncovering fraudulent usage of credit cards by detecting purchases of extremely large amounts on a given day.
      • Outliers may be detected using a statistical test that assumes a probability model for the data, or using distance measures, where objects that are a substantial distance from any cluster are considered outliers.
      • Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
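The distance-based idea can be sketched in one dimension: a value is flagged when too few other values lie within a given radius of it. The function name, the purchase amounts, and the `radius` and `min_neighbors` settings are all hypothetical; choosing them well depends on the data.

```python
def distance_outliers(values, radius, min_neighbors):
    """Flag values with fewer than `min_neighbors` other values within `radius`."""
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, u in enumerate(values) if j != i and abs(u - v) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers


# Hypothetical daily credit-card purchase amounts for one account.
purchases = [25.0, 40.0, 31.0, 18.0, 52.0, 37.0, 29.0, 4500.0]
flagged = distance_outliers(purchases, radius=100.0, min_neighbors=1)
```

This brute-force version compares every pair of values, so it is only suitable for small data sets.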
  • 26.
      • Stream data is massive, temporally ordered, fast changing, and potentially infinite.
      • Stream data flows in and out of a computer system continuously and with varying update rates.
      • Examples – real-time surveillance systems, communication networks, Internet traffic, on-line transactions in financial markets or the retail industry, electric power grids, industrial production processes, and other dynamic environments.
      • It may be impossible to store an entire data stream. Moreover, stream data tends to be at a rather low level of abstraction.
  • 27.
      • Mining time-series data
          • A time-series database consists of sequences of values obtained over repeated measurements of time.
          • Time-series databases are popular in stock-market analysis, economic and sales forecasting, budgetary analysis, utility studies, yield studies, workload projections, and the observation of natural phenomena.
      • Mining sequence patterns
          • A sequence database consists of sequences of ordered elements or events, recorded with or without a concrete notion of time. Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns.
          • Applications include customer shopping sequences, web clickstreams, biological sequences, and sequences of events in science and engineering.

Editor's Notes

  1. Steps 1-4 are different forms of data preprocessing
  2. From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP).