SlideShare a Scribd company logo
1 of 88
© Prentice Hall 1
Data Mining Outline
 PART I
– Introduction
– Related Concepts
– Data Mining Techniques
 PART II
– Classification
– Clustering
– Association Rules
 PART III
– Web Mining
– Spatial Mining
– Temporal Mining
© Prentice Hall 2
Introduction Outline
 Define data mining
 Data mining vs. databases
 Basic data mining tasks
 Data mining development
 Data mining issues
Goal: Provide an overview of data mining.
© Prentice Hall 3
Introduction
 Data is growing at a phenomenal rate
 Users expect more sophisticated
information
 How?
UNCOVER HIDDEN INFORMATION
DATA MINING
© Prentice Hall 4
Data Mining Definition
 Finding hidden information in a
database
 Fit data to a model
 Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
© Prentice Hall 5
Data Mining Algorithm
 Objective: Fit Data to a Model
– Descriptive
– Predictive
 Preference – Technique to choose the
best model
 Search – Technique to search the data
– “Query”
© Prentice Hall 6
Database Processing vs. Data
Mining Processing
 Query
– Well defined
– SQL
 Query
– Poorly defined
– No precise query language
 Data
– Operational data
 Output
– Precise
– Subset of database
 Data
– Not operational data
 Output
– Fuzzy
– Not a subset of database
© Prentice Hall 7
Query Examples
 Database
 Data Mining
– Find all customers who have purchased milk
– Find all items which are frequently purchased
with milk. (association rules)
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
© Prentice Hall 8
Data Mining Models and Tasks
© Prentice Hall 9
Basic Data Mining Tasks
 Classification maps data into predefined
groups or classes
– Supervised learning
– Pattern recognition
– Prediction
 Regression is used to map a data item to a
real valued prediction variable.
 Clustering groups similar data together into
clusters.
– Unsupervised learning
– Segmentation
– Partitioning
© Prentice Hall 10
Basic Data Mining Tasks
(cont’d)
 Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
 Link Analysis uncovers relationships among
data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential
patterns.
© Prentice Hall 11
Ex: Time Series Analysis
 Example: Stock Market
 Predict future values
 Determine similar patterns over time
 Classify behavior
© Prentice Hall 12
Data Mining vs. KDD
 Knowledge Discovery in Databases
(KDD): process of finding useful
information and patterns in data.
 Data Mining: Use of algorithms to
extract the information and patterns
derived by the KDD process.
© Prentice Hall 13
KDD Process
 Selection: Obtain data from various sources.
 Preprocessing: Cleanse data.
 Transformation: Convert to common format.
Transform to new format.
 Data Mining: Obtain desired results.
 Interpretation/Evaluation: Present results
to user in meaningful manner.
Modified from [FPSS96C]
© Prentice Hall 14
KDD Process Ex: Web Log
 Selection:
– Select log data (dates and locations) to use
 Preprocessing:
– Remove identifying URLs
– Remove error logs
 Transformation:
– Sessionize (sort and group)
 Data Mining:
– Identify and count patterns
– Construct data structure
 Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
 Potential User Applications:
– Cache prediction
– Personalization
© Prentice Hall 15
Data Mining Development
•Similarity Measures
•Hierarchical Clustering
•IR Systems
•Imprecise Queries
•Textual Data
•Web Search Engines
•Bayes Theorem
•Regression Analysis
•EM Algorithm
•K-Means Clustering
•Time Series Analysis
•Neural Networks
•Decision Tree Algorithms
•Algorithm Design Techniques
•Algorithm Analysis
•Data Structures
•Relational Data Model
•SQL
•Association Rule Algorithms
•Data Warehousing
•Scalability Techniques
© Prentice Hall 16
KDD Issues
 Human Interaction
 Overfitting
 Outliers
 Interpretation
 Visualization
 Large Datasets
 High Dimensionality
© Prentice Hall 17
KDD Issues (cont’d)
 Multimedia Data
 Missing Data
 Irrelevant Data
 Noisy Data
 Changing Data
 Integration
 Application
© Prentice Hall 18
Social Implications of DM
 Privacy
 Profiling
 Unauthorized use
© Prentice Hall 19
Data Mining Metrics
 Usefulness
 Return on Investment (ROI)
 Accuracy
 Space/Time
© Prentice Hall 20
Database Perspective on Data
Mining
 Scalability
 Real World Data
 Updates
 Ease of Use
© Prentice Hall 21
Visualization Techniques
 Graphical
 Geometric
 Icon-based
 Pixel-based
 Hierarchical
 Hybrid
© Prentice Hall 22
Related Concepts Outline
 Database/OLTP Systems
 Fuzzy Sets and Logic
 Information Retrieval(Web Search Engines)
 Dimensional Modeling
 Data Warehousing
 OLAP/DSS
 Statistics
 Machine Learning
 Pattern Matching
Goal: Examine some areas which are related to
data mining.
© Prentice Hall 23
DB & OLTP Systems
 Schema
– (ID,Name,Address,Salary,JobNo)
 Data Model
– ER
– Relational
 Transaction
 Query:
SELECT Name
FROM T
WHERE Salary > 100000
DM: Only imprecise queries
© Prentice Hall 24
Fuzzy Sets and Logic
 Fuzzy Set: Set membership function is a real valued
function with output in the range [0,1].
 f(x): Probability x is in F.
 1-f(x): Probability x is not in F.
 EX:
– T = {x | x is a person and x is tall}
– Let f(x) be the probability that x is tall
– Here f is the membership function
DM: Prediction and classification are fuzzy.
© Prentice Hall 25
Fuzzy Sets
© Prentice Hall 26
Classification/Prediction is
Fuzzy
Loan
Amnt
Simple Fuzzy
Accept Accept
Reject
Reject
© Prentice Hall 27
Information Retrieval
 Information Retrieval (IR): retrieving desired
information from textual data.
 Library Science
 Digital Libraries
 Web Search Engines
 Traditionally keyword based
 Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
© Prentice Hall 28
Information Retrieval (cont’d)
 Similarity: measure of how close a
query is to a document.
 Documents which are “close enough”
are retrieved.
 Metrics:
–Precision = |Relevant and Retrieved|
|Retrieved|
–Recall = |Relevant and Retrieved|
|Relevant|
© Prentice Hall 29
IR Query Result Measures
and Classification
IR Classification
© Prentice Hall 30
Dimensional Modeling
 View data in a hierarchical manner more as
business executives might
 Useful in decision support systems and mining
 Dimension: collection of logically related
attributes; axis for modeling data.
 Facts: data stored
 Ex: Dimensions – products, locations, date
Facts – quantity, unit price
DM: May view data as dimensional.
© Prentice Hall 31
Relational View of Data
ProdID LocID Date Quantity UnitPrice
123 Dallas 022900 5 25
123 Houston 020100 10 20
150 Dallas 031500 1 100
150 Dallas 031500 5 95
150 Fort
Worth
021000 5 80
150 Chicago 012000 20 75
200 Seattle 030100 5 50
300 Rochester 021500 200 5
500 Bradenton 022000 15 20
500 Chicago 012000 10 25
1
© Prentice Hall 32
Dimensional Modeling Queries
 Roll Up: more general dimension
 Drill Down: more specific dimension
 Dimension (Aggregation) Hierarchy
 SQL uses aggregation
 Decision Support Systems (DSS):
Computer systems and tools to assist
managers in making decisions and
solving problems.
© Prentice Hall 33
Cube view of Data
© Prentice Hall 34
Aggregation Hierarchies
© Prentice Hall 35
Star Schema
© Prentice Hall 36
Data Warehousing
 “Subject-oriented, integrated, time-variant, nonvolatile”
William Inmon
 Operational Data: Data used in day to day needs of
company.
 Informational Data: Supports other functions such as
planning and forecasting.
 Data mining tools often access data warehouses rather
than operational data.
DM: May access data in warehouse.
© Prentice Hall 37
Operational vs. Informational
Operational Data Data Warehouse
Application OLTP OLAP
Use Precise Queries Ad Hoc
Temporal Snapshot Historical
Modification Dynamic Static
Orientation Application Business
Data Operational Values Integrated
Size Gigabits Terabits
Level Detailed Summarized
Access Often Less Often
Response Few Seconds Minutes
Data Schema Relational Star/Snowflake
© Prentice Hall 38
OLAP
 Online Analytic Processing (OLAP): provides more
complex queries than OLTP.
 OnLine Transaction Processing (OLTP): traditional
database/transaction processing.
 Dimensional data; cube view
 Visualization of operations:
– Slice: examine sub-cube.
– Dice: rotate cube to look at another dimension.
– Roll Up/Drill Down
DM: May use OLAP queries.
© Prentice Hall 39
OLAP Operations
Single Cell Multiple Cells Slice Dice
Roll Up
Drill Down
© Prentice Hall 40
Statistics
 Simple descriptive models
 Statistical inference: generalizing a model
created from a sample of the data to the entire
dataset.
 Exploratory Data Analysis:
– Data can actually drive the creation of the
model
– Opposite of traditional statistical view.
 Data mining targeted to business user
DM: Many data mining methods come
from statistical techniques.
© Prentice Hall 41
Machine Learning
 Machine Learning: area of AI that examines how to
write programs that can learn.
 Often used in classification and prediction
 Supervised Learning: learns by example.
 Unsupervised Learning: learns without knowledge of
correct answers.
 Machine learning often deals with small static datasets.
DM: Uses many machine learning
techniques.
© Prentice Hall 42
Pattern Matching
(Recognition)
 Pattern Matching: finds occurrences of
a predefined pattern in the data.
 Applications include speech recognition,
information retrieval, time series
analysis.
DM: Type of classification.
© Prentice Hall 43
DM vs. Related Topics
Area Query Data Results Output
DB/OLTP Precise Database Precise DB Objects
or
Aggregation
IR Precise Documents Vague Documents
OLAP Analysis Multidimensional Precise DB Objects
or
Aggregation
DM Vague Preprocessed Vague KDD
Objects
© Prentice Hall 44
Data Mining Techniques Outline
 Statistical
– Point Estimation
– Models Based on Summarization
– Bayes Theorem
– Hypothesis Testing
– Regression and Correlation
 Similarity Measures
 Decision Trees
 Neural Networks
– Activation Functions
 Genetic Algorithms
Goal: Provide an overview of basic data
mining techniques
© Prentice Hall 45
Point Estimation
 Point Estimate: estimate a population
parameter.
 May be made by calculating the parameter for a
sample.
 May be used to predict value for missing data.
 Ex:
– R contains 100 employees
– 99 have salary information
– Mean salary of these is $50,000
– Use $50,000 as value of remaining employee’s
salary.
Is this a good idea?
© Prentice Hall 46
Estimation Error
 Bias: Difference between expected value and
actual value.
 Mean Squared Error (MSE): expected value
of the squared difference between the
estimate and the actual value:
 Why square?
 Root Mean Square Error (RMSE)
© Prentice Hall 47
Jackknife Estimate
 Jackknife Estimate: estimate of parameter
is obtained by omitting one value from the set
of observed values.
 Ex: estimate of mean for X={x1, … , xn}
© Prentice Hall 48
Maximum Likelihood
Estimate (MLE)
 Obtain parameter estimates that maximize
the probability that the sample data occurs for
the specific model.
 Joint probability for observing the sample
data by multiplying the individual probabilities.
Likelihood function:
 Maximize L.
© Prentice Hall 49
MLE Example
 Coin toss five times: {H,H,H,H,T}
 Assuming a perfect coin with H and T equally
likely, the likelihood of this sequence is:
 However if the probability of a H is 0.8 then:
© Prentice Hall 50
MLE Example (cont’d)
 General likelihood formula:
 Estimate for p is then 4/5 = 0.8
© Prentice Hall 51
Expectation-Maximization
(EM)
 Solves estimation with incomplete data.
 Obtain initial estimates for parameters.
 Iteratively use estimates for missing
data and continue until convergence.
© Prentice Hall 52
EM Example
© Prentice Hall 53
EM Algorithm
© Prentice Hall 54
Models Based on Summarization
 Visualization: Frequency distribution, mean, variance,
median, mode, etc.
 Box Plot:
© Prentice Hall 55
Scatter Diagram
© Prentice Hall 56
Bayes Theorem
 Posterior Probability: P(h1|xi)
 Prior Probability: P(h1)
 Bayes Theorem:
 Assign probabilities of hypotheses given a
data value.
© Prentice Hall 57
Bayes Theorem Example
 Credit authorizations (hypotheses):
h1=authorize purchase, h2 = authorize after
further identification, h3=do not authorize,
h4= do not authorize but contact police
 Assign twelve data values for all
combinations of credit and income:
 From training data: P(h1) = 60%; P(h2)=20%;
P(h3)=10%; P(h4)=10%.
1 2 3 4
Excellent x1 x2 x3 x4
Good x5 x6 x7 x8
Bad x9 x10 x11 x12
© Prentice Hall 58
Bayes Example(cont’d)
 Training Data:
ID Income Credit Class xi
1 4 Excellent h1 x4
2 3 Good h1 x7
3 2 Excellent h1 x2
4 3 Good h1 x7
5 4 Good h1 x8
6 2 Excellent h1 x2
7 3 Bad h2 x11
8 2 Bad h2 x10
9 3 Bad h3 x11
10 1 Bad h4 x9
© Prentice Hall 59
Bayes Example(cont’d)
 Calculate P(xi|hj) and P(xi)
 Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6;
P(x8|h1)=1/6; P(xi|h1)=0 for all other xi.
 Predict the class for x4:
– Calculate P(hj|x4) for all hj.
– Place x4 in class with largest value.
– Ex:
»P(h1|x4)=(P(x4|h1)(P(h1))/P(x4)
=(1/6)(0.6)/0.1=1.
»x4 in class h1.
© Prentice Hall 60
Hypothesis Testing
 Find model to explain behavior by
creating and then testing a hypothesis
about the data.
 Exact opposite of usual DM approach.
 H0 – Null hypothesis; Hypothesis to be
tested.
 H1 – Alternative hypothesis
© Prentice Hall 61
Chi Squared Statistic
 O – observed value
 E – Expected value based on hypothesis.
 Ex:
– O={50,93,67,78,87}
– E=75
– c2=15.55 and therefore significant
© Prentice Hall 62
Regression
 Predict future values based on past
values
 Linear Regression assumes linear
relationship exists.
y = c0 + c1 x1 + … + cn xn
 Find values to best fit the data
© Prentice Hall 63
Linear Regression
© Prentice Hall 64
Correlation
 Examine the degree to which the values
for two variables behave similarly.
 Correlation coefficient r:
• 1 = perfect correlation
• -1 = perfect but opposite correlation
• 0 = no correlation
© Prentice Hall 65
Similarity Measures
 Determine similarity between two objects.
 Similarity characteristics:
 Alternatively, distance measure measure how
unlike or dissimilar objects are.
© Prentice Hall 66
Similarity Measures
© Prentice Hall 67
Distance Measures
 Measure dissimilarity between objects
© Prentice Hall 68
Twenty Questions Game
© Prentice Hall 69
Decision Trees
 Decision Tree (DT):
– Tree where the root and each internal node is
labeled with a question.
– The arcs represent each possible answer to
the associated question.
– Each leaf node represents a prediction of a
solution to the problem.
 Popular technique for classification; Leaf
node indicates class to which the
corresponding tuple belongs.
© Prentice Hall 70
Decision Tree Example
© Prentice Hall 71
Decision Trees
 A Decision Tree Model is a computational
model consisting of three parts:
– Decision Tree
– Algorithm to create the tree
– Algorithm that applies the tree to data
 Creation of the tree is the most difficult part.
 Processing is basically a search similar to
that in a binary search tree (although DT may
not be binary).
© Prentice Hall 72
Decision Tree Algorithm
© Prentice Hall 73
DT
Advantages/Disadvantages
 Advantages:
– Easy to understand.
– Easy to generate rules
 Disadvantages:
– May suffer from overfitting.
– Classifies by rectangular partitioning.
– Does not easily handle nonnumeric data.
– Can be quite large – pruning is necessary.
© Prentice Hall 74
Neural Networks
 Based on observed functioning of human
brain.
 (Artificial Neural Networks (ANN)
 Our view of neural networks is very
simplistic.
 We view a neural network (NN) from a
graphical viewpoint.
 Alternatively, a NN may be viewed from
the perspective of matrices.
 Used in pattern recognition, speech
recognition, computer vision, and
classification.
© Prentice Hall 75
Neural Networks
 Neural Network (NN) is a directed graph
F=<V,A> with vertices V={1,2,…,n} and arcs
A={<i,j>|1<=i,j<=n}, with the following
restrictions:
– V is partitioned into a set of input nodes, VI,
hidden nodes, VH, and output nodes, VO.
– The vertices are also partitioned into layers
– Any arc <i,j> must have node i in layer h-1
and node j in layer h.
– Arc <i,j> is labeled with a numeric value wij.
– Node i is labeled with a function fi.
© Prentice Hall 76
Neural Network Example
© Prentice Hall 77
NN Node
© Prentice Hall 78
NN Activation Functions
 Functions associated with nodes in
graph.
 Output may be in range [-1,1] or [0,1]
© Prentice Hall 79
NN Activation Functions
© Prentice Hall 80
NN Learning
 Propagate input values through graph.
 Compare output to desired output.
 Adjust weights in graph accordingly.
© Prentice Hall 81
Neural Networks
 A Neural Network Model is a computational
model consisting of three parts:
– Neural Network graph
– Learning algorithm that indicates how
learning takes place.
– Recall techniques that determine hew
information is obtained from the network.
 We will look at propagation as the recall
technique.
© Prentice Hall 82
NN Advantages
 Learning
 Can continue learning even after
training set has been applied.
 Easy parallelization
 Solves many problems
© Prentice Hall 83
NN Disadvantages
 Difficult to understand
 May suffer from overfitting
 Structure of graph must be determined
a priori.
 Input values must be numeric.
 Verification difficult.
© Prentice Hall 84
Genetic Algorithms
 Optimization search type algorithms.
 Creates an initial feasible solution and
iteratively creates new “better” solutions.
 Based on human evolution and survival of the
fittest.
 Must represent a solution as an individual.
 Individual: string I=I1,I2,…,In where Ij is in
given alphabet A.
 Each character Ij is called a gene.
 Population: set of individuals.
© Prentice Hall 85
Genetic Algorithms
 A Genetic Algorithm (GA) is a
computational model consisting of five parts:
– A starting set of individuals, P.
– Crossover: technique to combine two
parents to create offspring.
– Mutation: randomly change an individual.
– Fitness: determine the best individuals.
– Algorithm which applies the crossover and
mutation techniques to P iteratively using
the fitness function to determine the best
individuals in P to keep.
© Prentice Hall 86
Crossover Examples
111 111
000 000
Parents Children
111 000
000 111
a) Single Crossover
111 111
Parents Children
111 000
000
a) Single Crossover
111 111
000 000
Parents
a) Multiple Crossover
111 111
000
Parents Children
111 000
000 111
Children
111 000
000 11100
11
00
11
© Prentice Hall 87
Genetic Algorithm
© Prentice Hall 88
GA Advantages/Disadvantages
 Advantages
– Easily parallelized
 Disadvantages
– Difficult to understand and explain to end
users.
– Abstraction of the problem and method to
represent individuals is quite difficult.
– Determining fitness function is difficult.
– Determining how to perform crossover and
mutation is difficult.

More Related Content

What's hot

What's hot (20)

Model selection and cross validation techniques
Model selection and cross validation techniquesModel selection and cross validation techniques
Model selection and cross validation techniques
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : Concepts
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Metadata ppt
Metadata pptMetadata ppt
Metadata ppt
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Data warehouse 21 snowflake schema
Data warehouse 21 snowflake schemaData warehouse 21 snowflake schema
Data warehouse 21 snowflake schema
 
Data preparation
Data preparationData preparation
Data preparation
 
Knowledge discovery process
Knowledge discovery process Knowledge discovery process
Knowledge discovery process
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 
Ordbms
OrdbmsOrdbms
Ordbms
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
 
Data cleansing
Data cleansingData cleansing
Data cleansing
 
Multimedia Database
Multimedia Database Multimedia Database
Multimedia Database
 

Viewers also liked

01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
 
Solvency II - CIC Reflection
Solvency II - CIC ReflectionSolvency II - CIC Reflection
Solvency II - CIC ReflectionThu-Phuong DO
 
05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)
05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)
05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)溫秀嬌
 
11.歐洲旅行19天.結語.景點滿意度統計表
11.歐洲旅行19天.結語.景點滿意度統計表11.歐洲旅行19天.結語.景點滿意度統計表
11.歐洲旅行19天.結語.景點滿意度統計表溫秀嬌
 
USER STUDY FOR EXPLORATION OF USERS NEEDS
USER STUDY FOR EXPLORATION OF USERS NEEDS USER STUDY FOR EXPLORATION OF USERS NEEDS
USER STUDY FOR EXPLORATION OF USERS NEEDS IAEME Publication
 
A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM
A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM
A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM IAEME Publication
 
RESOURCE SHARING: A LIBRARY PERCEPTIVE
RESOURCE SHARING: A LIBRARY PERCEPTIVE RESOURCE SHARING: A LIBRARY PERCEPTIVE
RESOURCE SHARING: A LIBRARY PERCEPTIVE IAEME Publication
 
Knowledge Discovery in Databases
Knowledge Discovery in DatabasesKnowledge Discovery in Databases
Knowledge Discovery in DatabasesDiwas Kandel
 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Harish Chand
 
Self empowerment winarto aspri seminar
Self empowerment winarto aspri seminarSelf empowerment winarto aspri seminar
Self empowerment winarto aspri seminarhenry jaya teddy
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...
Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...
Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...vitlenko
 
Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 

Viewers also liked (20)

01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 
Cuento para el blog
Cuento para el blogCuento para el blog
Cuento para el blog
 
Solvency II - CIC Reflection
Solvency II - CIC ReflectionSolvency II - CIC Reflection
Solvency II - CIC Reflection
 
Bab iii matematika
Bab iii matematikaBab iii matematika
Bab iii matematika
 
05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)
05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)
05.晨遊德國烏帕塔城市.購物.晚餐.到杜塞道夫機場領回行李(.2016.05.22)
 
11.歐洲旅行19天.結語.景點滿意度統計表
11.歐洲旅行19天.結語.景點滿意度統計表11.歐洲旅行19天.結語.景點滿意度統計表
11.歐洲旅行19天.結語.景點滿意度統計表
 
USER STUDY FOR EXPLORATION OF USERS NEEDS
USER STUDY FOR EXPLORATION OF USERS NEEDS USER STUDY FOR EXPLORATION OF USERS NEEDS
USER STUDY FOR EXPLORATION OF USERS NEEDS
 
MSc_thesis
MSc_thesisMSc_thesis
MSc_thesis
 
A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM
A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM
A REVIEW ON LOAD BALANCING IN CLOUD USING ENHANCED GENETIC ALGORITHM
 
Data mining and knowledge Discovery
Data mining and knowledge DiscoveryData mining and knowledge Discovery
Data mining and knowledge Discovery
 
Professional code of ethics
Professional code of ethicsProfessional code of ethics
Professional code of ethics
 
RESOURCE SHARING: A LIBRARY PERCEPTIVE
RESOURCE SHARING: A LIBRARY PERCEPTIVE RESOURCE SHARING: A LIBRARY PERCEPTIVE
RESOURCE SHARING: A LIBRARY PERCEPTIVE
 
Knowledge Discovery in Databases
Knowledge Discovery in DatabasesKnowledge Discovery in Databases
Knowledge Discovery in Databases
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
Kdd process
Kdd processKdd process
Kdd process
 
Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)Data mining & data warehousing (ppt)
Data mining & data warehousing (ppt)
 
Self empowerment winarto aspri seminar
Self empowerment winarto aspri seminarSelf empowerment winarto aspri seminar
Self empowerment winarto aspri seminar
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...
Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...
Каталог распределителей, шкафчиков и дополнительных элементов (www.rbu-osnova...
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 

Similar to Data mining

Logical Data Fabric and Data Mesh – Driving Business Outcomes
Logical Data Fabric and Data Mesh – Driving Business OutcomesLogical Data Fabric and Data Mesh – Driving Business Outcomes
Logical Data Fabric and Data Mesh – Driving Business OutcomesDenodo
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
OLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingOLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingPrithwis Mukerjee
 
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is EssentialBig Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is EssentialBigDataExpo
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxshumPanwar
 
Introduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's GuideIntroduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's Guidegokulprasath06
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationDenodo
 
Dw-dm-part-01
Dw-dm-part-01Dw-dm-part-01
Dw-dm-part-01nash512
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl conceptsjeshocarme
 
Foundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information ArchitectureFoundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information ArchitectureInside Analysis
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 
The Database Environment Chapter 11
The Database Environment Chapter 11The Database Environment Chapter 11
The Database Environment Chapter 11Jeanie Arnoco
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data MiningRanak Ghosh
 
Decision Ready Data: Power Your Analytics with Great Data
Decision Ready Data: Power Your Analytics with Great DataDecision Ready Data: Power Your Analytics with Great Data
Decision Ready Data: Power Your Analytics with Great DataDLT Solutions
 

Similar to Data mining (20)

Part1
Part1Part1
Part1
 
Logical Data Fabric and Data Mesh – Driving Business Outcomes
Logical Data Fabric and Data Mesh – Driving Business OutcomesLogical Data Fabric and Data Mesh – Driving Business Outcomes
Logical Data Fabric and Data Mesh – Driving Business Outcomes
 
2. olap warehouse
2. olap warehouse2. olap warehouse
2. olap warehouse
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
OLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingOLAP Cubes in Datawarehousing
OLAP Cubes in Datawarehousing
 
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is EssentialBig Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
 
finalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptxfinalestkddfinalpresentation-111207021040-phpapp01.pptx
finalestkddfinalpresentation-111207021040-phpapp01.pptx
 
Introduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's GuideIntroduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's Guide
 
data mining
data miningdata mining
data mining
 
Part1
Part1Part1
Part1
 
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data Virtualization
 
dwdm unit 1.ppt
dwdm unit 1.pptdwdm unit 1.ppt
dwdm unit 1.ppt
 
Dw-dm-part-01
Dw-dm-part-01Dw-dm-part-01
Dw-dm-part-01
 
Dw & etl concepts
Dw & etl conceptsDw & etl concepts
Dw & etl concepts
 
Foundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information ArchitectureFoundation for Success: How Big Data Fits in an Information Architecture
Foundation for Success: How Big Data Fits in an Information Architecture
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
The Database Environment Chapter 11
The Database Environment Chapter 11The Database Environment Chapter 11
The Database Environment Chapter 11
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data Mining
 
Decision Ready Data: Power Your Analytics with Great Data
Decision Ready Data: Power Your Analytics with Great DataDecision Ready Data: Power Your Analytics with Great Data
Decision Ready Data: Power Your Analytics with Great Data
 

More from Hoang Nguyen

Rest api to integrate with your site
Rest api to integrate with your siteRest api to integrate with your site
Rest api to integrate with your siteHoang Nguyen
 
How to build a rest api
How to build a rest apiHow to build a rest api
How to build a rest apiHoang Nguyen
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsHoang Nguyen
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksHoang Nguyen
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cacheHoang Nguyen
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceHoang Nguyen
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friendHoang Nguyen
 
Python language data types
Python language data typesPython language data types
Python language data typesHoang Nguyen
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in pythonHoang Nguyen
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with pythonHoang Nguyen
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and pythonHoang Nguyen
 
Object oriented programming using c++
Object oriented programming using c++Object oriented programming using c++
Object oriented programming using c++Hoang Nguyen
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisHoang Nguyen
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithmsHoang Nguyen
 

More from Hoang Nguyen (20)

Rest api to integrate with your site
Rest api to integrate with your siteRest api to integrate with your site
Rest api to integrate with your site
 
How to build a rest api
How to build a rest apiHow to build a rest api
How to build a rest api
 
Api crash
Api crashApi crash
Api crash
 
Smm and caching
Smm and cachingSmm and caching
Smm and caching
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Cache recap
Cache recapCache recap
Cache recap
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friend
 
Python language data types
Python language data typesPython language data types
Python language data types
 
Python basics
Python basicsPython basics
Python basics
 
Programming for engineers in python
Programming for engineers in pythonProgramming for engineers in python
Programming for engineers in python
 
Learning python
Learning pythonLearning python
Learning python
 
Extending burp with python
Extending burp with pythonExtending burp with python
Extending burp with python
 
Cobol, lisp, and python
Cobol, lisp, and pythonCobol, lisp, and python
Cobol, lisp, and python
 
Object oriented programming using c++
Object oriented programming using c++Object oriented programming using c++
Object oriented programming using c++
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Object model
Object modelObject model
Object model
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 

Recently uploaded

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 

Recently uploaded (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 

Data mining

  • 1. © Prentice Hall 1 Data Mining Outline  PART I – Introduction – Related Concepts – Data Mining Techniques  PART II – Classification – Clustering – Association Rules  PART III – Web Mining – Spatial Mining – Temporal Mining
  • 2. © Prentice Hall 2 Introduction Outline  Define data mining  Data mining vs. databases  Basic data mining tasks  Data mining development  Data mining issues Goal: Provide an overview of data mining.
  • 3. © Prentice Hall 3 Introduction  Data is growing at a phenomenal rate  Users expect more sophisticated information  How? UNCOVER HIDDEN INFORMATION DATA MINING
  • 4. © Prentice Hall 4 Data Mining Definition  Finding hidden information in a database  Fit data to a model  Similar terms – Exploratory data analysis – Data driven discovery – Deductive learning
  • 5. © Prentice Hall 5 Data Mining Algorithm  Objective: Fit Data to a Model – Descriptive – Predictive  Preference – Technique to choose the best model  Search – Technique to search the data – “Query”
  • 6. © Prentice Hall 6 Database Processing vs. Data Mining Processing  Query – Well defined – SQL  Query – Poorly defined – No precise query language  Data – Operational data  Output – Precise – Subset of database  Data – Not operational data  Output – Fuzzy – Not a subset of database
  • 7. © Prentice Hall 7 Query Examples  Database  Data Mining – Find all customers who have purchased milk – Find all items which are frequently purchased with milk. (association rules) – Find all credit applicants with last name of Smith. – Identify customers who have purchased more than $10,000 in the last month. – Find all credit applicants who are poor credit risks. (classification) – Identify customers with similar buying habits. (Clustering)
  • 8. © Prentice Hall 8 Data Mining Models and Tasks
  • 9. © Prentice Hall 9 Basic Data Mining Tasks  Classification maps data into predefined groups or classes – Supervised learning – Pattern recognition – Prediction  Regression is used to map a data item to a real valued prediction variable.  Clustering groups similar data together into clusters. – Unsupervised learning – Segmentation – Partitioning
  • 10. © Prentice Hall 10 Basic Data Mining Tasks (cont’d)  Summarization maps data into subsets with associated simple descriptions. – Characterization – Generalization  Link Analysis uncovers relationships among data. – Affinity Analysis – Association Rules – Sequential Analysis determines sequential patterns.
  • 11. © Prentice Hall 11 Ex: Time Series Analysis  Example: Stock Market  Predict future values  Determine similar patterns over time  Classify behavior
  • 12. © Prentice Hall 12 Data Mining vs. KDD  Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.  Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
  • 13. © Prentice Hall 13 KDD Process  Selection: Obtain data from various sources.  Preprocessing: Cleanse data.  Transformation: Convert to common format. Transform to new format.  Data Mining: Obtain desired results.  Interpretation/Evaluation: Present results to user in meaningful manner. Modified from [FPSS96C]
  • 14. © Prentice Hall 14 KDD Process Ex: Web Log  Selection: – Select log data (dates and locations) to use  Preprocessing: – Remove identifying URLs – Remove error logs  Transformation: – Sessionize (sort and group)  Data Mining: – Identify and count patterns – Construct data structure  Interpretation/Evaluation: – Identify and display frequently accessed sequences.  Potential User Applications: – Cache prediction – Personalization
  • 15. © Prentice Hall 15 Data Mining Development •Similarity Measures •Hierarchical Clustering •IR Systems •Imprecise Queries •Textual Data •Web Search Engines •Bayes Theorem •Regression Analysis •EM Algorithm •K-Means Clustering •Time Series Analysis •Neural Networks •Decision Tree Algorithms •Algorithm Design Techniques •Algorithm Analysis •Data Structures •Relational Data Model •SQL •Association Rule Algorithms •Data Warehousing •Scalability Techniques
  • 16. © Prentice Hall 16 KDD Issues  Human Interaction  Overfitting  Outliers  Interpretation  Visualization  Large Datasets  High Dimensionality
  • 17. © Prentice Hall 17 KDD Issues (cont’d)  Multimedia Data  Missing Data  Irrelevant Data  Noisy Data  Changing Data  Integration  Application
  • 18. © Prentice Hall 18 Social Implications of DM  Privacy  Profiling  Unauthorized use
  • 19. © Prentice Hall 19 Data Mining Metrics  Usefulness  Return on Investment (ROI)  Accuracy  Space/Time
  • 20. © Prentice Hall 20 Database Perspective on Data Mining  Scalability  Real World Data  Updates  Ease of Use
  • 21. © Prentice Hall 21 Visualization Techniques  Graphical  Geometric  Icon-based  Pixel-based  Hierarchical  Hybrid
  • 22. © Prentice Hall 22 Related Concepts Outline  Database/OLTP Systems  Fuzzy Sets and Logic  Information Retrieval(Web Search Engines)  Dimensional Modeling  Data Warehousing  OLAP/DSS  Statistics  Machine Learning  Pattern Matching Goal: Examine some areas which are related to data mining.
  • 23. © Prentice Hall 23 DB & OLTP Systems  Schema – (ID,Name,Address,Salary,JobNo)  Data Model – ER – Relational  Transaction  Query: SELECT Name FROM T WHERE Salary > 100000 DM: Only imprecise queries
  • 24. © Prentice Hall 24 Fuzzy Sets and Logic  Fuzzy Set: Set membership function is a real valued function with output in the range [0,1].  f(x): Probability x is in F.  1-f(x): Probability x is not in F.  EX: – T = {x | x is a person and x is tall} – Let f(x) be the probability that x is tall – Here f is the membership function DM: Prediction and classification are fuzzy.
  • 25. © Prentice Hall 25 Fuzzy Sets
  • 26. © Prentice Hall 26 Classification/Prediction is Fuzzy Loan Amnt Simple Fuzzy Accept Accept Reject Reject
  • 27. © Prentice Hall 27 Information Retrieval  Information Retrieval (IR): retrieving desired information from textual data.  Library Science  Digital Libraries  Web Search Engines  Traditionally keyword based  Sample query: Find all documents about “data mining”. DM: Similarity measures; Mine text/Web data.
  • 28. © Prentice Hall 28 Information Retrieval (cont’d)  Similarity: measure of how close a query is to a document.  Documents which are “close enough” are retrieved.  Metrics: –Precision = |Relevant and Retrieved| |Retrieved| –Recall = |Relevant and Retrieved| |Relevant|
  • 29. © Prentice Hall 29 IR Query Result Measures and Classification IR Classification
  • 30. © Prentice Hall 30 Dimensional Modeling  View data in a hierarchical manner more as business executives might  Useful in decision support systems and mining  Dimension: collection of logically related attributes; axis for modeling data.  Facts: data stored  Ex: Dimensions – products, locations, date Facts – quantity, unit price DM: May view data as dimensional.
  • 31. © Prentice Hall 31 Relational View of Data ProdID LocID Date Quantity UnitPrice 123 Dallas 022900 5 25 123 Houston 020100 10 20 150 Dallas 031500 1 100 150 Dallas 031500 5 95 150 Fort Worth 021000 5 80 150 Chicago 012000 20 75 200 Seattle 030100 5 50 300 Rochester 021500 200 5 500 Bradenton 022000 15 20 500 Chicago 012000 10 25 1
  • 32. © Prentice Hall 32 Dimensional Modeling Queries  Roll Up: more general dimension  Drill Down: more specific dimension  Dimension (Aggregation) Hierarchy  SQL uses aggregation  Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.
  • 33. © Prentice Hall 33 Cube view of Data
  • 34. © Prentice Hall 34 Aggregation Hierarchies
  • 35. © Prentice Hall 35 Star Schema
  • 36. © Prentice Hall 36 Data Warehousing  “Subject-oriented, integrated, time-variant, nonvolatile” William Inmon  Operational Data: Data used in day to day needs of company.  Informational Data: Supports other functions such as planning and forecasting.  Data mining tools often access data warehouses rather than operational data. DM: May access data in warehouse.
  • 37. © Prentice Hall 37 Operational vs. Informational Operational Data Data Warehouse Application OLTP OLAP Use Precise Queries Ad Hoc Temporal Snapshot Historical Modification Dynamic Static Orientation Application Business Data Operational Values Integrated Size Gigabits Terabits Level Detailed Summarized Access Often Less Often Response Few Seconds Minutes Data Schema Relational Star/Snowflake
  • 38. © Prentice Hall 38 OLAP  Online Analytic Processing (OLAP): provides more complex queries than OLTP.  OnLine Transaction Processing (OLTP): traditional database/transaction processing.  Dimensional data; cube view  Visualization of operations: – Slice: examine sub-cube. – Dice: rotate cube to look at another dimension. – Roll Up/Drill Down DM: May use OLAP queries.
  • 39. © Prentice Hall 39 OLAP Operations Single Cell Multiple Cells Slice Dice Roll Up Drill Down
  • 40. © Prentice Hall 40 Statistics  Simple descriptive models  Statistical inference: generalizing a model created from a sample of the data to the entire dataset.  Exploratory Data Analysis: – Data can actually drive the creation of the model – Opposite of traditional statistical view.  Data mining targeted to business user DM: Many data mining methods come from statistical techniques.
  • 41. © Prentice Hall 41 Machine Learning  Machine Learning: area of AI that examines how to write programs that can learn.  Often used in classification and prediction  Supervised Learning: learns by example.  Unsupervised Learning: learns without knowledge of correct answers.  Machine learning often deals with small static datasets. DM: Uses many machine learning techniques.
  • 42. © Prentice Hall 42 Pattern Matching (Recognition)  Pattern Matching: finds occurrences of a predefined pattern in the data.  Applications include speech recognition, information retrieval, time series analysis. DM: Type of classification.
  • 43. © Prentice Hall 43 DM vs. Related Topics Area Query Data Results Output DB/OLTP Precise Database Precise DB Objects or Aggregation IR Precise Documents Vague Documents OLAP Analysis Multidimensional Precise DB Objects or Aggregation DM Vague Preprocessed Vague KDD Objects
  • 44. © Prentice Hall 44 Data Mining Techniques Outline  Statistical – Point Estimation – Models Based on Summarization – Bayes Theorem – Hypothesis Testing – Regression and Correlation  Similarity Measures  Decision Trees  Neural Networks – Activation Functions  Genetic Algorithms Goal: Provide an overview of basic data mining techniques
  • 45. © Prentice Hall 45 Point Estimation  Point Estimate: estimate a population parameter.  May be made by calculating the parameter for a sample.  May be used to predict value for missing data.  Ex: – R contains 100 employees – 99 have salary information – Mean salary of these is $50,000 – Use $50,000 as value of remaining employee’s salary. Is this a good idea?
  • 46. © Prentice Hall 46 Estimation Error  Bias: Difference between expected value and actual value.  Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value:  Why square?  Root Mean Square Error (RMSE)
  • 47. © Prentice Hall 47 Jackknife Estimate  Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.  Ex: estimate of mean for X={x1, … , xn}
  • 48. © Prentice Hall 48 Maximum Likelihood Estimate (MLE)  Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.  Joint probability for observing the sample data by multiplying the individual probabilities. Likelihood function:  Maximize L.
  • 49. © Prentice Hall 49 MLE Example  Coin toss five times: {H,H,H,H,T}  Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:  However if the probability of a H is 0.8 then:
  • 50. © Prentice Hall 50 MLE Example (cont’d)  General likelihood formula:  Estimate for p is then 4/5 = 0.8
  • 51. © Prentice Hall 51 Expectation-Maximization (EM)  Solves estimation with incomplete data.  Obtain initial estimates for parameters.  Iteratively use estimates for missing data and continue until convergence.
  • 52. © Prentice Hall 52 EM Example
  • 53. © Prentice Hall 53 EM Algorithm
  • 54. © Prentice Hall 54 Models Based on Summarization  Visualization: Frequency distribution, mean, variance, median, mode, etc.  Box Plot:
  • 55. © Prentice Hall 55 Scatter Diagram
  • 56. © Prentice Hall 56 Bayes Theorem  Posterior Probability: P(h1|xi)  Prior Probability: P(h1)  Bayes Theorem:  Assign probabilities of hypotheses given a data value.
  • 57. © Prentice Hall 57 Bayes Theorem Example  Credit authorizations (hypotheses): h1=authorize purchase, h2 = authorize after further identification, h3=do not authorize, h4= do not authorize but contact police  Assign twelve data values for all combinations of credit and income:  From training data: P(h1) = 60%; P(h2)=20%; P(h3)=10%; P(h4)=10%. 1 2 3 4 Excellent x1 x2 x3 x4 Good x5 x6 x7 x8 Bad x9 x10 x11 x12
  • 58. © Prentice Hall 58 Bayes Example(cont’d)  Training Data: ID Income Credit Class xi 1 4 Excellent h1 x4 2 3 Good h1 x7 3 2 Excellent h1 x2 4 3 Good h1 x7 5 4 Good h1 x8 6 2 Excellent h1 x2 7 3 Bad h2 x11 8 2 Bad h2 x10 9 3 Bad h3 x11 10 1 Bad h4 x9
  • 59. © Prentice Hall 59 Bayes Example(cont’d)  Calculate P(xi|hj) and P(xi)  Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6; P(x8|h1)=1/6; P(xi|h1)=0 for all other xi.  Predict the class for x4: – Calculate P(hj|x4) for all hj. – Place x4 in class with largest value. – Ex: »P(h1|x4)=(P(x4|h1)(P(h1))/P(x4) =(1/6)(0.6)/0.1=1. »x4 in class h1.
  • 60. © Prentice Hall 60 Hypothesis Testing  Find model to explain behavior by creating and then testing a hypothesis about the data.  Exact opposite of usual DM approach.  H0 – Null hypothesis; Hypothesis to be tested.  H1 – Alternative hypothesis
  • 61. © Prentice Hall 61 Chi Squared Statistic  O – observed value  E – Expected value based on hypothesis.  Ex: – O={50,93,67,78,87} – E=75 – c2=15.55 and therefore significant
  • 62. © Prentice Hall 62 Regression  Predict future values based on past values  Linear Regression assumes linear relationship exists. y = c0 + c1 x1 + … + cn xn  Find values to best fit the data
  • 63. © Prentice Hall 63 Linear Regression
  • 64. © Prentice Hall 64 Correlation  Examine the degree to which the values for two variables behave similarly.  Correlation coefficient r: • 1 = perfect correlation • -1 = perfect but opposite correlation • 0 = no correlation
  • 65. © Prentice Hall 65 Similarity Measures  Determine similarity between two objects.  Similarity characteristics:  Alternatively, distance measure measure how unlike or dissimilar objects are.
  • 66. © Prentice Hall 66 Similarity Measures
  • 67. © Prentice Hall 67 Distance Measures  Measure dissimilarity between objects
  • 68. © Prentice Hall 68 Twenty Questions Game
  • 69. © Prentice Hall 69 Decision Trees  Decision Tree (DT): – Tree where the root and each internal node is labeled with a question. – The arcs represent each possible answer to the associated question. – Each leaf node represents a prediction of a solution to the problem.  Popular technique for classification; Leaf node indicates class to which the corresponding tuple belongs.
  • 70. © Prentice Hall 70 Decision Tree Example
  • 71. © Prentice Hall 71 Decision Trees  A Decision Tree Model is a computational model consisting of three parts: – Decision Tree – Algorithm to create the tree – Algorithm that applies the tree to data  Creation of the tree is the most difficult part.  Processing is basically a search similar to that in a binary search tree (although DT may not be binary).
  • 72. © Prentice Hall 72 Decision Tree Algorithm
  • 73. © Prentice Hall 73 DT Advantages/Disadvantages  Advantages: – Easy to understand. – Easy to generate rules  Disadvantages: – May suffer from overfitting. – Classifies by rectangular partitioning. – Does not easily handle nonnumeric data. – Can be quite large – pruning is necessary.
  • 74. © Prentice Hall 74 Neural Networks  Based on observed functioning of human brain.  (Artificial Neural Networks (ANN)  Our view of neural networks is very simplistic.  We view a neural network (NN) from a graphical viewpoint.  Alternatively, a NN may be viewed from the perspective of matrices.  Used in pattern recognition, speech recognition, computer vision, and classification.
  • 75. © Prentice Hall 75 Neural Networks  Neural Network (NN) is a directed graph F=<V,A> with vertices V={1,2,…,n} and arcs A={<i,j>|1<=i,j<=n}, with the following restrictions: – V is partitioned into a set of input nodes, VI, hidden nodes, VH, and output nodes, VO. – The vertices are also partitioned into layers – Any arc <i,j> must have node i in layer h-1 and node j in layer h. – Arc <i,j> is labeled with a numeric value wij. – Node i is labeled with a function fi.
  • 76. © Prentice Hall 76 Neural Network Example
  • 77. © Prentice Hall 77 NN Node
  • 78. © Prentice Hall 78 NN Activation Functions  Functions associated with nodes in graph.  Output may be in range [-1,1] or [0,1]
  • 79. © Prentice Hall 79 NN Activation Functions
  • 80. © Prentice Hall 80 NN Learning  Propagate input values through graph.  Compare output to desired output.  Adjust weights in graph accordingly.
  • 81. © Prentice Hall 81 Neural Networks  A Neural Network Model is a computational model consisting of three parts: – Neural Network graph – Learning algorithm that indicates how learning takes place. – Recall techniques that determine hew information is obtained from the network.  We will look at propagation as the recall technique.
  • 82. © Prentice Hall 82 NN Advantages  Learning  Can continue learning even after training set has been applied.  Easy parallelization  Solves many problems
  • 83. © Prentice Hall 83 NN Disadvantages  Difficult to understand  May suffer from overfitting  Structure of graph must be determined a priori.  Input values must be numeric.  Verification difficult.
  • 84. © Prentice Hall 84 Genetic Algorithms  Optimization search type algorithms.  Creates an initial feasible solution and iteratively creates new “better” solutions.  Based on human evolution and survival of the fittest.  Must represent a solution as an individual.  Individual: string I=I1,I2,…,In where Ij is in given alphabet A.  Each character Ij is called a gene.  Population: set of individuals.
  • 85. © Prentice Hall 85 Genetic Algorithms  A Genetic Algorithm (GA) is a computational model consisting of five parts: – A starting set of individuals, P. – Crossover: technique to combine two parents to create offspring. – Mutation: randomly change an individual. – Fitness: determine the best individuals. – Algorithm which applies the crossover and mutation techniques to P iteratively using the fitness function to determine the best individuals in P to keep.
  • 86. © Prentice Hall 86 Crossover Examples 111 111 000 000 Parents Children 111 000 000 111 a) Single Crossover 111 111 Parents Children 111 000 000 a) Single Crossover 111 111 000 000 Parents a) Multiple Crossover 111 111 000 Parents Children 111 000 000 111 Children 111 000 000 11100 11 00 11
  • 87. © Prentice Hall 87 Genetic Algorithm
  • 88. © Prentice Hall 88 GA Advantages/Disadvantages  Advantages – Easily parallelized  Disadvantages – Difficult to understand and explain to end users. – Abstraction of the problem and method to represent individuals is quite difficult. – Determining fitness function is difficult. – Determining how to perform crossover and mutation is difficult.