08/05/14 Tanya Mathur Demo 1
Tanya Mathur
tanyamathur0908@gmail.com
Introduction to Data Mining
Introduction Outline
Goal: Provide an overview of data mining.
• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues
Introduction
• Data is produced at a phenomenal rate
• Our ability to store has grown
• Users expect more sophisticated information
• How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining
• Objective: Fit data to a model
• Potential Result: Higher-level meta information that may not be obvious when looking at raw data
• Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
Knowledge Discovery Process
[Figure: knowledge discovery process diagram. Labels: Databases, Data Cleaning, Data Integration, Preprocessed Data, Selection, Task-relevant Data, Data transformations, Data Mining, Interpretation, Knowledge]
KDD Process Ex: Web Log
• Selection:
– Select log data (dates and locations) to use
• Preprocessing:
– Remove identifying URLs
– Remove error logs
• Transformation:
– Sessionize (sort and group)
• Data Mining:
– Identify and count patterns
– Construct data structure
• Interpretation/Evaluation:
– Identify and display frequently accessed sequences.
• Potential User Applications:
– Cache prediction
– Personalization
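The web log steps can be sketched in a few lines of Python. This is only an illustrative sketch: the record layout `(user, timestamp, page, status)`, the sample data, and the 1800-second `SESSION_GAP` timeout are assumptions made for the example and do not come from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (user, timestamp in seconds, page, HTTP status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u2", 10, "/home", 200),
    ("u1", 60, "/cart", 200),
    ("u2", 40, "/products", 200),
    ("u2", 50, "/missing", 404),
    ("u1", 5000, "/home", 200),
    ("u1", 5030, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity timeout (seconds) that ends a session


def sessionize(records, gap=SESSION_GAP):
    """Drop error entries, then sort and group each user's hits into sessions."""
    by_user = defaultdict(list)
    for user, ts, page, status in records:
        if status < 400:  # preprocessing: remove error logs
            by_user[user].append((ts, page))
    sessions = []
    for hits in by_user.values():
        hits.sort()  # transformation: order each user's hits by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > gap:  # long pause: close the current session
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(LOG)
pair_counts = count_pairs(sessions)

# Interpretation: display the most frequently accessed sequences first.
for pair, n in pair_counts.most_common():
    print(" -> ".join(pair), n)
```

A cache predictor could use counts like these to prefetch the page that most often follows the current one.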
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (Clustering)
– Find all items which are frequently purchased with milk. (association rules)
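The difference between the two kinds of query can be illustrated with the milk example. The sketch below is hypothetical: the `BASKETS` data and the `MIN_COUNT` threshold for "frequently" are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical purchase records: customer -> items bought together.
BASKETS = {
    "alice": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread"},
    "carol": {"beer", "chips"},
    "dave": {"milk", "cereal", "bread"},
}

# Database-style query: an exact condition with a well-defined answer set.
milk_buyers = sorted(name for name, items in BASKETS.items() if "milk" in items)

# Data-mining-style query: which items co-occur with milk often enough?
MIN_COUNT = 2  # assumed threshold for "frequently"
co_counts = Counter()
for items in BASKETS.values():
    if "milk" in items:
        co_counts.update(items - {"milk"})
frequent_with_milk = sorted(item for item, n in co_counts.items() if n >= MIN_COUNT)

print(milk_buyers)
print(frequent_with_milk)
```

The first answer is fully determined by the stored data. The second depends on a threshold the analyst has to choose, which is typical of data mining queries.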
Data Mining Models and Tasks
Basic Data Mining Tasks
• Classification maps data into predefined groups or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a real-valued prediction variable.
• Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning
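The three tasks can be contrasted on toy one-dimensional data. The sketch below uses one simple technique per task (nearest-neighbor classification, a least-squares line, and k-means), chosen here only for illustration; the slides do not prescribe particular algorithms, and all data values and labels are made up.

```python
def classify_1nn(train, x):
    """Classification: return the label of the nearest labeled example."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def fit_line(points):
    """Regression: least-squares line y = a*x + b through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def kmeans_1d(values, centers, rounds=10):
    """Clustering: k-means on plain numbers, starting from the given centers."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Supervised: the classes are predefined and the training data carries labels.
labeled = [(20, "poor risk"), (25, "poor risk"), (60, "good risk"), (70, "good risk")]
label = classify_1nn(labeled, 55)

# Regression: predict a real value rather than a class.
slope, intercept = fit_line([(1, 2), (2, 4), (3, 6)])

# Unsupervised: no labels, the groups emerge from the data itself.
clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[1, 12])

print(label, slope, intercept, clusters)
```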
Basic Data Mining Tasks (cont’d)
• Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among data.
– Association Rules
– Sequential Analysis determines sequential patterns.
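Association rules are usually scored by support and confidence. The slides do not define these measures, so the sketch below relies on the standard definitions: support is the fraction of transactions containing an itemset, and confidence of a rule A => B is support(A and B) divided by support(A). The `TRANSACTIONS` data is made up for illustration.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimated probability of `rhs` given `lhs`, for the rule lhs => rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


rule_support = support({"bread", "milk"}, TRANSACTIONS)
rule_confidence = confidence({"bread"}, {"milk"}, TRANSACTIONS)

print(f"bread => milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```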
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
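The three time series activities can each be shown with a very small function. This is a rough sketch under simple assumptions: a moving average stands in for prediction, Euclidean distance for pattern similarity, and a first-versus-last comparison for classifying behavior. The price values are invented.

```python
def moving_average_forecast(series, window=3):
    """Predict the next value as the mean of the last `window` values."""
    return sum(series[-window:]) / window


def euclidean_distance(a, b):
    """Similarity of two equal-length series (a smaller value means more similar)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def classify_trend(series):
    """Classify behavior by comparing the first and last observations."""
    return "rising" if series[-1] > series[0] else "falling or flat"


prices = [10, 11, 12, 13, 14]  # hypothetical daily closing prices

forecast = moving_average_forecast(prices)
distance = euclidean_distance([1, 2, 3], [1, 2, 5])
behavior = classify_trend(prices)

print(forecast, distance, behavior)
```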
Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Neural Networks
• Decision Tree Algorithms
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
HIGH PERFORMANCE DATA MINING
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
KDD Issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Social Implications of DM
• Privacy
• Profiling
• Unauthorized use
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time complexity
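Two of these metrics have simple closed forms. The sketch below assumes the common definitions: accuracy is the fraction of correct predictions, and ROI is net gain divided by cost. The labels and monetary figures are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known outcomes."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def roi(gain, cost):
    """Return on investment: net gain relative to what was spent."""
    return (gain - cost) / cost


# Hypothetical classifier output compared with the true credit outcomes.
predicted = ["good", "poor", "good", "good"]
actual = ["good", "poor", "poor", "good"]

model_accuracy = accuracy(predicted, actual)
project_roi = roi(gain=15000, cost=10000)

print(model_accuracy, project_roi)
```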
