More Related Content
Similar to DM Lecture 2 (20)
DM Lecture 2
- 2. Outline
Motivation: Why Data Mining?
What is Data Mining?
History of Data Mining
Data Mining Functionality and Terminology
Data Mining Applications
Are all the Patterns Interesting?
Issues in Data Mining
2 © Copyright 2006, Natasha Balac
- 3. Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
1990s—2000s:
Data mining and data warehousing, multimedia databases, and
Web databases
3 © Copyright 2006, Natasha Balac
- 4. Necessity is the Mother of Invention
Data explosion
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information
repositories
We are drowning in data, but starving for
knowledge!
4 © Copyright 2006, Natasha Balac
- 5. Necessity is the Mother of Invention
We are drowning in data, but starving for
knowledge!
Solution - Data Mining
Data Warehousing and online analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
5 © Copyright 2006, Natasha Balac
- 6. Data Mining Main Objectives
Identification of data as a source of useful
information
Use of discovered information for
competitive advantages when working in
business environment
6 © Copyright 2006, Natasha Balac
- 7. Why DATA MINING?
Huge amounts of data
Electronic records of our decisions
Choices in the supermarket
Financial records
Our comings and goings
We swipe our way through the world – every
swipe is a record in a database
Data rich – but information poor
Lying hidden in all this data is information!
7 © Copyright 2006, Natasha Balac
- 8. Data vs. Information
Society produces massive amounts of data
business, science, medicine, economics, sports, …
Potentially valuable resource
Raw data is useless
need techniques to automatically extract information
Data: recorded facts (as in databases)
Information: patterns underlying the data
Patterns must be discovered automatically
Data – Information - Knowledge
8 © Copyright 2006, Natasha Balac
- 9. What is DATA MINING?
Extracting or “mining” knowledge from large
amounts of data
Data -driven discovery and modeling of
hidden patterns (we never new existed) in
large volumes of data
Process for Extraction of implicit, previously
unknown and unexpected, potentially
extremely useful information from data
9 © Copyright 2006, Natasha Balac
- 10. What Is Data Mining?
Data mining:
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large databases
Alternative names:
Data mining: a misnomer?
Knowledge discovery(mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
10 © Copyright 2006, Natasha Balac
- 11. Data Mining is NOT
Data Warehousing
(Deductive) query processing
SQL/ Reporting
Software Agents
Expert Systems
Online Analytical Processing (OLAP)
Statistical Analysis Tool
Data visualization
11 © Copyright 2006, Natasha Balac
- 12. Data Mining
Programs that detect patterns and rules in
the data
Strong patterns can be used to make non-trivial
predictions on new data
12 © Copyright 2006, Natasha Balac
- 13. Data Mining Challenges
Problem 1: most patterns are not
interesting
Problem 2: patterns may be inexact or
completely spurious when noisy data
present
13 © Copyright 2006, Natasha Balac
- 14. Machine Learning Techniques
Technical basis for data mining: algorithms for
acquiring structural descriptions from examples
Methods originate from artificial intelligence,
statistics, and research on databases
14 © Copyright 2006, Natasha Balac
- 15. Machine Learning Techniques
Structural descriptions represent patterns
explicitly can be used to
predict outcome in new situation
understand and explain how prediction is
derived (maybe even more important)
15 © Copyright 2006, Natasha Balac
- 16. Multidisciplinary Field
Database
Technology
Statistics
Data Mining
Visualization
Other
Disciplines
Machine
Learning
Artificial
Intelligence
16 © Copyright 2006, Natasha Balac
- 17. Multidisciplinary Field
Database technology
Artificial Intelligence
Machine Learning including Neural Networks
Statistics
Pattern recognition
Knowledge-based systems/acquisition
High-performance computing
Data visualization
17 © Copyright 2006, Natasha Balac
- 19. History
Emerged late 1980s
Flourished –1990s
Roots traced back along three family lines
Classical Statistics
Artificial Intelligence
Machine Learning
19 © Copyright 2006, Natasha Balac
- 20. Statistics
Foundation of most DM technologies
Regression analysis, standard
distribution/deviation/variance, cluster
analysis, confidence intervals
Building blocks
Significant role in today’s data mining –
but alone is not powerful enough
20 © Copyright 2006, Natasha Balac
- 21. Artificial Intelligence
Heuristics vs. Statistics
Human-thought-like processing
Requires vast computer processing power
Supercomputers
21 © Copyright 2006, Natasha Balac
- 22. Machine Learning
Union of statistics and AI
Blends AI heuristics with advanced statistical
analysis
Machine Learning – let computer programs
learn about data they study - make different
decisions based on the quality of studied data
using statistics for fundamental concepts and
adding more advanced AI heuristics and algorithms
22 © Copyright 2006, Natasha Balac
- 23. Data Mining
Adoption of the Machine learning techniques
to the real world problems
Union: Statistics, AI, Machine learning
Used to find previously hidden trends or
patterns
Finding increasing acceptance in science and
business areas which need to analyze large
amount of data to discover trends which could
not be found otherwise
23 © Copyright 2006, Natasha Balac
- 24. Terminology
Gold Mining (turn data tombs into golden nuggets of knowledge.
Finding small sets of precious nuggets from a great deal of raw material)
Knowledge mining from databases
Knowledge extraction
Data/pattern analysis
Knowledge Discovery Databases or KDD
Information harvesting
Business intelligence
24 © Copyright 2006, Natasha Balac
- 25. Data Mining: A KDD Process
Data mining: the core of
Pattern Evaluation
Data Mining
Data Mining: Concepts and
Data Warehouse
2Se5ptember 6, 2014 © Copyright 2006, Natasha Balac
Techniques 25
knowledge discovery
process.
Data Cleaning
Data Integration
Databases
Task-relevant Data
Selection and Transformation
- 26. Steps of a KDD Process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (remove noise, may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
26 © Copyright 2006, Natasha Balac
- 27. Data Mining and Business Intelligence
Increasing potential
to support
business decisions End User
Paper, Files, Information Providers, Database Systems, OLTP
September 6, 2014
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Data Mining: Concepts and
Business
Analyst
Data
Analyst
DBA
Techniques 27
27 © Copyright 2006, Natasha Balac
- 28. Architecture of a Typical Data Mining System
Graphical user interface
Pattern evaluation
Data mining engine
Database or data
warehouse server
Data cleaning & data integration Filtering
Data
Warehouse
Data Mining: Concepts and
S2ep8tember 6, 2014 © Copyright 2006, Natasha Balac
Techniques 28
Databases
Knowledge-base
Guide search , evaluate
interestingness in patterns
Interestingness measures to
focus search to interesting
patterns
- 29. Data Mining: On What Kind of Data?
Should be applicable to any kind of data repository, and transient data (data streams)
Types of databases and information repositories on which data mining can be
performed
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
29 © Copyright 2006, Natasha Balac
- 30. LEARNING ALGORITHMS
Fundamental idea:
learn rules/patterns/relationships
automatically from the data
30 © Copyright 2006, Natasha Balac
- 31. Data Mining Tasks
Exploratory Data Analysis
Predictive Modeling inference on current data to make predictions
Classification and Regression
Descriptive Modeling characterise general properties of the data
Cluster analysis/segmentation
Discovering Patterns and Rules
Association/Dependency rules
Sequential patterns
Temporal sequences
Deviation detection
31 © Copyright 2006, Natasha Balac
- 32. Data Mining Tasks
Data is associated with classes (eg computers, printers) or concepts (eg customer types)
Concept/Class description: Characterization and discrimination
Generalize, summarize (target class), and contrast data
characteristics, e.g., dry vs. wet regions (contrasting class)
Frequent patterns: patterns occurring frequently in the data
Frequent itemsets: set of items frequently appearing together in transactional data
Association (correlation and causality)
Multi-dimensional or single-dimensional association
age(X, “20-29”) ^ income(X, “60-90K”) buys(X, “TV”)
age(X, “20..29”) ^ income(X, “20..29K”) buys(X, “PC”) [support = 2%, confidence = 60%]
buys(T, “computer”) buys(T, “software”) [1%, 50%]
Assoc. rule discarded as uninteresting if not satisfying minimum support threshold and minimum confidence
threshold
32 © Copyright 2006, Natasha Balac
- 33. Data Mining Tasks
Classification
Finding models (functions) that describe and distinguish classes or
concepts for future prediction
Example: classify countries based on climate, or classify cars based on
gas mileage
Derived model based on analysis of a set of training data (objects of known
class label)
Model may be represented in various forms:
If-THEN rules, decision-tree, classification rule, neural network
Prediction
Models continuous valued function - Predict some unknown or missing
numerical values ( eg Regression analysis)
Relevance analysis – identify attributes not contributing to classification/prediction,
hence can be excluded
33 © Copyright 2006, Natasha Balac
- 34. Data Mining Tasks
Cluster analysis
Class label is unknown: Group data to form new
classes,
Example: cluster houses to find distribution
patterns
Clustering based on the principle: maximizing the
intra-class similarity and minimizing the interclass
similarity
Each cluster formed may be viewed as a class of objects
34 © Copyright 2006, Natasha Balac
- 35. Data Mining Tasks
Outlier analysis
Outlier: a data object that does not comply with the general
behavior of the data
Mostly considered as noise or exception, but is quite useful
in fraud detection, rare events analysis
Trend and evolution analysis model regularities or
trends for objects whose behaviour changes with time
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
35 © Copyright 2006, Natasha Balac
- 36. Data Mining: Classification Schemes
General functionality
Descriptive data mining Vs. Predictive data mining
(DDM – characterise general properties of data. PDM – perform inference on
data in order to make predictions)
Different views - different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques employed
36 © Copyright 2006, Natasha Balac
Kinds of applications
- 37. A Multi-Dimensional View of Data
Mining Classification
Databases to be mined
Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media,
WWW, etc.
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
Multiple/integrated functions
Mining at multiple levels of abstractions
37 © Copyright 2006, Natasha Balac
- 38. A Multi-Dimensional View of Data
Mining Classification
Techniques utilized
Decision/Regression trees, clustering, neural
networks, etc.
Applications adapted
Retail, telecom, banking, DNA mining, stock
market analysis, Web mining
38 © Copyright 2006, Natasha Balac
- 39. Data Mining Applications
Science: Chemistry, Physics, Medicine
Biochemical analysis
Remote sensors on a satellite
Telescopes – star galaxy classification
Medical Image analysis
39 © Copyright 2006, Natasha Balac
- 40. Data Mining Applications
Bioscience
Sequence-based analysis
Protein structure and function prediction
Protein family classification
Microarray gene expression
40 © Copyright 2006, Natasha Balac
- 41. Data Mining Applications
Pharmaceutical companies, Insurance
and Health care, Medicine
Drug development
Identify successful medical therapies
Claims analysis, fraudulent behavior
Medical diagnostic tools
Predict office visits
41 © Copyright 2006, Natasha Balac
- 42. Data Mining Applications
Financial Industry, Banks, Businesses, E-commerce
Stock and investment analysis
Identify loyal customers vs. risky customer
Predict customer spending
Risk management
Sales forecasting
42 © Copyright 2006, Natasha Balac
- 43. Data Mining Applications
Retail and Marketing
Customer buying patterns/demographic
characteristics
Mailing campaigns
Market basket analysis
Trend analysis
43 © Copyright 2006, Natasha Balac
- 44. Data Mining Applications
Database analysis and decision support
Market analysis and management
target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and management
44 © Copyright 2006, Natasha Balac
- 45. Data Mining Applications
Sports and Entertainment
IBM Advanced Scout analyzed NBA game
statistics (shots blocked, assists, and fouls) to
gain competitive advantage for New York
Knicks and Miami Heat
Astronomy
JPL and the Palomar Observatory discovered
22 quasars with the help of data mining
45 © Copyright 2006, Natasha Balac
- 46. DATA MINING EXAMPLES
Grocery store
NBA
Banking and Credit Card scoring
Fraud detection
Personalization & Customer Profiling
Campaign Management and Database
Marketing
46 © Copyright 2006, Natasha Balac
- 47. Data mining at work:
Case study 1
47 © Copyright 2006, Natasha Balac
- 48. Processing Loan Applications
Given: questionnaire with financial and personal
information
Problem: should money be lent?
Borderline cases referred to loan officers
But: 50% of accepted borderline cases defaulted!
Solution:
reject all borderline cases?
Borderline cases are most active customers!
48 © Copyright 2006, Natasha Balac
- 49. Enter Machine Learning
Given:
1000 training examples of borderline cases
20 attributes:
age, years with current employer,years at current address,
years with the bank, years at current job, other credit cards
Learned rules predicted 2/3 of borderline cases
correctly!
Rules could be used to explain decisions to customers
49 © Copyright 2006, Natasha Balac
- 50. Case study 2:
Screening images
Given:
radar satellite images of coastal waters
Problem:
detecting oil slicks in those images
Oil slicks = dark regions with changing size
and shape
Look-alike dark regions can be caused by
weather conditions (e.g. high wind)
Expensive process requiring highly trained
personnel
50 © Copyright 2006, Natasha Balac
- 51. Enter Machine Learning
Dark regions extracted from normalized image
Attributes:
size of region, shape, area, intensity, sharpness
and jaggedness of boundaries, proximity of other
regions, info about background
Constraints:
Scarcity of training examples (oil slicks are rare!)
Unbalanced data: most dark regions aren’t oil
slicks
Regions from same image form a batch
Requirement is adjustable false-alarm rate
51 © Copyright 2006, Natasha Balac
- 52. Data Mining Challenges
Computationally expensive to investigate all
possibilities
Dealing with noise/missing information and
errors in data
Choosing appropriate attributes/input
representation
Finding the minimal attribute space
Finding adequate evaluation function(s)
Extracting meaningful information
Not overfitting
52 © Copyright 2006, Natasha Balac
- 53. Are All the “Discovered” Patterns
Interesting?
DM system may generate thousands/millions of patterns/rules
Interestingness measures: A pattern is interesting if it is easily understood
by humans, valid on new or test data with some degree of certainty,
potentially useful, novel, or validates some hypothesis that a user seeks to
confirm
support: %age of transactions the given rule satisfies
support( X => Y ) = P ( X U Y )
confidence: degree of certainty of detected association
confidence(X => Y ) = P ( Y | X )
53 © Copyright 2006, Natasha Balac
- 54. Are All the “Discovered” Patterns
Interesting?
Objective vs. subjective measures:
Objective: based on statistics and structures of
patterns
support and confidence
Subjective: based on user’s belief in the data
unexpectedness, novelty, actionability, confirm
hypothesis user wishes to validate or resemble hunch
54 © Copyright 2006, Natasha Balac
- 55. Can We Find All and Only Interesting
Patterns?
An Optimisation problem – completeness of DM algo
Completeness - Find all the interesting
patterns
Can a data mining system find all the interesting
patterns?
Association vs. classification vs. clustering
55 © Copyright 2006, Natasha Balac
- 56. Can We Find All and Only Interesting
Patterns?
Optimization - Search for only interesting patterns
Can a data mining system find only the interesting
patterns?
Approaches
First generate all the patterns and then filter out the
uninteresting ones
Mining query optimization (improve search efficiency by pruning
away subsets of the pattern space not satisfying interestingness constraints)
56 © Copyright 2006, Natasha Balac
- 57. Major Issues in Data Mining
Mining methodology and user interaction
Mining different kinds of knowledge in databases
Incorporation of background knowledge allow discovered
patterns to be expressed in concise terms at different levels of abstraction
Handling noise and incomplete data may confuse the process
Pattern evaluation: the interestingness problem
Expression and visualization of data mining results
in expressive forms so that knowledge is easily understood, usable by humans
57 © Copyright 2006, Natasha Balac
- 58. Major Issues in Data Mining
Performance and scalability huge data volumes
Efficiency of data mining algorithms
Parallel, distributed and incremental mining
methods
Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from diverse databases
Unrealistic to expect one system to mine all kinds of data
58 © Copyright 2006, Natasha Balac
- 59. Major Issues in Data Mining
Issues related to applications and social
impacts
Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Expert systems
Process control and decision making
A knowledge fusion problem
Protection of data security, integrity, and privacy
59 © Copyright 2006, Natasha Balac
- 60. Summary
Data mining: discovering interesting patterns
from large amounts of data
A KDD process includes data cleaning, data
integration, data selection, transformation, data
mining, pattern evaluation, and knowledge
presentation
60 © Copyright 2006, Natasha Balac
- 61. Summary
Mining can be performed in a variety of
information repositories
Data mining functionalities: characterization,
association, classification, clustering, outlier
and trend analysis, etc.
Classification of data mining systems
Major issues in data mining
61 © Copyright 2006, Natasha Balac
- 62. Kinds of Data Mining
Decision Tree Learning
Clustering
Neural Networks
Association Rules
Support Vector Machines
Genetic Algorithms
Nearest Neighbor Method
62 © Copyright 2006, Natasha Balac
- 64. DECISION TREE FOR THE CONCEPT
“Play Tennis”
Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
Mitchell, 1D99174 Rain Mild High Strong No
64 © Copyright 2006, Natasha Balac
- 65. DECISION TREE FOR THE CONCEPT
“Play Tennis”
[Mitchell,1997]
65 © Copyright 2006, Natasha Balac