DM Lecture 2

1
Introduction to Data Mining
Credit: Natasha Balac, Ph.D.

Outline
 Motivation: Why Data Mining?
 What is Data Mining?
 History of Data Mining
 Data Mining Functionality and Terminology
 Data Mining Applications
 Are all the Patterns Interesting?
 Issues in Data Mining
2 © Copyright 2006, Natasha Balac

Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,
scientific, engineering, etc.)
 1990s—2000s:
 Data mining and data warehousing, multimedia databases, and
Web databases

Necessity is the Mother of Invention
 Data explosion
 Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information
repositories
 We are drowning in data, but starving for
knowledge!

Necessity is the Mother of Invention
 We are drowning in data, but starving for
knowledge!
 Solution - Data Mining
 Data Warehousing and online analytical processing
 Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases

Data Mining Main Objectives
 Identification of data as a source of useful
information
 Use of discovered information for
competitive advantages when working in
business environment

Why DATA MINING?
 Huge amounts of data
 Electronic records of our decisions
 Choices in the supermarket
 Financial records
 Our comings and goings
 We swipe our way through the world – every
swipe is a record in a database
 Data rich – but information poor
 Lying hidden in all this data is information!

Data vs. Information
 Society produces massive amounts of data
 business, science, medicine, economics, sports, …
 Potentially valuable resource
 Raw data is useless
 need techniques to automatically extract information
 Data: recorded facts (as in databases)
 Information: patterns underlying the data
 Patterns must be discovered automatically
Data – Information - Knowledge

What is DATA MINING?
 Extracting or “mining” knowledge from large
amounts of data
 Data -driven discovery and modeling of
hidden patterns (we never new existed) in
large volumes of data
 Process for Extraction of implicit, previously
unknown and unexpected, potentially
extremely useful information from data

What Is Data Mining?
 Data mining:
 Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large databases
Alternative names:
Data mining: a misnomer?
Knowledge discovery(mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.

Data Mining is NOT
 Data Warehousing
 (Deductive) query processing
 SQL/ Reporting
 Software Agents
 Expert Systems
 Online Analytical Processing (OLAP)
 Statistical Analysis Tool
 Data visualization

Data Mining
 Programs that detect patterns and rules in
the data
 Strong patterns can be used to make non-trivial
predictions on new data

Data Mining Challenges
 Problem 1: most patterns are not
interesting
 Problem 2: patterns may be inexact or
completely spurious when noisy data
present

Machine Learning Techniques
 Technical basis for data mining: algorithms for
acquiring structural descriptions from examples
 Methods originate from artificial intelligence,
statistics, and research on databases

Machine Learning Techniques
 Structural descriptions represent patterns
explicitly can be used to
 predict outcome in new situation
 understand and explain how prediction is
derived (maybe even more important)

Multidisciplinary Field
Database
Technology
Statistics
Data Mining
Visualization
Other
Disciplines
Machine
Learning
Artificial
Intelligence

Multidisciplinary Field
 Database technology
 Artificial Intelligence
 Machine Learning including Neural Networks
 Statistics
 Pattern recognition
 Knowledge-based systems/acquisition
 High-performance computing
 Data visualization

History of Data Mining

History
 Emerged late 1980s
 Flourished –1990s
 Roots traced back along three family lines
 Classical Statistics
 Artificial Intelligence
 Machine Learning

Statistics
 Foundation of most DM technologies
 Regression analysis, standard
distribution/deviation/variance, cluster
analysis, confidence intervals
 Building blocks
 Significant role in today’s data mining –
but alone is not powerful enough

Artificial Intelligence
 Heuristics vs. Statistics
 Human-thought-like processing
 Requires vast computer processing power
 Supercomputers

Machine Learning
 Union of statistics and AI
 Blends AI heuristics with advanced statistical
analysis
 Machine Learning – let computer programs
 learn about data they study - make different
decisions based on the quality of studied data
 using statistics for fundamental concepts and
adding more advanced AI heuristics and algorithms

Data Mining
 Adoption of the Machine learning techniques
to the real world problems
 Union: Statistics, AI, Machine learning
 Used to find previously hidden trends or
patterns
 Finding increasing acceptance in science and
business areas which need to analyze large
amount of data to discover trends which could
not be found otherwise

Terminology
 Gold Mining (turn data tombs into golden nuggets of knowledge.
Finding small sets of precious nuggets from a great deal of raw material)
 Knowledge mining from databases
 Knowledge extraction
 Data/pattern analysis
 Knowledge Discovery Databases or KDD
 Information harvesting
 Business intelligence

Data Mining: A KDD Process
 Data mining: the core of
Pattern Evaluation
Data Mining
Data Mining: Concepts and
Data Warehouse
2Se5ptember 6, 2014 © Copyright 2006, Natasha Balac
Techniques 25
knowledge discovery
process.
Data Cleaning
Data Integration
Databases
Task-relevant Data
Selection and Transformation

Steps of a KDD Process
 Learning the application domain:
 relevant prior knowledge and goals of application
 Creating a target data set: data selection
 Data cleaning and preprocessing: (remove noise, may take 60% of effort!)
 Data reduction and transformation:
 Find useful features, dimensionality/variable reduction, invariant
representation.
 Choosing functions of data mining
 summarization, classification, regression, association, clustering.
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge

Data Mining and Business Intelligence
Increasing potential
to support
business decisions End User
Paper, Files, Information Providers, Database Systems, OLTP
September 6, 2014
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Business
Analyst
Data
Analyst
DBA
Techniques 27

Architecture of a Typical Data Mining System
Graphical user interface
Pattern evaluation
Data mining engine
Database or data
warehouse server
Data cleaning & data integration Filtering
Data
Warehouse
S2ep8tember 6, 2014 © Copyright 2006, Natasha Balac
Techniques 28
Databases
Knowledge-base
Guide search , evaluate
interestingness in patterns
Interestingness measures to
focus search to interesting
patterns

Data Mining: On What Kind of Data?
Should be applicable to any kind of data repository, and transient data (data streams)
Types of databases and information repositories on which data mining can be
performed
 Relational databases
 Data warehouses
 Transactional databases
 Advanced DB and information repositories
 Object-oriented and object-relational databases
 Spatial databases
 Time-series data and temporal data
 Text databases and multimedia databases
 Heterogeneous and legacy databases
 WWW

LEARNING ALGORITHMS
 Fundamental idea:
learn rules/patterns/relationships
automatically from the data

Data Mining Tasks
 Exploratory Data Analysis
 Predictive Modeling inference on current data to make predictions
 Classification and Regression
 Descriptive Modeling characterise general properties of the data
 Cluster analysis/segmentation
 Discovering Patterns and Rules
 Association/Dependency rules
 Sequential patterns
 Temporal sequences
 Deviation detection

Data Mining Tasks
Data is associated with classes (eg computers, printers) or concepts (eg customer types)
 Concept/Class description: Characterization and discrimination
 Generalize, summarize (target class), and contrast data
characteristics, e.g., dry vs. wet regions (contrasting class)
Frequent patterns: patterns occurring frequently in the data
Frequent itemsets: set of items frequently appearing together in transactional data
 Association (correlation and causality)
 Multi-dimensional or single-dimensional association
age(X, “20-29”) ^ income(X, “60-90K”)  buys(X, “TV”)
age(X, “20..29”) ^ income(X, “20..29K”)  buys(X, “PC”) [support = 2%, confidence = 60%]
buys(T, “computer”)  buys(T, “software”) [1%, 50%]
Assoc. rule discarded as uninteresting if not satisfying minimum support threshold and minimum confidence
threshold

Data Mining Tasks
 Classification
 Finding models (functions) that describe and distinguish classes or
concepts for future prediction
 Example: classify countries based on climate, or classify cars based on
gas mileage
Derived model based on analysis of a set of training data (objects of known
class label)
 Model may be represented in various forms:
 If-THEN rules, decision-tree, classification rule, neural network
 Prediction
 Models continuous valued function - Predict some unknown or missing
numerical values ( eg Regression analysis)
Relevance analysis – identify attributes not contributing to classification/prediction,
hence can be excluded

Data Mining Tasks
 Cluster analysis
 Class label is unknown: Group data to form new
classes,
 Example: cluster houses to find distribution
patterns
 Clustering based on the principle: maximizing the
intra-class similarity and minimizing the interclass
similarity
Each cluster formed may be viewed as a class of objects

Data Mining Tasks
 Outlier analysis
 Outlier: a data object that does not comply with the general
behavior of the data
 Mostly considered as noise or exception, but is quite useful
in fraud detection, rare events analysis
 Trend and evolution analysis model regularities or
trends for objects whose behaviour changes with time
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis

Data Mining: Classification Schemes
 General functionality
 Descriptive data mining Vs. Predictive data mining
(DDM – characterise general properties of data. PDM – perform inference on
data in order to make predictions)
 Different views - different classifications
 Kinds of databases to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques employed
 Kinds of applications

A Multi-Dimensional View of Data
Mining Classification
 Databases to be mined
 Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multi-media,
WWW, etc.
 Knowledge to be mined
 Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
 Multiple/integrated functions
 Mining at multiple levels of abstractions

A Multi-Dimensional View of Data
Mining Classification
 Techniques utilized
 Decision/Regression trees, clustering, neural
networks, etc.
 Applications adapted
 Retail, telecom, banking, DNA mining, stock
market analysis, Web mining

Data Mining Applications
 Science: Chemistry, Physics, Medicine
 Biochemical analysis
 Remote sensors on a satellite
 Telescopes – star galaxy classification
 Medical Image analysis

 Bioscience
 Sequence-based analysis
 Protein structure and function prediction
 Protein family classification
 Microarray gene expression

 Pharmaceutical companies, Insurance
and Health care, Medicine
 Drug development
 Identify successful medical therapies
 Claims analysis, fraudulent behavior
 Medical diagnostic tools
 Predict office visits

 Financial Industry, Banks, Businesses, E-commerce
 Stock and investment analysis
 Identify loyal customers vs. risky customer
 Predict customer spending
 Risk management
 Sales forecasting

 Retail and Marketing
 Customer buying patterns/demographic
characteristics
 Mailing campaigns
 Market basket analysis
 Trend analysis

 Database analysis and decision support
 Market analysis and management
 target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
 Fraud detection and management

 Sports and Entertainment
 IBM Advanced Scout analyzed NBA game
statistics (shots blocked, assists, and fouls) to
gain competitive advantage for New York
Knicks and Miami Heat
 Astronomy
 JPL and the Palomar Observatory discovered
22 quasars with the help of data mining

DATA MINING EXAMPLES
 Grocery store
 NBA
 Banking and Credit Card scoring
 Fraud detection
 Personalization & Customer Profiling
 Campaign Management and Database
Marketing

Data mining at work:
Case study 1

Processing Loan Applications
 Given: questionnaire with financial and personal
information
 Problem: should money be lent?
 Borderline cases referred to loan officers
 But: 50% of accepted borderline cases defaulted!
 Solution:
 reject all borderline cases?
 Borderline cases are most active customers!

Enter Machine Learning
 Given:
 1000 training examples of borderline cases
 20 attributes:
 age, years with current employer,years at current address,
years with the bank, years at current job, other credit cards
 Learned rules predicted 2/3 of borderline cases
correctly!
 Rules could be used to explain decisions to customers

Case study 2:
Screening images
 Given:
 radar satellite images of coastal waters
 Problem:
 detecting oil slicks in those images
 Oil slicks = dark regions with changing size
and shape
 Look-alike dark regions can be caused by
weather conditions (e.g. high wind)
 Expensive process requiring highly trained
personnel

Enter Machine Learning
 Dark regions extracted from normalized image
 Attributes:
 size of region, shape, area, intensity, sharpness
and jaggedness of boundaries, proximity of other
regions, info about background
 Constraints:
 Scarcity of training examples (oil slicks are rare!)
 Unbalanced data: most dark regions aren’t oil
slicks
 Regions from same image form a batch
 Requirement is adjustable false-alarm rate

Data Mining Challenges
 Computationally expensive to investigate all
possibilities
 Dealing with noise/missing information and
errors in data
 Choosing appropriate attributes/input
representation
 Finding the minimal attribute space
 Finding adequate evaluation function(s)
 Extracting meaningful information
 Not overfitting

Are All the “Discovered” Patterns
Interesting?
DM system may generate thousands/millions of patterns/rules
 Interestingness measures: A pattern is interesting if it is easily understood
by humans, valid on new or test data with some degree of certainty,
potentially useful, novel, or validates some hypothesis that a user seeks to
confirm
support: %age of transactions the given rule satisfies
support( X => Y ) = P ( X U Y )
confidence: degree of certainty of detected association
confidence(X => Y ) = P ( Y | X )

Are All the “Discovered” Patterns
Interesting?
 Objective vs. subjective measures:
 Objective: based on statistics and structures of
patterns
 support and confidence
 Subjective: based on user’s belief in the data
 unexpectedness, novelty, actionability, confirm
hypothesis user wishes to validate or resemble hunch

Can We Find All and Only Interesting
Patterns?
An Optimisation problem – completeness of DM algo
 Completeness - Find all the interesting
patterns
 Can a data mining system find all the interesting
patterns?
 Association vs. classification vs. clustering

Can We Find All and Only Interesting
Patterns?
 Optimization - Search for only interesting patterns
 Can a data mining system find only the interesting
patterns?
 Approaches
 First generate all the patterns and then filter out the
uninteresting ones
 Mining query optimization (improve search efficiency by pruning
away subsets of the pattern space not satisfying interestingness constraints)

Major Issues in Data Mining
 Mining methodology and user interaction
 Mining different kinds of knowledge in databases
 Incorporation of background knowledge allow discovered
patterns to be expressed in concise terms at different levels of abstraction
 Handling noise and incomplete data may confuse the process
 Pattern evaluation: the interestingness problem
 Expression and visualization of data mining results
in expressive forms so that knowledge is easily understood, usable by humans

 Performance and scalability huge data volumes
 Efficiency of data mining algorithms
 Parallel, distributed and incremental mining
methods
 Issues relating to the diversity of data types
 Handling relational and complex types of data
 Mining information from diverse databases
Unrealistic to expect one system to mine all kinds of data

 Issues related to applications and social
impacts
 Application of discovered knowledge
 Domain-specific data mining tools
 Intelligent query answering
 Expert systems
 Process control and decision making
 A knowledge fusion problem
 Protection of data security, integrity, and privacy

Summary
 Data mining: discovering interesting patterns
from large amounts of data
 A KDD process includes data cleaning, data
integration, data selection, transformation, data
mining, pattern evaluation, and knowledge
presentation

Summary
 Mining can be performed in a variety of
information repositories
 Data mining functionalities: characterization,
association, classification, clustering, outlier
and trend analysis, etc.
 Classification of data mining systems
 Major issues in data mining

Kinds of Data Mining
 Decision Tree Learning
 Clustering
 Neural Networks
 Association Rules
 Support Vector Machines
 Genetic Algorithms
 Nearest Neighbor Method

Decision Tree Example
Grandparents
A lot
A little

DECISION TREE FOR THE CONCEPT
“Play Tennis”
Day Outlook Temp Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
Mitchell, 1D99174 Rain Mild High Strong No

DECISION TREE FOR THE CONCEPT
“Play Tennis”
[Mitchell,1997]

DM Lecture 2

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to DM Lecture 2

Similar to DM Lecture 2 (20)

Recently uploaded

Recently uploaded (20)

DM Lecture 2