SlideShare a Scribd company logo
1 of 65
1 
Introduction to Data Mining 
Credit: Natasha Balac, Ph.D.
Outline 
 Motivation: Why Data Mining? 
 What is Data Mining? 
 History of Data Mining 
 Data Mining Functionality and Terminology 
 Data Mining Applications 
 Are all the Patterns Interesting? 
 Issues in Data Mining 
2 © Copyright 2006, Natasha Balac
Evolution of Database Technology 
 1960s: 
 Data collection, database creation, IMS and network DBMS 
 1970s: 
 Relational data model, relational DBMS implementation 
 1980s: 
 RDBMS, advanced data models (extended-relational, OO, 
deductive, etc.) and application-oriented DBMS (spatial, 
scientific, engineering, etc.) 
 1990s—2000s: 
 Data mining and data warehousing, multimedia databases, and 
Web databases 
3 © Copyright 2006, Natasha Balac
Necessity is the Mother of Invention 
 Data explosion 
 Automated data collection tools and mature database 
technology lead to tremendous amounts of data stored in 
databases, data warehouses and other information 
repositories 
 We are drowning in data, but starving for 
knowledge! 
4 © Copyright 2006, Natasha Balac
Necessity is the Mother of Invention 
 We are drowning in data, but starving for 
knowledge! 
 Solution - Data Mining 
 Data Warehousing and online analytical processing 
 Extraction of interesting knowledge (rules, regularities, 
patterns, constraints) from data in large databases 
5 © Copyright 2006, Natasha Balac
Data Mining Main Objectives 
 Identification of data as a source of useful 
information 
 Use of discovered information for 
competitive advantages when working in 
business environment 
6 © Copyright 2006, Natasha Balac
Why DATA MINING? 
 Huge amounts of data 
 Electronic records of our decisions 
 Choices in the supermarket 
 Financial records 
 Our comings and goings 
 We swipe our way through the world – every 
swipe is a record in a database 
 Data rich – but information poor 
 Lying hidden in all this data is information! 
7 © Copyright 2006, Natasha Balac
Data vs. Information 
 Society produces massive amounts of data 
 business, science, medicine, economics, sports, … 
 Potentially valuable resource 
 Raw data is useless 
 need techniques to automatically extract information 
 Data: recorded facts (as in databases) 
 Information: patterns underlying the data 
 Patterns must be discovered automatically 
Data – Information - Knowledge 
8 © Copyright 2006, Natasha Balac
What is DATA MINING? 
 Extracting or “mining” knowledge from large 
amounts of data 
 Data -driven discovery and modeling of 
hidden patterns (we never new existed) in 
large volumes of data 
 Process for Extraction of implicit, previously 
unknown and unexpected, potentially 
extremely useful information from data 
9 © Copyright 2006, Natasha Balac
What Is Data Mining? 
 Data mining: 
 Extraction of interesting (non-trivial, implicit, 
previously unknown and potentially useful) 
information or patterns from data in large databases 
Alternative names: 
Data mining: a misnomer? 
Knowledge discovery(mining) in databases (KDD), knowledge extraction, 
data/pattern analysis, data archeology, data dredging, information harvesting, 
business intelligence, etc. 
10 © Copyright 2006, Natasha Balac
Data Mining is NOT 
 Data Warehousing 
 (Deductive) query processing 
 SQL/ Reporting 
 Software Agents 
 Expert Systems 
 Online Analytical Processing (OLAP) 
 Statistical Analysis Tool 
 Data visualization 
11 © Copyright 2006, Natasha Balac
Data Mining 
 Programs that detect patterns and rules in 
the data 
 Strong patterns can be used to make non-trivial 
predictions on new data 
12 © Copyright 2006, Natasha Balac
Data Mining Challenges 
 Problem 1: most patterns are not 
interesting 
 Problem 2: patterns may be inexact or 
completely spurious when noisy data 
present 
13 © Copyright 2006, Natasha Balac
Machine Learning Techniques 
 Technical basis for data mining: algorithms for 
acquiring structural descriptions from examples 
 Methods originate from artificial intelligence, 
statistics, and research on databases 
14 © Copyright 2006, Natasha Balac
Machine Learning Techniques 
 Structural descriptions represent patterns 
explicitly can be used to 
 predict outcome in new situation 
 understand and explain how prediction is 
derived (maybe even more important) 
15 © Copyright 2006, Natasha Balac
Multidisciplinary Field 
Database 
Technology 
Statistics 
Data Mining 
Visualization 
Other 
Disciplines 
Machine 
Learning 
Artificial 
Intelligence 
16 © Copyright 2006, Natasha Balac
Multidisciplinary Field 
 Database technology 
 Artificial Intelligence 
 Machine Learning including Neural Networks 
 Statistics 
 Pattern recognition 
 Knowledge-based systems/acquisition 
 High-performance computing 
 Data visualization 
17 © Copyright 2006, Natasha Balac
History of Data Mining 
18 © Copyright 2006, Natasha Balac
History 
 Emerged late 1980s 
 Flourished –1990s 
 Roots traced back along three family lines 
 Classical Statistics 
 Artificial Intelligence 
 Machine Learning 
19 © Copyright 2006, Natasha Balac
Statistics 
 Foundation of most DM technologies 
 Regression analysis, standard 
distribution/deviation/variance, cluster 
analysis, confidence intervals 
 Building blocks 
 Significant role in today’s data mining – 
but alone is not powerful enough 
20 © Copyright 2006, Natasha Balac
Artificial Intelligence 
 Heuristics vs. Statistics 
 Human-thought-like processing 
 Requires vast computer processing power 
 Supercomputers 
21 © Copyright 2006, Natasha Balac
Machine Learning 
 Union of statistics and AI 
 Blends AI heuristics with advanced statistical 
analysis 
 Machine Learning – let computer programs 
 learn about data they study - make different 
decisions based on the quality of studied data 
 using statistics for fundamental concepts and 
adding more advanced AI heuristics and algorithms 
22 © Copyright 2006, Natasha Balac
Data Mining 
 Adoption of the Machine learning techniques 
to the real world problems 
 Union: Statistics, AI, Machine learning 
 Used to find previously hidden trends or 
patterns 
 Finding increasing acceptance in science and 
business areas which need to analyze large 
amount of data to discover trends which could 
not be found otherwise 
23 © Copyright 2006, Natasha Balac
Terminology 
 Gold Mining (turn data tombs into golden nuggets of knowledge. 
Finding small sets of precious nuggets from a great deal of raw material) 
 Knowledge mining from databases 
 Knowledge extraction 
 Data/pattern analysis 
 Knowledge Discovery Databases or KDD 
 Information harvesting 
 Business intelligence 
24 © Copyright 2006, Natasha Balac
Data Mining: A KDD Process 
 Data mining: the core of 
Pattern Evaluation 
Data Mining 
Data Mining: Concepts and 
Data Warehouse 
2Se5ptember 6, 2014 © Copyright 2006, Natasha Balac 
Techniques 25 
knowledge discovery 
process. 
Data Cleaning 
Data Integration 
Databases 
Task-relevant Data 
Selection and Transformation
Steps of a KDD Process 
 Learning the application domain: 
 relevant prior knowledge and goals of application 
 Creating a target data set: data selection 
 Data cleaning and preprocessing: (remove noise, may take 60% of effort!) 
 Data reduction and transformation: 
 Find useful features, dimensionality/variable reduction, invariant 
representation. 
 Choosing functions of data mining 
 summarization, classification, regression, association, clustering. 
 Choosing the mining algorithm(s) 
 Data mining: search for patterns of interest 
 Pattern evaluation and knowledge presentation 
 visualization, transformation, removing redundant patterns, etc. 
 Use of discovered knowledge 
26 © Copyright 2006, Natasha Balac
Data Mining and Business Intelligence 
Increasing potential 
to support 
business decisions End User 
Paper, Files, Information Providers, Database Systems, OLTP 
September 6, 2014 
Making 
Decisions 
Data Presentation 
Visualization Techniques 
Data Mining 
Information Discovery 
Data Exploration 
Statistical Analysis, Querying and Reporting 
Data Warehouses / Data Marts 
OLAP, MDA 
Data Sources 
Data Mining: Concepts and 
Business 
Analyst 
Data 
Analyst 
DBA 
Techniques 27 
27 © Copyright 2006, Natasha Balac
Architecture of a Typical Data Mining System 
Graphical user interface 
Pattern evaluation 
Data mining engine 
Database or data 
warehouse server 
Data cleaning & data integration Filtering 
Data 
Warehouse 
Data Mining: Concepts and 
S2ep8tember 6, 2014 © Copyright 2006, Natasha Balac 
Techniques 28 
Databases 
Knowledge-base 
Guide search , evaluate 
interestingness in patterns 
Interestingness measures to 
focus search to interesting 
patterns
Data Mining: On What Kind of Data? 
Should be applicable to any kind of data repository, and transient data (data streams) 
Types of databases and information repositories on which data mining can be 
performed 
 Relational databases 
 Data warehouses 
 Transactional databases 
 Advanced DB and information repositories 
 Object-oriented and object-relational databases 
 Spatial databases 
 Time-series data and temporal data 
 Text databases and multimedia databases 
 Heterogeneous and legacy databases 
 WWW 
29 © Copyright 2006, Natasha Balac
LEARNING ALGORITHMS 
 Fundamental idea: 
learn rules/patterns/relationships 
automatically from the data 
30 © Copyright 2006, Natasha Balac
Data Mining Tasks 
 Exploratory Data Analysis 
 Predictive Modeling inference on current data to make predictions 
 Classification and Regression 
 Descriptive Modeling characterise general properties of the data 
 Cluster analysis/segmentation 
 Discovering Patterns and Rules 
 Association/Dependency rules 
 Sequential patterns 
 Temporal sequences 
 Deviation detection 
31 © Copyright 2006, Natasha Balac
Data Mining Tasks 
Data is associated with classes (eg computers, printers) or concepts (eg customer types) 
 Concept/Class description: Characterization and discrimination 
 Generalize, summarize (target class), and contrast data 
characteristics, e.g., dry vs. wet regions (contrasting class) 
Frequent patterns: patterns occurring frequently in the data 
Frequent itemsets: set of items frequently appearing together in transactional data 
 Association (correlation and causality) 
 Multi-dimensional or single-dimensional association 
age(X, “20-29”) ^ income(X, “60-90K”)  buys(X, “TV”) 
age(X, “20..29”) ^ income(X, “20..29K”)  buys(X, “PC”) [support = 2%, confidence = 60%] 
buys(T, “computer”)  buys(T, “software”) [1%, 50%] 
Assoc. rule discarded as uninteresting if not satisfying minimum support threshold and minimum confidence 
threshold 
32 © Copyright 2006, Natasha Balac
Data Mining Tasks 
 Classification 
 Finding models (functions) that describe and distinguish classes or 
concepts for future prediction 
 Example: classify countries based on climate, or classify cars based on 
gas mileage 
Derived model based on analysis of a set of training data (objects of known 
class label) 
 Model may be represented in various forms: 
 If-THEN rules, decision-tree, classification rule, neural network 
 Prediction 
 Models continuous valued function - Predict some unknown or missing 
numerical values ( eg Regression analysis) 
Relevance analysis – identify attributes not contributing to classification/prediction, 
hence can be excluded 
33 © Copyright 2006, Natasha Balac
Data Mining Tasks 
 Cluster analysis 
 Class label is unknown: Group data to form new 
classes, 
 Example: cluster houses to find distribution 
patterns 
 Clustering based on the principle: maximizing the 
intra-class similarity and minimizing the interclass 
similarity 
Each cluster formed may be viewed as a class of objects 
34 © Copyright 2006, Natasha Balac
Data Mining Tasks 
 Outlier analysis 
 Outlier: a data object that does not comply with the general 
behavior of the data 
 Mostly considered as noise or exception, but is quite useful 
in fraud detection, rare events analysis 
 Trend and evolution analysis model regularities or 
trends for objects whose behaviour changes with time 
 Trend and deviation: regression analysis 
 Sequential pattern mining, periodicity analysis 
35 © Copyright 2006, Natasha Balac
Data Mining: Classification Schemes 
 General functionality 
 Descriptive data mining Vs. Predictive data mining 
(DDM – characterise general properties of data. PDM – perform inference on 
data in order to make predictions) 
 Different views - different classifications 
 Kinds of databases to be mined 
 Kinds of knowledge to be discovered 
 Kinds of techniques employed 
36 © Copyright 2006, Natasha Balac 
 Kinds of applications
A Multi-Dimensional View of Data 
Mining Classification 
 Databases to be mined 
 Relational, transactional, object-oriented, object-relational, 
active, spatial, time-series, text, multi-media, 
WWW, etc. 
 Knowledge to be mined 
 Characterization, discrimination, association, 
classification, clustering, trend, deviation and outlier 
analysis, etc. 
 Multiple/integrated functions 
 Mining at multiple levels of abstractions 
37 © Copyright 2006, Natasha Balac
A Multi-Dimensional View of Data 
Mining Classification 
 Techniques utilized 
 Decision/Regression trees, clustering, neural 
networks, etc. 
 Applications adapted 
 Retail, telecom, banking, DNA mining, stock 
market analysis, Web mining 
38 © Copyright 2006, Natasha Balac
Data Mining Applications 
 Science: Chemistry, Physics, Medicine 
 Biochemical analysis 
 Remote sensors on a satellite 
 Telescopes – star galaxy classification 
 Medical Image analysis 
39 © Copyright 2006, Natasha Balac
Data Mining Applications 
 Bioscience 
 Sequence-based analysis 
 Protein structure and function prediction 
 Protein family classification 
 Microarray gene expression 
40 © Copyright 2006, Natasha Balac
Data Mining Applications 
 Pharmaceutical companies, Insurance 
and Health care, Medicine 
 Drug development 
 Identify successful medical therapies 
 Claims analysis, fraudulent behavior 
 Medical diagnostic tools 
 Predict office visits 
41 © Copyright 2006, Natasha Balac
Data Mining Applications 
 Financial Industry, Banks, Businesses, E-commerce 
 Stock and investment analysis 
 Identify loyal customers vs. risky customer 
 Predict customer spending 
 Risk management 
 Sales forecasting 
42 © Copyright 2006, Natasha Balac
Data Mining Applications 
 Retail and Marketing 
 Customer buying patterns/demographic 
characteristics 
 Mailing campaigns 
 Market basket analysis 
 Trend analysis 
43 © Copyright 2006, Natasha Balac
Data Mining Applications 
 Database analysis and decision support 
 Market analysis and management 
 target marketing, customer relation management, market 
basket analysis, cross selling, market segmentation 
 Risk analysis and management 
 Forecasting, customer retention, improved underwriting, 
quality control, competitive analysis 
 Fraud detection and management 
44 © Copyright 2006, Natasha Balac
Data Mining Applications 
 Sports and Entertainment 
 IBM Advanced Scout analyzed NBA game 
statistics (shots blocked, assists, and fouls) to 
gain competitive advantage for New York 
Knicks and Miami Heat 
 Astronomy 
 JPL and the Palomar Observatory discovered 
22 quasars with the help of data mining 
45 © Copyright 2006, Natasha Balac
DATA MINING EXAMPLES 
 Grocery store 
 NBA 
 Banking and Credit Card scoring 
 Fraud detection 
 Personalization & Customer Profiling 
 Campaign Management and Database 
Marketing 
46 © Copyright 2006, Natasha Balac
Data mining at work: 
Case study 1 
47 © Copyright 2006, Natasha Balac
Processing Loan Applications 
 Given: questionnaire with financial and personal 
information 
 Problem: should money be lent? 
 Borderline cases referred to loan officers 
 But: 50% of accepted borderline cases defaulted! 
 Solution: 
 reject all borderline cases? 
 Borderline cases are most active customers! 
48 © Copyright 2006, Natasha Balac
Enter Machine Learning 
 Given: 
 1000 training examples of borderline cases 
 20 attributes: 
 age, years with current employer,years at current address, 
years with the bank, years at current job, other credit cards 
 Learned rules predicted 2/3 of borderline cases 
correctly! 
 Rules could be used to explain decisions to customers 
49 © Copyright 2006, Natasha Balac
Case study 2: 
Screening images 
 Given: 
 radar satellite images of coastal waters 
 Problem: 
 detecting oil slicks in those images 
 Oil slicks = dark regions with changing size 
and shape 
 Look-alike dark regions can be caused by 
weather conditions (e.g. high wind) 
 Expensive process requiring highly trained 
personnel 
50 © Copyright 2006, Natasha Balac
Enter Machine Learning 
 Dark regions extracted from normalized image 
 Attributes: 
 size of region, shape, area, intensity, sharpness 
and jaggedness of boundaries, proximity of other 
regions, info about background 
 Constraints: 
 Scarcity of training examples (oil slicks are rare!) 
 Unbalanced data: most dark regions aren’t oil 
slicks 
 Regions from same image form a batch 
 Requirement is adjustable false-alarm rate 
51 © Copyright 2006, Natasha Balac
Data Mining Challenges 
 Computationally expensive to investigate all 
possibilities 
 Dealing with noise/missing information and 
errors in data 
 Choosing appropriate attributes/input 
representation 
 Finding the minimal attribute space 
 Finding adequate evaluation function(s) 
 Extracting meaningful information 
 Not overfitting 
52 © Copyright 2006, Natasha Balac
Are All the “Discovered” Patterns 
Interesting? 
DM system may generate thousands/millions of patterns/rules 
 Interestingness measures: A pattern is interesting if it is easily understood 
by humans, valid on new or test data with some degree of certainty, 
potentially useful, novel, or validates some hypothesis that a user seeks to 
confirm 
support: %age of transactions the given rule satisfies 
support( X => Y ) = P ( X U Y ) 
confidence: degree of certainty of detected association 
confidence(X => Y ) = P ( Y | X ) 
53 © Copyright 2006, Natasha Balac
Are All the “Discovered” Patterns 
Interesting? 
 Objective vs. subjective measures: 
 Objective: based on statistics and structures of 
patterns 
 support and confidence 
 Subjective: based on user’s belief in the data 
 unexpectedness, novelty, actionability, confirm 
hypothesis user wishes to validate or resemble hunch 
54 © Copyright 2006, Natasha Balac
Can We Find All and Only Interesting 
Patterns? 
An Optimisation problem – completeness of DM algo 
 Completeness - Find all the interesting 
patterns 
 Can a data mining system find all the interesting 
patterns? 
 Association vs. classification vs. clustering 
55 © Copyright 2006, Natasha Balac
Can We Find All and Only Interesting 
Patterns? 
 Optimization - Search for only interesting patterns 
 Can a data mining system find only the interesting 
patterns? 
 Approaches 
 First generate all the patterns and then filter out the 
uninteresting ones 
 Mining query optimization (improve search efficiency by pruning 
away subsets of the pattern space not satisfying interestingness constraints) 
56 © Copyright 2006, Natasha Balac
Major Issues in Data Mining 
 Mining methodology and user interaction 
 Mining different kinds of knowledge in databases 
 Incorporation of background knowledge allow discovered 
patterns to be expressed in concise terms at different levels of abstraction 
 Handling noise and incomplete data may confuse the process 
 Pattern evaluation: the interestingness problem 
 Expression and visualization of data mining results 
in expressive forms so that knowledge is easily understood, usable by humans 
57 © Copyright 2006, Natasha Balac
Major Issues in Data Mining 
 Performance and scalability huge data volumes 
 Efficiency of data mining algorithms 
 Parallel, distributed and incremental mining 
methods 
 Issues relating to the diversity of data types 
 Handling relational and complex types of data 
 Mining information from diverse databases 
Unrealistic to expect one system to mine all kinds of data 
58 © Copyright 2006, Natasha Balac
Major Issues in Data Mining 
 Issues related to applications and social 
impacts 
 Application of discovered knowledge 
 Domain-specific data mining tools 
 Intelligent query answering 
 Expert systems 
 Process control and decision making 
 A knowledge fusion problem 
 Protection of data security, integrity, and privacy 
59 © Copyright 2006, Natasha Balac
Summary 
 Data mining: discovering interesting patterns 
from large amounts of data 
 A KDD process includes data cleaning, data 
integration, data selection, transformation, data 
mining, pattern evaluation, and knowledge 
presentation 
60 © Copyright 2006, Natasha Balac
Summary 
 Mining can be performed in a variety of 
information repositories 
 Data mining functionalities: characterization, 
association, classification, clustering, outlier 
and trend analysis, etc. 
 Classification of data mining systems 
 Major issues in data mining 
61 © Copyright 2006, Natasha Balac
Kinds of Data Mining 
 Decision Tree Learning 
 Clustering 
 Neural Networks 
 Association Rules 
 Support Vector Machines 
 Genetic Algorithms 
 Nearest Neighbor Method 
62 © Copyright 2006, Natasha Balac
Decision Tree Example 
Grandparents 
A lot 
A little 
63 © Copyright 2006, Natasha Balac
DECISION TREE FOR THE CONCEPT 
“Play Tennis” 
Day Outlook Temp Humidity Wind PlayTennis 
D1 Sunny Hot High Weak No 
D2 Sunny Hot High Strong No 
D3 Overcast Hot High Weak Yes 
D4 Rain Mild High Weak Yes 
D5 Rain Cool Normal Weak Yes 
D6 Rain Cool Normal Strong No 
D7 Overcast Cool Normal Strong Yes 
D8 Sunny Mild High Weak No 
D9 Sunny Cool Normal Weak Yes 
D10 Rain Mild Normal Weak Yes 
D11 Sunny Mild Normal Strong Yes 
D12 Overcast Mild High Strong Yes 
D13 Overcast Hot Normal Weak Yes 
Mitchell, 1D99174 Rain Mild High Strong No 
64 © Copyright 2006, Natasha Balac
DECISION TREE FOR THE CONCEPT 
“Play Tennis” 
[Mitchell,1997] 
65 © Copyright 2006, Natasha Balac

More Related Content

What's hot

Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Data mining-2
Data mining-2Data mining-2
Data mining-2Nit Hik
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378nitttin
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_publicAttila Barta
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligencehktripathy
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Houw Liong The
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 

What's hot (20)

Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Data mining-2
Data mining-2Data mining-2
Data mining-2
 
Introduction data mining
Introduction data miningIntroduction data mining
Introduction data mining
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Data mining
Data miningData mining
Data mining
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Introduction to DataMining
Introduction to DataMiningIntroduction to DataMining
Introduction to DataMining
 
Data mining 1
Data mining 1Data mining 1
Data mining 1
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Dwdm
DwdmDwdm
Dwdm
 
Data miningppt378
Data miningppt378Data miningppt378
Data miningppt378
 
INF2190_W1_2016_public
INF2190_W1_2016_publicINF2190_W1_2016_public
INF2190_W1_2016_public
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligence
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
 
Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 
Ch 1 intro_dw
Ch 1 intro_dwCh 1 intro_dw
Ch 1 intro_dw
 

Viewers also liked

Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data miningDevakumar Jain
 
Tips On Becoming A Better Runner
Tips On Becoming A Better RunnerTips On Becoming A Better Runner
Tips On Becoming A Better RunnerElena Merchand
 
Presentatie 8 oktober partnerbijeenkomst gemeenten en afvaldiensten
Presentatie 8 oktober partnerbijeenkomst gemeenten en afvaldienstenPresentatie 8 oktober partnerbijeenkomst gemeenten en afvaldiensten
Presentatie 8 oktober partnerbijeenkomst gemeenten en afvaldienstenWEEE Nederland
 
Geena for merge
Geena   for merge Geena   for merge
Geena for merge geenadon
 
Summer Workout Tips | Elena Merchand
Summer Workout Tips | Elena Merchand Summer Workout Tips | Elena Merchand
Summer Workout Tips | Elena Merchand Elena Merchand
 
Ch.1/L1 - Italy: the Birthplace of the Renaissance
Ch.1/L1 - Italy: the Birthplace of the RenaissanceCh.1/L1 - Italy: the Birthplace of the Renaissance
Ch.1/L1 - Italy: the Birthplace of the Renaissancecalebgunnels
 
Green king game review
Green king game reviewGreen king game review
Green king game reviewWEEE Nederland
 
Presenting the fresh, modern style of Hodgson Design Associates
Presenting the fresh, modern style of Hodgson Design AssociatesPresenting the fresh, modern style of Hodgson Design Associates
Presenting the fresh, modern style of Hodgson Design AssociatesHDA-Vancouver
 
Avant projet de constitution de la Ve République du Burkina
Avant projet de constitution de la Ve République du BurkinaAvant projet de constitution de la Ve République du Burkina
Avant projet de constitution de la Ve République du BurkinaAboubakar Sanfo
 
County of brant police department interview questions copy
County of brant police department interview questions   copyCounty of brant police department interview questions   copy
County of brant police department interview questions copyselinasimpson69
 

Viewers also liked (19)

Knowledge discovery thru data mining
Knowledge discovery thru data miningKnowledge discovery thru data mining
Knowledge discovery thru data mining
 
Kdd process
Kdd processKdd process
Kdd process
 
فضائل الحج والعمرة - د جاد المولى عبد العزيز
فضائل الحج والعمرة - د جاد المولى عبد العزيزفضائل الحج والعمرة - د جاد المولى عبد العزيز
فضائل الحج والعمرة - د جاد المولى عبد العزيز
 
So tay ket cau
So tay ket cauSo tay ket cau
So tay ket cau
 
Beginner Yoga Poses
Beginner Yoga PosesBeginner Yoga Poses
Beginner Yoga Poses
 
Tips On Becoming A Better Runner
Tips On Becoming A Better RunnerTips On Becoming A Better Runner
Tips On Becoming A Better Runner
 
Anaswara
AnaswaraAnaswara
Anaswara
 
Presentatie 8 oktober partnerbijeenkomst gemeenten en afvaldiensten
Presentatie 8 oktober partnerbijeenkomst gemeenten en afvaldienstenPresentatie 8 oktober partnerbijeenkomst gemeenten en afvaldiensten
Presentatie 8 oktober partnerbijeenkomst gemeenten en afvaldiensten
 
Geena for merge
Geena   for merge Geena   for merge
Geena for merge
 
Format Teks HTML
Format Teks HTMLFormat Teks HTML
Format Teks HTML
 
Summer Workout Tips | Elena Merchand
Summer Workout Tips | Elena Merchand Summer Workout Tips | Elena Merchand
Summer Workout Tips | Elena Merchand
 
Ch.1/L1 - Italy: the Birthplace of the Renaissance
Ch.1/L1 - Italy: the Birthplace of the RenaissanceCh.1/L1 - Italy: the Birthplace of the Renaissance
Ch.1/L1 - Italy: the Birthplace of the Renaissance
 
Uji zat makanan
Uji zat makanan Uji zat makanan
Uji zat makanan
 
Green king game review
Green king game reviewGreen king game review
Green king game review
 
Presenting the fresh, modern style of Hodgson Design Associates
Presenting the fresh, modern style of Hodgson Design AssociatesPresenting the fresh, modern style of Hodgson Design Associates
Presenting the fresh, modern style of Hodgson Design Associates
 
Hand tool kelompok 7
Hand tool   kelompok 7Hand tool   kelompok 7
Hand tool kelompok 7
 
Group work
Group workGroup work
Group work
 
Avant projet de constitution de la Ve République du Burkina
Avant projet de constitution de la Ve République du BurkinaAvant projet de constitution de la Ve République du Burkina
Avant projet de constitution de la Ve République du Burkina
 
County of brant police department interview questions copy
County of brant police department interview questions   copyCounty of brant police department interview questions   copy
County of brant police department interview questions copy
 

Similar to DM Lecture 2

Similar to DM Lecture 2 (20)

01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
Data Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notesData Mining mod1 ppt.pdf bca sixth semester notes
Data Mining mod1 ppt.pdf bca sixth semester notes
 
Unit 1
Unit 1Unit 1
Unit 1
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Dbm630_lecture01
Dbm630_lecture01Dbm630_lecture01
Dbm630_lecture01
 
Dbm630 Lecture01
Dbm630 Lecture01Dbm630 Lecture01
Dbm630 Lecture01
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
Unit 1 (Chapter-1) on data mining concepts.ppt
Unit 1 (Chapter-1) on data mining concepts.pptUnit 1 (Chapter-1) on data mining concepts.ppt
Unit 1 (Chapter-1) on data mining concepts.ppt
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
 
data mining
data miningdata mining
data mining
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
Introduction
IntroductionIntroduction
Introduction
 
Dm lecture1
Dm lecture1Dm lecture1
Dm lecture1
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and Techniques
 

Recently uploaded

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Servicejennyeacort
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts ServiceCall Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
Call Girls In Noida City Center Metro 24/7✡️9711147426✡️ Escorts Service
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

DM Lecture 2

  • 1. 1 Introduction to Data Mining Credit: Natasha Balac, Ph.D.
  • 2. Outline  Motivation: Why Data Mining?  What is Data Mining?  History of Data Mining  Data Mining Functionality and Terminology  Data Mining Applications  Are all the Patterns Interesting?  Issues in Data Mining 2 © Copyright 2006, Natasha Balac
  • 3. Evolution of Database Technology  1960s:  Data collection, database creation, IMS and network DBMS  1970s:  Relational data model, relational DBMS implementation  1980s:  RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)  1990s—2000s:  Data mining and data warehousing, multimedia databases, and Web databases 3 © Copyright 2006, Natasha Balac
  • 4. Necessity is the Mother of Invention  Data explosion  Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories  We are drowning in data, but starving for knowledge! 4 © Copyright 2006, Natasha Balac
  • 5. Necessity is the Mother of Invention  We are drowning in data, but starving for knowledge!  Solution - Data Mining  Data Warehousing and online analytical processing  Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases 5 © Copyright 2006, Natasha Balac
  • 6. Data Mining Main Objectives  Identification of data as a source of useful information  Use of discovered information for competitive advantages when working in business environment 6 © Copyright 2006, Natasha Balac
  • 7. Why DATA MINING?  Huge amounts of data  Electronic records of our decisions  Choices in the supermarket  Financial records  Our comings and goings  We swipe our way through the world – every swipe is a record in a database  Data rich – but information poor  Lying hidden in all this data is information! 7 © Copyright 2006, Natasha Balac
  • 8. Data vs. Information  Society produces massive amounts of data  business, science, medicine, economics, sports, …  Potentially valuable resource  Raw data is useless  need techniques to automatically extract information  Data: recorded facts (as in databases)  Information: patterns underlying the data  Patterns must be discovered automatically Data – Information - Knowledge 8 © Copyright 2006, Natasha Balac
  • 9. What is DATA MINING?  Extracting or “mining” knowledge from large amounts of data  Data -driven discovery and modeling of hidden patterns (we never new existed) in large volumes of data  Process for Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data 9 © Copyright 2006, Natasha Balac
  • 10. What Is Data Mining?  Data mining:  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases Alternative names: Data mining: a misnomer? Knowledge discovery(mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. 10 © Copyright 2006, Natasha Balac
  • 11. Data Mining is NOT  Data Warehousing  (Deductive) query processing  SQL/ Reporting  Software Agents  Expert Systems  Online Analytical Processing (OLAP)  Statistical Analysis Tool  Data visualization 11 © Copyright 2006, Natasha Balac
  • 12. Data Mining  Programs that detect patterns and rules in the data  Strong patterns can be used to make non-trivial predictions on new data 12 © Copyright 2006, Natasha Balac
  • 13. Data Mining Challenges  Problem 1: most patterns are not interesting  Problem 2: patterns may be inexact or completely spurious when noisy data present 13 © Copyright 2006, Natasha Balac
  • 14. Machine Learning Techniques  Technical basis for data mining: algorithms for acquiring structural descriptions from examples  Methods originate from artificial intelligence, statistics, and research on databases 14 © Copyright 2006, Natasha Balac
  • 15. Machine Learning Techniques  Structural descriptions represent patterns explicitly can be used to  predict outcome in new situation  understand and explain how prediction is derived (maybe even more important) 15 © Copyright 2006, Natasha Balac
  • 16. Multidisciplinary Field Database Technology Statistics Data Mining Visualization Other Disciplines Machine Learning Artificial Intelligence 16 © Copyright 2006, Natasha Balac
  • 17. Multidisciplinary Field  Database technology  Artificial Intelligence  Machine Learning including Neural Networks  Statistics  Pattern recognition  Knowledge-based systems/acquisition  High-performance computing  Data visualization 17 © Copyright 2006, Natasha Balac
  • 18. History of Data Mining 18 © Copyright 2006, Natasha Balac
  • 19. History  Emerged late 1980s  Flourished –1990s  Roots traced back along three family lines  Classical Statistics  Artificial Intelligence  Machine Learning 19 © Copyright 2006, Natasha Balac
  • 20. Statistics  Foundation of most DM technologies  Regression analysis, standard distribution/deviation/variance, cluster analysis, confidence intervals  Building blocks  Significant role in today’s data mining – but alone is not powerful enough 20 © Copyright 2006, Natasha Balac
  • 21. Artificial Intelligence  Heuristics vs. Statistics  Human-thought-like processing  Requires vast computer processing power  Supercomputers 21 © Copyright 2006, Natasha Balac
  • 22. Machine Learning  Union of statistics and AI  Blends AI heuristics with advanced statistical analysis  Machine Learning – let computer programs  learn about data they study - make different decisions based on the quality of studied data  using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms 22 © Copyright 2006, Natasha Balac
  • 23. Data Mining  Adoption of the Machine learning techniques to the real world problems  Union: Statistics, AI, Machine learning  Used to find previously hidden trends or patterns  Finding increasing acceptance in science and business areas which need to analyze large amount of data to discover trends which could not be found otherwise 23 © Copyright 2006, Natasha Balac
  • 24. Terminology  Gold Mining (turn data tombs into golden nuggets of knowledge. Finding small sets of precious nuggets from a great deal of raw material)  Knowledge mining from databases  Knowledge extraction  Data/pattern analysis  Knowledge Discovery Databases or KDD  Information harvesting  Business intelligence 24 © Copyright 2006, Natasha Balac
  • 25. Data Mining: A KDD Process  Data mining: the core of Pattern Evaluation Data Mining Data Mining: Concepts and Data Warehouse 2Se5ptember 6, 2014 © Copyright 2006, Natasha Balac Techniques 25 knowledge discovery process. Data Cleaning Data Integration Databases Task-relevant Data Selection and Transformation
  • 26. Steps of a KDD Process  Learning the application domain:  relevant prior knowledge and goals of application  Creating a target data set: data selection  Data cleaning and preprocessing: (remove noise, may take 60% of effort!)  Data reduction and transformation:  Find useful features, dimensionality/variable reduction, invariant representation.  Choosing functions of data mining  summarization, classification, regression, association, clustering.  Choosing the mining algorithm(s)  Data mining: search for patterns of interest  Pattern evaluation and knowledge presentation  visualization, transformation, removing redundant patterns, etc.  Use of discovered knowledge 26 © Copyright 2006, Natasha Balac
  • 27. Data Mining and Business Intelligence Increasing potential to support business decisions End User Paper, Files, Information Providers, Database Systems, OLTP September 6, 2014 Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources Data Mining: Concepts and Business Analyst Data Analyst DBA Techniques 27 27 © Copyright 2006, Natasha Balac
  • 28. Architecture of a Typical Data Mining System Graphical user interface Pattern evaluation Data mining engine Database or data warehouse server Data cleaning & data integration Filtering Data Warehouse Data Mining: Concepts and S2ep8tember 6, 2014 © Copyright 2006, Natasha Balac Techniques 28 Databases Knowledge-base Guide search , evaluate interestingness in patterns Interestingness measures to focus search to interesting patterns
  • 29. Data Mining: On What Kind of Data? Should be applicable to any kind of data repository, and transient data (data streams) Types of databases and information repositories on which data mining can be performed  Relational databases  Data warehouses  Transactional databases  Advanced DB and information repositories  Object-oriented and object-relational databases  Spatial databases  Time-series data and temporal data  Text databases and multimedia databases  Heterogeneous and legacy databases  WWW 29 © Copyright 2006, Natasha Balac
  • 30. LEARNING ALGORITHMS  Fundamental idea: learn rules/patterns/relationships automatically from the data 30 © Copyright 2006, Natasha Balac
  • 31. Data Mining Tasks  Exploratory Data Analysis  Predictive Modeling inference on current data to make predictions  Classification and Regression  Descriptive Modeling characterise general properties of the data  Cluster analysis/segmentation  Discovering Patterns and Rules  Association/Dependency rules  Sequential patterns  Temporal sequences  Deviation detection 31 © Copyright 2006, Natasha Balac
  • 32. Data Mining Tasks Data is associated with classes (eg computers, printers) or concepts (eg customer types)  Concept/Class description: Characterization and discrimination  Generalize, summarize (target class), and contrast data characteristics, e.g., dry vs. wet regions (contrasting class) Frequent patterns: patterns occurring frequently in the data Frequent itemsets: set of items frequently appearing together in transactional data  Association (correlation and causality)  Multi-dimensional or single-dimensional association age(X, “20-29”) ^ income(X, “60-90K”)  buys(X, “TV”) age(X, “20..29”) ^ income(X, “20..29K”)  buys(X, “PC”) [support = 2%, confidence = 60%] buys(T, “computer”)  buys(T, “software”) [1%, 50%] Assoc. rule discarded as uninteresting if not satisfying minimum support threshold and minimum confidence threshold 32 © Copyright 2006, Natasha Balac
  • 33. Data Mining Tasks  Classification  Finding models (functions) that describe and distinguish classes or concepts for future prediction  Example: classify countries based on climate, or classify cars based on gas mileage Derived model based on analysis of a set of training data (objects of known class label)  Model may be represented in various forms:  If-THEN rules, decision-tree, classification rule, neural network  Prediction  Models continuous valued function - Predict some unknown or missing numerical values ( eg Regression analysis) Relevance analysis – identify attributes not contributing to classification/prediction, hence can be excluded 33 © Copyright 2006, Natasha Balac
  • 34. Data Mining Tasks  Cluster analysis  Class label is unknown: Group data to form new classes,  Example: cluster houses to find distribution patterns  Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity Each cluster formed may be viewed as a class of objects 34 © Copyright 2006, Natasha Balac
  • 35. Data Mining Tasks  Outlier analysis  Outlier: a data object that does not comply with the general behavior of the data  Mostly considered as noise or exception, but is quite useful in fraud detection, rare events analysis  Trend and evolution analysis model regularities or trends for objects whose behaviour changes with time  Trend and deviation: regression analysis  Sequential pattern mining, periodicity analysis 35 © Copyright 2006, Natasha Balac
  • 36. Data Mining: Classification Schemes  General functionality  Descriptive data mining Vs. Predictive data mining (DDM – characterise general properties of data. PDM – perform inference on data in order to make predictions)  Different views - different classifications  Kinds of databases to be mined  Kinds of knowledge to be discovered  Kinds of techniques employed 36 © Copyright 2006, Natasha Balac  Kinds of applications
  • 37. A Multi-Dimensional View of Data Mining Classification  Databases to be mined  Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, WWW, etc.  Knowledge to be mined  Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.  Multiple/integrated functions  Mining at multiple levels of abstractions 37 © Copyright 2006, Natasha Balac
  • 38. A Multi-Dimensional View of Data Mining Classification  Techniques utilized  Decision/Regression trees, clustering, neural networks, etc.  Applications adapted  Retail, telecom, banking, DNA mining, stock market analysis, Web mining 38 © Copyright 2006, Natasha Balac
  • 39. Data Mining Applications  Science: Chemistry, Physics, Medicine  Biochemical analysis  Remote sensors on a satellite  Telescopes – star galaxy classification  Medical Image analysis 39 © Copyright 2006, Natasha Balac
  • 40. Data Mining Applications  Bioscience  Sequence-based analysis  Protein structure and function prediction  Protein family classification  Microarray gene expression 40 © Copyright 2006, Natasha Balac
  • 41. Data Mining Applications  Pharmaceutical companies, Insurance and Health care, Medicine  Drug development  Identify successful medical therapies  Claims analysis, fraudulent behavior  Medical diagnostic tools  Predict office visits 41 © Copyright 2006, Natasha Balac
  • 42. Data Mining Applications  Financial Industry, Banks, Businesses, E-commerce  Stock and investment analysis  Identify loyal customers vs. risky customer  Predict customer spending  Risk management  Sales forecasting 42 © Copyright 2006, Natasha Balac
  • 43. Data Mining Applications  Retail and Marketing  Customer buying patterns/demographic characteristics  Mailing campaigns  Market basket analysis  Trend analysis 43 © Copyright 2006, Natasha Balac
  • 44. Data Mining Applications  Database analysis and decision support  Market analysis and management  target marketing, customer relation management, market basket analysis, cross selling, market segmentation  Risk analysis and management  Forecasting, customer retention, improved underwriting, quality control, competitive analysis  Fraud detection and management 44 © Copyright 2006, Natasha Balac
  • 45. Data Mining Applications  Sports and Entertainment  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat  Astronomy  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining 45 © Copyright 2006, Natasha Balac
  • 46. DATA MINING EXAMPLES  Grocery store  NBA  Banking and Credit Card scoring  Fraud detection  Personalization & Customer Profiling  Campaign Management and Database Marketing 46 © Copyright 2006, Natasha Balac
  • 47. Data mining at work: Case study 1 47 © Copyright 2006, Natasha Balac
  • 48. Processing Loan Applications  Given: questionnaire with financial and personal information  Problem: should money be lent?  Borderline cases referred to loan officers  But: 50% of accepted borderline cases defaulted!  Solution:  reject all borderline cases?  Borderline cases are most active customers! 48 © Copyright 2006, Natasha Balac
  • 49. Enter Machine Learning  Given:  1000 training examples of borderline cases  20 attributes:  age, years with current employer,years at current address, years with the bank, years at current job, other credit cards  Learned rules predicted 2/3 of borderline cases correctly!  Rules could be used to explain decisions to customers 49 © Copyright 2006, Natasha Balac
  • 50. Case study 2: Screening images  Given:  radar satellite images of coastal waters  Problem:  detecting oil slicks in those images  Oil slicks = dark regions with changing size and shape  Look-alike dark regions can be caused by weather conditions (e.g. high wind)  Expensive process requiring highly trained personnel 50 © Copyright 2006, Natasha Balac
  • 51. Enter Machine Learning  Dark regions extracted from normalized image  Attributes:  size of region, shape, area, intensity, sharpness and jaggedness of boundaries, proximity of other regions, info about background  Constraints:  Scarcity of training examples (oil slicks are rare!)  Unbalanced data: most dark regions aren’t oil slicks  Regions from same image form a batch  Requirement is adjustable false-alarm rate 51 © Copyright 2006, Natasha Balac
  • 52. Data Mining Challenges  Computationally expensive to investigate all possibilities  Dealing with noise/missing information and errors in data  Choosing appropriate attributes/input representation  Finding the minimal attribute space  Finding adequate evaluation function(s)  Extracting meaningful information  Not overfitting 52 © Copyright 2006, Natasha Balac
  • 53. Are All the “Discovered” Patterns Interesting? DM system may generate thousands/millions of patterns/rules  Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm support: %age of transactions the given rule satisfies support( X => Y ) = P ( X U Y ) confidence: degree of certainty of detected association confidence(X => Y ) = P ( Y | X ) 53 © Copyright 2006, Natasha Balac
  • 54. Are All the “Discovered” Patterns Interesting?  Objective vs. subjective measures:  Objective: based on statistics and structures of patterns  support and confidence  Subjective: based on user’s belief in the data  unexpectedness, novelty, actionability, confirm hypothesis user wishes to validate or resemble hunch 54 © Copyright 2006, Natasha Balac
  • 55. Can We Find All and Only Interesting Patterns? An Optimisation problem – completeness of DM algo  Completeness - Find all the interesting patterns  Can a data mining system find all the interesting patterns?  Association vs. classification vs. clustering 55 © Copyright 2006, Natasha Balac
  • 56. Can We Find All and Only Interesting Patterns?  Optimization - Search for only interesting patterns  Can a data mining system find only the interesting patterns?  Approaches  First generate all the patterns and then filter out the uninteresting ones  Mining query optimization (improve search efficiency by pruning away subsets of the pattern space not satisfying interestingness constraints) 56 © Copyright 2006, Natasha Balac
  • 57. Major Issues in Data Mining  Mining methodology and user interaction  Mining different kinds of knowledge in databases  Incorporation of background knowledge allow discovered patterns to be expressed in concise terms at different levels of abstraction  Handling noise and incomplete data may confuse the process  Pattern evaluation: the interestingness problem  Expression and visualization of data mining results in expressive forms so that knowledge is easily understood, usable by humans 57 © Copyright 2006, Natasha Balac
  • 58. Major Issues in Data Mining  Performance and scalability huge data volumes  Efficiency of data mining algorithms  Parallel, distributed and incremental mining methods  Issues relating to the diversity of data types  Handling relational and complex types of data  Mining information from diverse databases Unrealistic to expect one system to mine all kinds of data 58 © Copyright 2006, Natasha Balac
  • 59. Major Issues in Data Mining  Issues related to applications and social impacts  Application of discovered knowledge  Domain-specific data mining tools  Intelligent query answering  Expert systems  Process control and decision making  A knowledge fusion problem  Protection of data security, integrity, and privacy 59 © Copyright 2006, Natasha Balac
  • 60. Summary  Data mining: discovering interesting patterns from large amounts of data  A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation 60 © Copyright 2006, Natasha Balac
  • 61. Summary  Mining can be performed in a variety of information repositories  Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.  Classification of data mining systems  Major issues in data mining 61 © Copyright 2006, Natasha Balac
  • 62. Kinds of Data Mining  Decision Tree Learning  Clustering  Neural Networks  Association Rules  Support Vector Machines  Genetic Algorithms  Nearest Neighbor Method 62 © Copyright 2006, Natasha Balac
  • 63. Decision Tree Example Grandparents A lot A little 63 © Copyright 2006, Natasha Balac
  • 64. DECISION TREE FOR THE CONCEPT “Play Tennis” Day Outlook Temp Humidity Wind PlayTennis D1 Sunny Hot High Weak No D2 Sunny Hot High Strong No D3 Overcast Hot High Weak Yes D4 Rain Mild High Weak Yes D5 Rain Cool Normal Weak Yes D6 Rain Cool Normal Strong No D7 Overcast Cool Normal Strong Yes D8 Sunny Mild High Weak No D9 Sunny Cool Normal Weak Yes D10 Rain Mild Normal Weak Yes D11 Sunny Mild Normal Strong Yes D12 Overcast Mild High Strong Yes D13 Overcast Hot Normal Weak Yes Mitchell, 1D99174 Rain Mild High Strong No 64 © Copyright 2006, Natasha Balac
  • 65. DECISION TREE FOR THE CONCEPT “Play Tennis” [Mitchell,1997] 65 © Copyright 2006, Natasha Balac