Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 1 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 1. Introduction
 Why Data Mining?
 What Is Data Mining?
 A Multi-Dimensional View of Data Mining
 What Kind of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 What Technologies Are Used?
 What Kinds of Applications Are Targeted?
 Major Issues in Data Mining
 A Brief History of Data Mining and Data Mining Society
 Summary
Why Data Mining?
 The Explosive Growth of Data: from terabytes to petabytes
 Data collection and data availability
 Automated data collection tools, database systems, Web,
computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
 Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
 Data collected and stored at
enormous speeds (GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 The Large Hadron Collider generates 40 TB/sec
 scientific simulations
generating terabytes of data
 Traditional techniques infeasible for raw data
 Data mining may help scientists
 in classifying and segmenting data
 in Hypothesis Formation
Mining Large Data Sets - Motivation
 There is often information “hidden” in the data that is not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999; vertical axis from 0 to 4,000,000]
Evolution of Sciences
 Before 1600, empirical science
 1600-1950s, theoretical science
 Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
 1950s-1990s, computational science
 Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
 Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
 1990-now, data science
 The flood of data from new scientific instruments and simulations
 The ability to economically store and manage petabytes of data online
 The Internet and computing Grid that make all these archives universally accessible
 Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
 Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
 1960s:
 Data collection, database creation, IMS and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems
What Is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
Knowledge Discovery (KDD) Process
 This is a view from typical
database systems and data
warehousing communities
 Data mining plays an essential
role in the knowledge discovery
process
[Figure: KDD process flow. Databases --> Data Cleaning and Data Integration --> Data Warehouse --> Selection --> Task-relevant Data --> Data Mining --> Pattern Evaluation]
Origins of Data Mining
 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
 Traditional techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data
[Figure: Data Mining at the overlap of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
Multi-Dimensional View of Data Mining
 Data to be mined
 Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social
and information networks
 Knowledge to be mined (or: Data mining functions)
 Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
 Descriptive vs. predictive data mining
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Data Mining Tasks
 Prediction Methods
 Use some variables to predict unknown or future
values of other variables.
 Description Methods
 Find human-interpretable patterns that describe the
data.
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Introduction to Data Mining, 2nd Edition
01/17/2018
Data Mining Tasks …
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
[Figure: Data Mining Tasks, with the table above labeled “Data”]
Predictive Modeling: Classification
 Find a model for class attribute as a function of the values of other attributes
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …
Model for predicting credit worthiness (decision tree; Class is the leaf label):
Employed
  No --> Class: No
  Yes --> Education
    Graduate --> Number of years
      > 3 yr --> Class: Yes
      < 3 yr --> Class: No
    { High school, Undergrad } --> Number of years
      > 7 yrs --> Class: Yes
      < 7 yrs --> Class: No
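The credit-worthiness model on this slide is a decision tree that tests Employed first, then Education, then the number of years at the present address. It can be written as a small function. This is a minimal sketch: the function name `predict_credit_worthy` and the argument encoding are hypothetical, and because the slide does not say what happens when the number of years is exactly 3 or 7, the sketch assumes a strict "greater than" test.

```python
def predict_credit_worthy(employed: bool, education: str, years: int) -> str:
    """Walk the decision tree from the slide and return "Yes" or "No"."""
    if not employed:
        return "No"
    # Graduates need more than 3 years at the present address, everyone else
    # more than 7. Treating the boundary value as "No" is an assumption.
    threshold = 3 if education == "Graduate" else 7
    return "Yes" if years > threshold else "No"


# The four training records shown on the slide, with their known labels.
training = [
    (True, "Graduate", 5, "Yes"),
    (True, "High School", 2, "No"),
    (False, "Undergrad", 1, "No"),
    (True, "High School", 10, "Yes"),
]

for employed, education, years, label in training:
    predicted = predict_credit_worthy(employed, education, years)
    print(employed, education, years, "predicted:", predicted, "actual:", label)
```

The tree reproduces the label of every training record listed on the slide, which is the least one would expect of a model learned from them.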
Classification Example
Training Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …
Test Set
Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …
[Figure: Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set]
Examples of Classification Task
 Classifying credit card transactions as legitimate or fraudulent
 Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
 Categorizing news stories as finance, weather, entertainment, sports, etc.
 Identifying intruders in cyberspace
 Predicting tumor cells as benign or malignant
 Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Classification: Application 1
 Fraud Detection
 Goal: Predict fraudulent cases in credit card
transactions.
 Approach:
 Use credit card transactions and the information on the account holder as attributes.
 When does a customer buy, what do they buy, how often do they pay on time, etc.
 Label past transactions as fraud or fair
transactions. This forms the class attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit
card transactions on an account.
Classification: Application 2
 Churn prediction for telephone customers
 Goal: To predict whether a customer is likely to be lost
to a competitor.
 Approach:
 Use detailed records of transactions with each of the past and present customers to find attributes.
 How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
 Label the customers as loyal or disloyal.
 Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Classification: Application 3
 Sky Survey Cataloging
 Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic
survey images (from Palomar Observatory).
 3000 images with 23,040 x 23,040 pixels per image.
 Approach:
 Segment the image.
 Measure image attributes (features) - 40 of them per
object.
 Model the class based on these features.
 Success Story: Could find 16 new high red-shift
quasars, some of the farthest objects that are
difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Classifying Galaxies
Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features,
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Courtesy: http://aps.umn.edu
Regression
 Predict a value of a given continuous valued variable
based on the values of other variables, assuming a linear
or nonlinear model of dependency.
 Extensively studied in statistics and neural network fields.
 Examples:
 Predicting sales amounts of a new product based on advertising expenditure.
 Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
 Time series prediction of stock market indices.
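For the linear case, the dependency can be fitted with ordinary least squares. The sketch below is illustrative only: the helper `fit_line` is a hypothetical name, and the advertising and sales figures are made-up numbers, not data from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up numbers for illustration: advertising expenditure vs. sales.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, sales)
# Predict the continuous target for an expenditure not seen before.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

A nonlinear model of dependency would replace the straight line with another function family; the idea of fitting parameters to known values and then predicting a continuous value stays the same.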
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering
 Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: clusters of points; intra-cluster distances are minimized, inter-cluster distances are maximized]
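The goal of small intra-cluster distances and large inter-cluster distances can be checked numerically for any proposed grouping. The sketch below assumes Euclidean distance on 2-D points; the function name `mean_intra_inter` and the sample points are hypothetical.

```python
from itertools import combinations
from math import dist


def mean_intra_inter(clusters):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    # Tag every point with the index of the cluster it belongs to.
    labeled = [(p, i) for i, points in enumerate(clusters) for p in points]
    for (p, i), (q, j) in combinations(labeled, 2):
        (intra if i == j else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# The same four points grouped two different ways.
good = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
bad = [[(0, 0), (5, 5)], [(0, 1), (5, 6)]]

good_intra, good_inter = mean_intra_inter(good)
bad_intra, bad_inter = mean_intra_inter(bad)
print("good grouping:", good_intra, good_inter)
print("bad grouping:", bad_intra, bad_inter)
```

For the first grouping the intra-cluster mean is far below the inter-cluster mean; for the second the relation is reversed, which marks it as the poorer clustering.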
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
Clustering: Application 1
 Market Segmentation:
 Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
 Approach:
 Collect different attributes of customers based on their
geographical and lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different
clusters.
Clustering: Application 2
 Document Clustering:
 Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
 Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
 Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
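One way to form a similarity measure from term frequencies is cosine similarity between term-count vectors. The slide does not name a specific measure, so cosine similarity is an assumption here, as are the function names and the three toy documents.

```python
from collections import Counter
from math import sqrt


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents, invented for illustration.
d1 = term_frequencies("data mining finds patterns in data")
d2 = term_frequencies("mining patterns from large data")
d3 = term_frequencies("telescopes scan the night sky")

sim_12 = cosine_similarity(d1, d2)
sim_13 = cosine_similarity(d1, d3)
print(sim_12, sim_13)
```

Documents that share frequent terms score closer to 1, and documents with no terms in common score 0; a clustering method can then group documents using these scores.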
Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, typically metric: d(i, j)
 There is a separate “quality” function that measures the
“goodness” of a cluster.
 The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.
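To make the points about d(i, j) concrete, the sketch below shows two distance functions for different variable types: a weighted Euclidean distance for numeric variables and a simple matching distance for categorical ones. Both function names are hypothetical, and the choice of these two particular measures is an assumption rather than something the slide prescribes.

```python
from math import sqrt


def weighted_euclidean(i, j, weights):
    """d(i, j) for numeric variables, with one weight per variable."""
    return sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(i, j, weights)))


def simple_matching(i, j):
    """d(i, j) for categorical variables: fraction of variables that differ."""
    return sum(a != b for a, b in zip(i, j)) / len(i)


print(weighted_euclidean((0, 0), (3, 4), (1, 1)))            # 5.0
print(weighted_euclidean((0, 0), (3, 4), (1, 0)))            # 3.0
print(simple_matching(("Yes", "Single"), ("No", "Single")))  # 0.5
```

Setting a weight to 0 removes that variable from the comparison, which is how application knowledge about the data can enter the distance.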
Association Rule Discovery: Definition
 Given a set of records, each of which contains some number of items from a given collection
 Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
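How strongly the five transactions back the two discovered rules can be checked by counting. The sketch below uses the usual measures: support as the fraction of transactions containing all items of a rule, and confidence as the fraction of transactions containing the antecedent that also contain the consequent. The helper names `count` and `rule_stats` are hypothetical.

```python
# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_stats(antecedent, consequent):
    """Return (support, confidence) for the rule antecedent --> consequent."""
    both = count(antecedent | consequent)
    return both / len(transactions), both / count(antecedent)


milk_coke = rule_stats({"Milk"}, {"Coke"})
diaper_milk_beer = rule_stats({"Diaper", "Milk"}, {"Beer"})
print(milk_coke)          # support 0.6, confidence 0.75
print(diaper_milk_beer)   # support 0.4, confidence about 0.67
```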
Association Rule: Basic Concepts
 Given: (1) database of transactions, (2) each transaction is
a list of items (purchased by a customer in a visit)
 Find: all rules that correlate the presence of one set of
items with that of another set of items
 E.g., 98% of people who purchase tires and auto
accessories also get automotive services done
 Applications
 *  Maintenance Agreement (What the store should
do to boost Maintenance Agreement sales)
 Home Electronics  * (What other products should
the store stocks up?)
 Attached mailing in direct marketing
 Detecting “ping-pong”ing of patients, faulty “collisions”
Rule Measures: Support and Confidence
 Find all the rules X & Y --> Z with minimum confidence and support
 support, s, probability that a transaction contains {X, Y, Z}
 confidence, c, conditional probability that a transaction having {X, Y} also contains Z
Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F
With minimum support 50% and minimum confidence 50%, we have
 A --> C (50%, 66.6%)
 C --> A (50%, 100%)
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
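The two rules reported for the four-transaction example can be reproduced by exhaustive search. The sketch below enumerates every itemset and every split into antecedent and consequent, which is only practical for toy data; it is not an efficient mining algorithm, and the names `support` and `mine_rules` are hypothetical.

```python
from itertools import combinations

# The four transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def mine_rules(min_support, min_confidence):
    """Brute-force rule search; assumes min_support is greater than 0."""
    all_items = sorted(set().union(*transactions.values()))
    rules = []
    for size in range(2, len(all_items) + 1):
        for combo in combinations(all_items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s < min_support:
                continue
            # Try every non-empty proper subset as the antecedent.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    antecedent = frozenset(lhs)
                    confidence = s / support(antecedent)
                    if confidence >= min_confidence:
                        rules.append((set(antecedent),
                                      set(itemset - antecedent),
                                      s, confidence))
    return rules


rules = mine_rules(0.5, 0.5)
for antecedent, consequent, s, c in rules:
    print(antecedent, "-->", consequent, f"support={s:.1%} confidence={c:.1%}")
```

With both thresholds at 50%, only the itemset {A, C} is frequent enough, so the search yields exactly A --> C and C --> A.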
Association Rule Discovery: Application 1
 Marketing and Sales Promotion:
 Let the rule discovered be
{Bagels, … } --> {Potato Chips}
 Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
 Bagels in the antecedent => Can be used to see
which products would be affected if the store
discontinues selling bagels.
 Bagels in antecedent and Potato chips in consequent
=> Can be used to see what products should be sold
with Bagels to promote sale of Potato chips!
Association Rule Discovery: Application 2
 Supermarket shelf management.
 Goal: To identify items that are bought together by
sufficiently many customers.
 Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
 A classic rule --
 If a customer buys diapers and milk, then they are very likely to buy beer.
 So, don’t be surprised if you find six-packs stacked
next to diapers!
Association Rule Discovery: Application 3
 Inventory Management:
 Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
 Approach: Process the data on tools and parts required
in previous repairs at different consumer locations and
discover the co-occurrence patterns.
Association Analysis: Applications
 Market-basket analysis
 Rules are used for sales promotion, shelf management,
and inventory management
 Telecommunication alarm diagnosis
 Rules are used to find combinations of alarms that occur together frequently in the same time period
 Medical Informatics
 Rules are used to find combinations of patient symptoms and test results associated with certain diseases
Data Mining Function: (5) Outlier Analysis
 Outlier analysis
 Outlier: A data object that does not comply with the general
behavior of the data
 Noise or exception? ― One person’s garbage could be another
person’s treasure
 Methods: by-product of clustering or regression analysis, …
 Useful in fraud detection, rare events analysis
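A very simple way to flag objects that do not comply with the general behavior of the data is a z-score rule. This is a different, simpler technique than the clustering-based and regression-based methods mentioned above, and it is shown only as an illustration; the function name, the threshold of 2.0, and the transaction amounts are all assumptions.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than threshold std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up transaction amounts with one unusually large value.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(zscore_outliers(amounts))  # [500]
```

Whether the flagged value is noise to discard or the exception of interest (for example, a fraudulent transaction) depends on the application.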
Motivating Challenges
 Scalability
 High Dimensionality
 Heterogeneous and Complex Data
 Mining various and new kinds of knowledge
Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top]
Layers, top to bottom:
 Decision Making
 Data Presentation: Visualization Techniques
 Data Mining: Information Discovery
 Data Exploration: Statistical Summary, Querying, and Reporting
 Data Preprocessing/Integration, Data Warehouses
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Example: Mining vs. Data Exploration
 Business intelligence view
 Warehouse, data cube, reporting but not much mining
 Business objects vs. data mining tools
 Supply chain example: tools
 Data presentation
 Exploration
KDD Process: A Typical View from ML and Statistics
 This is a view from typical machine learning and statistics communities
[Figure: Input Data --> Data Pre-Processing --> Data Mining --> Post-Processing]
 Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
 Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
 Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
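As one concrete instance of the pre-processing stage, the sketch below applies min-max normalization, which rescales a numeric column to the range 0 to 1 so that variables measured on different scales become comparable. The choice of min-max scaling (rather than another normalization) and the function name are assumptions; the sample values are incomes in thousands, used purely as an illustration.

```python
def min_max_normalize(column):
    """Rescale numeric values linearly so the minimum is 0 and the maximum 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no spread
    return [(v - lo) / (hi - lo) for v in column]


incomes = [60, 125, 220]
print(min_max_normalize(incomes))  # [0.0, 0.40625, 1.0]
```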
Example: Medical Data Mining
 Health care & medical data mining often adopts such a view from statistics and machine learning
 Preprocessing of the data (including feature
extraction and dimension reduction)
 Classification or/and clustering processes
 Post-processing for presentation

More Related Content

Similar to Introduction to Data Mining and technologies .ppt

Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and butest
 
Data Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business DatabasesData Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business Databasesbutest
 
data mining
data miningdata mining
data mininguoitc
 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic conceptsPritiRishi
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.docbutest
 
A Review On Data Mining From Past To The Future
A Review On Data Mining From Past To The FutureA Review On Data Mining From Past To The Future
A Review On Data Mining From Past To The FutureKaela Johnson
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryYoung Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryHarry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryJames Wong
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryLuis Goldster
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryTony Nguyen
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryHoang Nguyen
 

Similar to Introduction to Data Mining and technologies .ppt (20)

Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Machine Learning, Data Mining, and
Machine Learning, Data Mining, and Machine Learning, Data Mining, and
Machine Learning, Data Mining, and
 
Data Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business DatabasesData Mining and Knowledge Discovery in Business Databases
Data Mining and Knowledge Discovery in Business Databases
 
data mining
data miningdata mining
data mining
 
01datamining.pdf
01datamining.pdf01datamining.pdf
01datamining.pdf
 
Data Mining introduction and basic concepts
Data Mining introduction and basic conceptsData Mining introduction and basic concepts
Data Mining introduction and basic concepts
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Ci2004-10.doc
Ci2004-10.docCi2004-10.doc
Ci2004-10.doc
 
A Review On Data Mining From Past To The Future
A Review On Data Mining From Past To The FutureA Review On Data Mining From Past To The Future
A Review On Data Mining From Past To The Future
 
Dwdm
DwdmDwdm
Dwdm
 
Dm lecture1
Dm lecture1Dm lecture1
Dm lecture1
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing

Introduction to Data Mining and technologies .ppt

  • 1. Data Mining: Concepts and Techniques (3rd ed.), Chapter 1. Jiawei Han, Micheline Kamber, and Jian Pei, University of Illinois at Urbana-Champaign & Simon Fraser University. ©2011 Han, Kamber & Pei. All rights reserved.
  • 2. Chapter 1. Introduction
      • Why Data Mining?
      • What Is Data Mining?
      • A Multi-Dimensional View of Data Mining
      • What Kinds of Data Can Be Mined?
      • What Kinds of Patterns Can Be Mined?
      • What Technologies Are Used?
      • What Kinds of Applications Are Targeted?
      • Major Issues in Data Mining
      • A Brief History of Data Mining and Data Mining Society
      • Summary
  • 3. Why Data Mining?
      • The explosive growth of data: from terabytes to petabytes
          • Data collection and data availability
              • Automated data collection tools, database systems, Web, computerized society
          • Major sources of abundant data
              • Business: Web, e-commerce, transactions, stocks, …
              • Science: remote sensing, bioinformatics, scientific simulation, …
              • Society and everyone: news, digital cameras, YouTube
      • We are drowning in data, but starving for knowledge!
      • "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
  • 5. Why Mine Data? Commercial Viewpoint
      • Lots of data is being collected and warehoused
          • Web data, e-commerce
          • Purchases at department/grocery stores
          • Bank/credit card transactions
      • Computers have become cheaper and more powerful
      • Competitive pressure is strong
          • Provide better, customized services for an edge (e.g. in Customer Relationship Management)
  • 6. Why Mine Data? Scientific Viewpoint
      • Data collected and stored at enormous speeds (GB/hour)
          • Remote sensors on a satellite
          • Telescopes scanning the skies
          • The Large Hadron Collider generates 40 TB/sec
          • Scientific simulations generating terabytes of data
      • Traditional techniques infeasible for raw data
      • Data mining may help scientists
          • in classifying and segmenting data
          • in hypothesis formation
  • 7. Mining Large Data Sets - Motivation
      • There is often information "hidden" in the data that is not readily evident
      • Human analysts may take weeks to discover useful information
      • Much of the data is never analyzed at all
      [Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999]
  • 8. Evolution of Sciences
      • Before 1600: empirical science
      • 1600-1950s: theoretical science
          • Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
      • 1950s-1990s: computational science
          • Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
          • Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
      • 1990-now: data science
          • The flood of data from new scientific instruments and simulations
          • The ability to economically store and manage petabytes of data online
          • The Internet and computing Grid that makes all these archives universally accessible
          • Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
      • Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
  • 9. Evolution of Database Technology
      • 1960s:
          • Data collection, database creation, IMS and network DBMS
      • 1970s:
          • Relational data model, relational DBMS implementation
      • 1980s:
          • RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
          • Application-oriented DBMS (spatial, scientific, engineering, etc.)
      • 1990s:
          • Data mining, data warehousing, multimedia databases, and Web databases
      • 2000s:
          • Stream data management and mining
          • Data mining and its applications
          • Web technology (XML, data integration) and global information systems
  • 11. What Is Data Mining?
      • Data mining (knowledge discovery from data)
          • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
          • Data mining: a misnomer?
      • Alternative names
          • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
      • Watch out: Is everything "data mining"?
          • Simple search and query processing
          • (Deductive) expert systems
  • 12. Knowledge Discovery (KDD) Process
      • This is a view from typical database systems and data warehousing communities
      • Data mining plays an essential role in the knowledge discovery process
      [Figure: KDD process. Databases -> data cleaning and data integration -> data warehouse -> selection -> task-relevant data -> data mining -> pattern evaluation]
  • 13. Origins of Data Mining
      • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
      • Traditional techniques may be unsuitable due to
          • Enormity of data
          • High dimensionality of data
          • Heterogeneous, distributed nature of data
      [Figure: data mining at the overlap of machine learning/pattern recognition, statistics/AI, and database systems]
  • 14. Data Mining Tasks
      • Prediction methods
          • Use some variables to predict unknown or future values of other variables.
      • Description methods
          • Find human-interpretable patterns that describe the data.
  • 16. Multi-Dimensional View of Data Mining
      • Data to be mined
          • Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
      • Knowledge to be mined (or: data mining functions)
          • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
          • Descriptive vs. predictive data mining
          • Multiple/integrated functions and mining at multiple levels
      • Techniques utilized
          • Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
      • Applications adapted
          • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
  • 17. Data Mining Tasks
      • Prediction methods
          • Use some variables to predict unknown or future values of other variables.
      • Description methods
          • Find human-interpretable patterns that describe the data.
      • From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
      • Introduction to Data Mining, 2nd Edition, 01/17/2018
  • 18. Data Mining Tasks …
      Tid  Refund  Marital Status  Taxable Income  Cheat
      1    Yes     Single          125K            No
      2    No      Married         100K            No
      3    No      Single          70K             No
      4    Yes     Married         120K            No
      5    No      Divorced        95K             Yes
      6    No      Married         60K             No
      7    Yes     Divorced        220K            No
      8    No      Single          85K             Yes
      9    No      Married         75K             No
      10   No      Single          90K             Yes
      11   No      Married         60K             No
      12   Yes     Divorced        220K            No
      13   No      Single          85K             Yes
      14   No      Married         75K             No
      15   No      Single          90K             Yes
  • 19. Predictive Modeling: Classification
      • Find a model for the class attribute as a function of the values of other attributes
      Tid  Employed  Level of Education  # years at present address  Credit Worthy (class)
      1    Yes       Graduate            5                           Yes
      2    Yes       High School         2                           No
      3    No        Undergrad           1                           No
      4    Yes       High School         10                          Yes
      …    …         …                   …                           …
      Model for predicting credit worthiness:
          Employed?
              No  -> No
              Yes -> Education?
                  Graduate                 -> Number of years? (> 3 yr: Yes; < 3 yr: No)
                  {High school, Undergrad} -> Number of years? (> 7 yrs: Yes; < 7 yrs: No)
  • 20. Classification Example
      Training set:
      Tid  Employed  Level of Education  # years at present address  Credit Worthy
      1    Yes       Graduate            5                           Yes
      2    Yes       High School         2                           No
      3    No        Undergrad           1                           No
      4    Yes       High School         10                          Yes
      …    …         …                   …                           …
      Test set:
      Tid  Employed  Level of Education  # years at present address  Credit Worthy
      1    Yes       Undergrad           7                           ?
      2    No        Graduate            3                           ?
      3    Yes       High School         2                           ?
      …    …         …                   …                           …
      [Figure: training set -> learn classifier -> model, which is then applied to the test set]
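The train-then-apply flow on the two classification slides can be sketched in a few lines of Python. This is an illustration only: no classifier is learned here. The rules are typed in by hand from one reading of the flattened decision-tree figure (Graduate splits at 3 years, the other education levels at 7 years), and treating a value exactly at the threshold as "No" is an assumption, since the slide only labels the branches "> 7 yrs" and "< 7 yrs". The function name `predict_credit_worthy` is hypothetical.

```python
def predict_credit_worthy(employed, education, years_at_address):
    """Hand-coded reading of the decision tree on the classification slide."""
    if employed != "Yes":
        return "No"
    # Assumption: Graduate splits at 3 years, other levels at 7 years,
    # and a value exactly on the threshold falls on the "No" side.
    threshold = 3 if education == "Graduate" else 7
    return "Yes" if years_at_address > threshold else "No"


# The four labeled training rows shown on the slide.
training_set = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]

# The three unlabeled test rows shown on the slide.
test_set = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]

# Fraction of training rows the hand-coded tree reproduces.
training_accuracy = sum(
    predict_credit_worthy(e, ed, y) == label for e, ed, y, label in training_set
) / len(training_set)

# Apply the model to the unlabeled rows.
predictions = [predict_credit_worthy(*row) for row in test_set]
print(training_accuracy, predictions)
```

The tree reproduces all four training labels, which is what makes this reading of the figure plausible. The first test row sits exactly on the 7-year boundary, so its prediction depends on the boundary assumption above.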
  • 21. Examples of Classification Task
      • Classifying credit card transactions as legitimate or fraudulent
      • Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
      • Categorizing news stories as finance, weather, entertainment, sports, etc.
      • Identifying intruders in cyberspace
      • Predicting tumor cells as benign or malignant
      • Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
  • 22. Classification: Application 1
      • Fraud detection
          • Goal: predict fraudulent cases in credit card transactions.
          • Approach:
              • Use credit card transactions and the information on the account holder as attributes.
                  • When a customer buys, what they buy, how often they pay on time, etc.
              • Label past transactions as fraudulent or fair. This forms the class attribute.
              • Learn a model for the class of the transactions.
              • Use this model to detect fraud by observing credit card transactions on an account.
  • 23. Classification: Application 2
      • Churn prediction for telephone customers
          • Goal: to predict whether a customer is likely to be lost to a competitor.
          • Approach:
              • Use detailed records of transactions with each of the past and present customers to find attributes.
                  • How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
              • Label the customers as loyal or disloyal.
              • Find a model for loyalty.
      • From [Berry & Linoff] Data Mining Techniques, 1997
  • 24. Classification: Application 3
      • Sky survey cataloging
          • Goal: to predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from Palomar Observatory).
              • 3000 images with 23,040 x 23,040 pixels per image.
          • Approach:
              • Segment the image.
              • Measure image attributes (features): 40 of them per object.
              • Model the class based on these features.
              • Success story: could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!
      • From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
  • 25. Classifying Galaxies
      • Class: stages of formation (Early, Intermediate, Late)
      • Attributes: image features, characteristics of light waves received, etc.
      • Data size: 72 million stars, 20 million galaxies; object catalog: 9 GB; image database: 150 GB
      • Courtesy: http://aps.umn.edu
  • 26. Regression
      • Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
      • Extensively studied in the statistics and neural network fields.
      • Examples:
          • Predicting sales amounts of a new product based on advertising expenditure.
          • Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
          • Time series prediction of stock market indices.
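As a minimal sketch of the linear case, the following fits y = intercept + slope * x by ordinary least squares, using the slide's advertising-versus-sales example. The data points are hypothetical and constructed to lie exactly on a line so the result is easy to check by hand; the function name `fit_line` is likewise an illustration, not something taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising expenditure vs. sales, lying on y = 1 + 2x.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(advertising, sales)

# Predict sales for a new advertising level.
predicted_sales = intercept + slope * 5.0
print(intercept, slope, predicted_sales)
```

Real sales data would not fall on a perfect line; the same formulas then give the line that minimizes the sum of squared residuals.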
  • 27. What is Cluster Analysis? (April 5, 2024)
      • Cluster: a collection of data objects
          • Similar to one another within the same cluster
          • Dissimilar to the objects in other clusters
      • Cluster analysis
          • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
      • Unsupervised learning: no predefined classes
      • Typical applications
          • As a stand-alone tool to get insight into data distribution
          • As a preprocessing step for other algorithms
  • 28. Clustering
      • Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
      [Figure: clusters of points. Intra-cluster distances are minimized; inter-cluster distances are maximized]
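One common way to pursue small intra-cluster distances is k-means (Lloyd's algorithm). The slides do not name a specific algorithm, so the sketch below is a swapped-in illustration: the 2-D points are hypothetical, and using the first k points as initial centroids is a simplifying assumption that keeps the run deterministic.

```python
import math


def k_means(points, k, iterations=10):
    """Lloyd's algorithm; the first k points serve as the initial centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (8, 8), (1, 2), (2, 1), (9, 8), (8, 9)]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

Each iteration cannot increase the total squared distance from points to their assigned centroids, which is the "intra-cluster distances are minimized" half of the slide. The result depends on the initial centroids, so practical implementations usually try several starts.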
  • 29. Quality: What Is Good Clustering?
      • A good clustering method will produce high quality clusters with
          • high intra-class similarity
          • low inter-class similarity
      • The quality of a clustering result depends on both the similarity measure used by the method and its implementation
      • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
  • 30. Examples of Clustering Applications
      • Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
      • Land use: identification of areas of similar land use in an earth observation database
      • Insurance: identifying groups of motor insurance policy holders with a high average claim cost
      • City planning: identifying groups of houses according to their house type, value, and geographical location
      • Earthquake studies: observed earthquake epicenters should be clustered along continental faults
  • 31. Clustering: Application 1
      • Market segmentation
          • Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
          • Approach:
              • Collect different attributes of customers based on their geographical and lifestyle-related information.
              • Find clusters of similar customers.
              • Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
  • 32. Clustering: Application 2
      • Document clustering
          • Goal: to find groups of documents that are similar to each other based on the important terms appearing in them.
          • Approach: identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
          • Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
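A similarity measure built from term frequencies can be sketched as follows. Cosine similarity over raw term counts is one common choice and is an assumption here, since the slide does not say which measure is meant. The three documents are hypothetical, and the sketch skips the "important terms" step (no stop-word removal or weighting).

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared_terms = tf_a.keys() & tf_b.keys()
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about data mining, one about astronomy.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data sets",
    "d3": "the telescope images faint galaxies",
}
vectors = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])
print(sim_12, sim_13)
```

Documents d1 and d2 share several terms and score well above d3, which shares none with d1. A clustering algorithm can then group documents using 1 minus this similarity as its distance.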
  • 33. Measure the Quality of Clustering
      • Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric: d(i, j)
      • There is a separate "quality" function that measures the "goodness" of a cluster.
      • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables.
      • Weights should be associated with different variables based on applications and data semantics.
      • It is hard to define "similar enough" or "good enough"
          • The answer is typically highly subjective.
  • 34. Association Rule Discovery: Definition
      • Given a set of records, each of which contains some number of items from a given collection
          • Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
      TID  Items
      1    Bread, Coke, Milk
      2    Beer, Bread
      3    Beer, Coke, Diaper, Milk
      4    Beer, Bread, Diaper, Milk
      5    Coke, Diaper, Milk
      Rules discovered:
          {Milk} --> {Coke}
          {Diaper, Milk} --> {Beer}
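To show how such rules can be produced from the five transactions above, here is a brute-force sketch that tries every small itemset and every single-item consequent. It is an illustration under stated assumptions, not an efficient miner and not the method used by the slide authors: the thresholds (40% support, 60% confidence) and the function name `mine_rules` are chosen here, not given on the slide.

```python
from itertools import combinations

# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(min_support, min_confidence, max_size=3):
    """Enumerate rules antecedent --> single item that meet both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset)
            if s < min_support:
                continue
            for consequent in combo:
                antecedent = itemset - {consequent}
                confidence = s / support(antecedent)
                if confidence >= min_confidence:
                    rules.append(
                        (tuple(sorted(antecedent)), consequent, s, confidence)
                    )
    return rules


# Assumed thresholds for illustration.
rules = mine_rules(min_support=0.4, min_confidence=0.6)
for antecedent, consequent, s, c in rules:
    print(set(antecedent), "-->", consequent, round(s, 2), round(c, 2))
```

Both rules listed on the slide appear in the output, alongside others that also clear the assumed thresholds. Enumerating every itemset grows exponentially with the number of items, which is why practical miners prune the search.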
  • 35. Association Rule: Basic Concepts
      • Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
      • Find: all rules that correlate the presence of one set of items with that of another set of items
          • E.g., 98% of people who purchase tires and auto accessories also get automotive services done
      • Applications
          • * --> Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
          • Home Electronics --> * (What other products should the store stock up on?)
          • Attached mailing in direct marketing
          • Detecting "ping-pong"ing of patients, faulty "collisions"
  • 36. Rule Measures: Support and Confidence
      • Find all the rules X & Y --> Z with minimum confidence and support
          • Support, s: probability that a transaction contains {X, Y, Z}
          • Confidence, c: conditional probability that a transaction having {X, Y} also contains Z
      Transaction ID  Items Bought
      2000            A, B, C
      1000            A, C
      4000            A, D
      5000            B, E, F
      • With minimum support 50% and minimum confidence 50%, we have
          • A --> C (50%, 66.6%)
          • C --> A (50%, 100%)
      [Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
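The two measures can be checked directly against the four-transaction table on this slide. The sketch below is a plain reading of the definitions (support as a fraction of transactions, confidence as a ratio of supports); the function names are illustrative.

```python
# The four transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(items):
    """Fraction of transactions containing every item in items."""
    items = set(items)
    hits = sum(items <= bought for bought in transactions.values())
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


# Rule A --> C and its reverse, C --> A.
a_to_c = (support({"A", "C"}), confidence({"A"}, {"C"}))
c_to_a = (support({"A", "C"}), confidence({"C"}, {"A"}))
print(a_to_c, c_to_a)
```

Both rules share the same support because support depends only on the combined itemset, while confidence differs because A appears in three transactions and C in only two.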
  • 37. Association Rule Discovery: Application 1
      • Marketing and sales promotion
          • Let the rule discovered be {Bagels, … } --> {Potato Chips}
          • Potato chips as consequent: can be used to determine what should be done to boost their sales.
          • Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
          • Bagels in the antecedent and potato chips in the consequent: can be used to see what products should be sold with bagels to promote sales of potato chips!
  • 38. Association Rule Discovery: Application 2
      • Supermarket shelf management
          • Goal: to identify items that are bought together by sufficiently many customers.
          • Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
          • A classic rule:
              • If a customer buys diapers and milk, then they are very likely to buy beer.
              • So, don't be surprised if you find six-packs stacked next to diapers!
  • 39. Association Rule Discovery: Application 3
      • Inventory management
          • Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
          • Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
  • 40. Association Analysis: Applications
      • Market-basket analysis
          • Rules are used for sales promotion, shelf management, and inventory management
      • Telecommunication alarm diagnosis
          • Rules are used to find combinations of alarms that occur together frequently in the same time period
      • Medical informatics
          • Rules are used to find combinations of patient symptoms and test results associated with certain diseases
  • 41. Data Mining Function: (5) Outlier Analysis
      • Outlier analysis
          • Outlier: a data object that does not comply with the general behavior of the data
          • Noise or exception? One person's garbage could be another person's treasure
          • Methods: as a by-product of clustering or regression analysis, …
          • Useful in fraud detection, rare events analysis
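As a minimal illustration of "does not comply with the general behavior of the data", the sketch below flags values that lie far from the mean in units of standard deviation. This z-score rule is a swapped-in, simpler technique than the clustering-based or regression-based methods the slide mentions, and both the transaction amounts and the threshold of 2.0 are hypothetical.

```python
import statistics


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than threshold std. deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


# Hypothetical transaction amounts with one unusually large purchase.
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = z_score_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or the interesting exception (for example, a fraudulent charge) depends on the application, which is the point of the slide's "garbage or treasure" remark. A single extreme value also inflates the standard deviation itself, so robust statistics such as the median are often preferred in practice.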
  • 42. Motivating Challenges
      • Scalability
      • High dimensionality
      • Heterogeneous and complex data
      • Mining various and new kinds of knowledge
  • 43. Data Mining in Business Intelligence
      [Figure: layered pyramid with increasing potential to support business decisions toward the top]
      • Layers, top to bottom:
          • Decision Making
          • Data Presentation: Visualization Techniques
          • Data Mining: Information Discovery
          • Data Exploration: Statistical Summary, Querying, and Reporting
          • Data Preprocessing/Integration, Data Warehouses
          • Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
      • Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
  • 44. Example: Mining vs. Data Exploration
      • Business intelligence view
          • Warehouse, data cube, reporting but not much mining
      • Business objects vs. data mining tools
      • Supply chain example: tools
      • Data presentation
      • Exploration
  • 45. KDD Process: A Typical View from ML and Statistics
      • This is a view from typical machine learning and statistics communities
      [Figure: Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing]
      • Data pre-processing: data integration, normalization, feature selection, dimension reduction
      • Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
      • Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
  • 46. Example: Medical Data Mining
      • Health care and medical data mining often adopt such a view from statistics and machine learning
          • Preprocessing of the data (including feature extraction and dimension reduction)
          • Classification and/or clustering processes
          • Post-processing for presentation