Michael Welge, Loretta Auvil, Lisa Gatzke, Automated Learning Group, National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign
Jiawei Han, Computer Science, University of Illinois at Urbana-Champaign
Literature
Data Mining – Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001
Introduction to Knowledge Discovery in Databases and Data Mining
Computational Knowledge Discovery
Terminology
Data Mining
A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
Knowledge Discovery Process
The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
Terminology - A Working Definition
Data Mining is a “decision support” process in which we search for patterns of information in data.
Data Mining is a process of discovering advantageous patterns in data.
A pattern is a conservative statement about a probability distribution.
Webster: A pattern is (a) a natural or chance configuration, (b) a reliable sample of traits, acts, tendencies, or other observable characteristics of a person, group, or institution
Data Mining: On What Kind of Data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Database Systems
Object-Relational
Spatial and Temporal
Time-Series
Multimedia
Text
Heterogeneous, Legacy, and Distributed
WWW
Structure - 3D Anatomy
Function - 1D Signal
Metadata - Annotation
Data Mining: Confluence of Multiple Disciplines
20x20 ~ 2^400 ≈ 10^120 patterns
Why Do We Need Data Mining?
Data volumes are too large for classical analysis approaches:
Large number of records (10^8 – 10^12 bytes)
High dimensional data (10^2 – 10^4 attributes)
How do you explore millions of records, tens or hundreds of fields, and find patterns?
Why Do We Need Data Mining?
Leverage organization’s data assets
Only a small portion (typically 5%-10%) of the collected data is ever analyzed
Data that may never be analyzed continues to be collected, at a great expense, out of fear that something which may prove important in the future is missing.
The growth rate of data precludes the traditional “manually intensive” approach
Why Do We Need Data Mining?
As databases grow, supporting the decision support process with traditional query languages becomes infeasible
Many queries of interest are difficult to state in a query language (Query formulation problem)
“find all cases of fraud”
“find all individuals likely to buy a Ford Expedition”
“find all documents that are similar to this customer's problem”
[Figure: QUERY and RESULT, showing (Latitude, Longitude) 1 and (Latitude, Longitude) 2]
What is It?
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
The understandable patterns are used to:
Make predictions or classifications about new data
Explain existing data
Summarize the contents of a large database to support decision making
Graphical data visualization to aid humans in discovering deeper patterns
Applications of Data Mining
Data Mining Applications
Market analysis
Risk analysis and management
Fraud detection and detection of unusual patterns (outliers)
Text mining (news group, email, documents) and Web mining
Association Rules, Link Analysis, Self Organizing Maps
Predictive Modeling
Classification – Naive Bayesian, Neural Networks, Decision Trees
Regression – Neural Networks, Regression Trees
Deviation Detection
Visualization
Text To Knowledge (T2K)
Image To Knowledge (I2K)
----------------------
Audio, Touch, Scent and Savor To Knowledge
Knowledge To Wisdom (K2W)
Data Mining at Work
Data Sources: Single, Multiple, Numerous
Project Objectives
Diagnostics
Target Marketing
Effluent Quality Control
Decision Support
Automation
Transaction Management
Cost Prediction (Warranty, Insurance Claims)
Warranty Clustering
Territorial Ratemaking
Web Information Retrieval, Archival and Clustering
Auto Loss Ratio Predictions
Precision Farming
Bio-Informatics
Functional Foods
Heterogeneous Data Visualization
Crime Data Analysis
Data Fusion and Visualization
Survey Study of Disability
Examples of Data Mining Methods
Three Primary Data Mining Paradigms
Discovery
Example: Association Rules
Predictive Modeling
Classification Example: Decision Trees
Deviation Detection
Visualization
Association Rules and Market Basket Analysis
What is Market Basket Analysis?
Customer Analysis
Market Basket Analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases.
Product Analysis
Market Basket Analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase.
Market Basket Example
Is soda typically purchased with bananas? Does the brand of soda make a difference?
Where should detergents be placed in the store to maximize their sales?
Are window cleaning products purchased when detergents and orange juice are bought together?
How are the demographics of the neighborhood affecting what customers are buying?
Association Rules
There has been a considerable amount of research in the area of Market Basket Analysis. Its appeal comes from the clarity and utility of its results, which are expressed in the form of association rules.
Given
A database of transactions
Each transaction contains a set of items
Find all rules X->Y that correlate the presence of one set of items X with another set of items Y
Example: When a customer buys bread and butter, they buy milk 85% of the time
Results: Useful, Trivial, or Inexplicable?
While association rules are easy to understand, they are not always useful.
Useful: On Fridays convenience store customers often purchase diapers and beer together.
Trivial: Customers who purchase maintenance agreements are very likely to purchase large appliances.
Inexplicable: When a new Super Store opens, one of the most commonly sold items is light bulbs.
How Does It Work?
Grocery Point-of-Sale Transactions
Customer   Items
1          Orange Juice, Soda
2          Milk, Orange Juice, Window Cleaner
3          Orange Juice, Detergent
4          Orange Juice, Detergent, Soda
5          Window Cleaner, Soda
Co-Occurrence of Products
                 OJ   Window Cleaner   Milk   Soda   Detergent
OJ               4    1                1      2      2
Window Cleaner   1    2                1      1      0
Milk             1    1                1      0      0
Soda             2    1                0      3      1
Detergent        2    0                0      1      2
The co-occurrence table contains some simple patterns
Orange juice is purchased together with soda, and with detergent, in two transactions each; no other pair of different items occurs more than once
Detergent is never purchased with window cleaner or milk
Milk is never purchased with soda or detergent
These simple observations are examples of Associations and may suggest a formal rule like:
If a customer purchases soda, THEN the customer also purchases orange juice
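The co-occurrence counts can be derived mechanically from the five point-of-sale transactions. The following is a minimal sketch, not the method of any particular tool; the function name co_occurrence and the use of sorted item pairs as dictionary keys are illustrative choices.

```python
from collections import Counter
from itertools import combinations

# The five grocery point-of-sale transactions from the example.
transactions = [
    {"Orange Juice", "Soda"},
    {"Milk", "Orange Juice", "Window Cleaner"},
    {"Orange Juice", "Detergent"},
    {"Orange Juice", "Detergent", "Soda"},
    {"Window Cleaner", "Soda"},
]

def co_occurrence(baskets):
    """Count baskets containing each item (diagonal) and each unordered pair."""
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)
        for item in items:
            counts[(item, item)] += 1   # diagonal entry: item frequency
        for a, b in combinations(items, 2):
            counts[(a, b)] += 1         # off-diagonal entry: pair frequency
    return counts

counts = co_occurrence(transactions)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs are stored once, in alphabetical order, because the table is symmetric. A pair that never occurs is simply absent, and a Counter reports it as 0.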
How Good Are the Rules?
In the data, two of five transactions include both soda and orange juice. These two transactions support the rule. The support for the rule is two out of five, or 40%.
Three transactions contain soda, and two of them also contain orange juice. So the rule If soda, THEN orange juice has a confidence of two out of three, or about 67%.
Confidence and Support - How Good Are the Rules
A rule must have some minimum user-specified confidence
1 & 2 -> 3 has 90% confidence if, in 90% of the cases where a customer bought 1 and 2, the customer also bought 3.
A rule must have some minimum user-specified support
1 & 2 -> 3 should hold in some minimum percentage of transactions to have value.
Confidence and Support
Transaction ID #   Items
1                  { 1, 2, 3 }
2                  { 1, 3 }
3                  { 1, 4 }
4                  { 2, 5, 6 }
Frequent One Item Set   Support
{ 1 }                   75 %
{ 2 }                   50 %
{ 3 }                   50 %
{ 4 }                   25 %
Frequent Two Item Set   Support
{ 1,2 }                 25 %
{ 1,3 }                 50 %
{ 1,4 }                 25 %
{ 2,3 }                 25 %
For minimum support = 50% = 2 transactions and minimum confidence = 50%
For the rule 1 => 3:
Support = Support({1,3}) = 50%
Confidence (1->3) = Support({1,3})/Support({1}) = 66%
Confidence (3->1) = Support({1,3})/Support({3}) = 100%
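The support and confidence figures for the four-transaction example can be checked with a few lines of code. This is a sketch under the definitions used here (confidence of X -> Y is the support of X and Y together divided by the support of X); the function names support and confidence are illustrative.

```python
# The four transactions from the confidence and support example.
transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(condition, result, baskets):
    """Support of condition and result together, divided by support of condition.

    Assumes the condition occurs in at least one basket.
    """
    both = set(condition) | set(result)
    return support(both, baskets) / support(condition, baskets)

print("support({1,3})    =", support({1, 3}, transactions))
print("confidence(1->3)  =", confidence({1}, {3}, transactions))
print("confidence(3->1)  =", confidence({3}, {1}, transactions))
```

The two confidence values differ because the same joint support is divided by a different denominator: item 1 appears in three transactions, item 3 in only two.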
Association Examples
Find all rules that have “Diet Coke” as a result . These rules may help plan what the store should do to boost the sales of Diet Coke.
Find all rules that have “Yogurt” in the condition . These rules may help determine what products may be impacted if the store discontinues selling “Yogurt”.
Find all rules that have “Brats” in the condition and “mustard” in the result . These rules may help in determining the additional items that have to be sold together to make it highly likely that mustard will also be sold.
Find the best k rules that have “Yogurt” in the result .
The Basic Process
Choosing the right set of items
Taxonomies
Generation of rules
If condition Then result
Negation
Overcoming the practical limits imposed by thousands or tens of thousands of products
Minimum Support Pruning
Choosing the Right Set of Items
Partial Product Taxonomy (levels listed from General to Specific)
Frozen Foods
Frozen Desserts, Frozen Vegetables, Frozen Dinners
Frozen Yogurt, Frozen Fruit Bars, Ice Cream, Peas, Carrots, Mixed, Other
Rocky Road, Chocolate, Strawberry, Vanilla, Cherry Garcia, Other
Example - Minimum Support Pruning / Rule Generation
Transaction ID #   Items
1                  { 1, 3, 4 }
2                  { 2, 3, 5 }
3                  { 1, 2, 3, 5 }
4                  { 2, 5 }
Scan Database, Find Pairings, Find Level of Support
Itemset   Support
{ 1 }     2
{ 2 }     3
{ 3 }     3
{ 4 }     1
{ 5 }     3
After pruning
Itemset   Support
{ 2 }     3
{ 3 }     3
{ 5 }     3
Scan Database, Find Pairings, Find Level of Support
Itemset     Support
{ 2, 3 }    2
{ 2, 5 }    3
{ 3, 5 }    2
After pruning
Itemset     Support
{ 2, 5 }    3
Two rules with the highest support for two item set: 2->5 and 5->2
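The scan, prune, and pair-up passes in this example can be sketched as a level-wise search. This is an Apriori-style illustration, not necessarily the exact procedure behind the slide. The threshold MIN_COUNT = 3 is an assumption inferred from which itemsets survive in the example, and the function names are hypothetical.

```python
from itertools import combinations

# The four transactions from the pruning example.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_COUNT = 3  # assumption: keep itemsets found in at least 3 transactions

def count_support(candidates, baskets):
    """Scan the database once and count the baskets containing each candidate."""
    return {c: sum(c <= basket for basket in baskets) for c in candidates}

def frequent_itemsets(baskets, min_count):
    """Return one dict per level, mapping each surviving itemset to its count."""
    items = sorted(set().union(*baskets))
    candidates = [frozenset([i]) for i in items]
    levels = []
    while candidates:
        counts = count_support(candidates, baskets)
        kept = {c: n for c, n in counts.items() if n >= min_count}
        if not kept:
            break
        levels.append(kept)
        # Build candidates one item larger from the surviving items, keeping
        # only those whose every smaller subset also survived.
        size = len(next(iter(kept))) + 1
        survivors = sorted(set().union(*kept))
        candidates = [
            frozenset(c)
            for c in combinations(survivors, size)
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return levels

for level in frequent_itemsets(transactions, MIN_COUNT):
    print({tuple(sorted(k)): v for k, v in level.items()})
```

Pruning before pairing is what keeps the search tractable: items 1 and 4 are dropped after the first scan, so no pair containing them is ever counted.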
Other Association Rule Applications
Quantitative Association Rules
Age[35..40] and Married[Yes] -> NumCars[2]
Association Rules with Constraints
Find all association rules where the prices of items are > 100 dollars
Temporal Association Rules
Diaper -> Beer (1% support, 80% confidence)
Diaper -> Beer (20%support) 7:00-9:00 PM weekdays
Optimized Association Rules
Given a rule (l < A < u) and X -> Y, find values for l and u such that the support is greater than a certain threshold and the support and confidence are maximized.
Decision Tree for Concept: PlayTennis
Outlook?
  Sunny: Humidity?
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind?
    Strong: No
    Light: Yes
Decision Trees and Decision Boundaries
How to Visualize Decision Trees?
Example: Dividing Instance Space into Axis-Parallel Rectangles
More than two variables?
[Figure: + and - examples plotted against x and y (ticks at 1, 3, 5, 7), partitioned by a tree with the tests y > 7?, x < 3?, y < 5?, x < 1?]
An Illustrative Example
Training Examples for Concept PlayTennis
Day   Outlook    Temperature   Humidity   Wind     PlayTennis?
1     Sunny      Hot           High       Light    No
2     Sunny      Hot           High       Strong   No
3     Overcast   Hot           High       Light    Yes
4     Rain       Mild          High       Light    Yes
5     Rain       Cool          Normal     Light    Yes
6     Rain       Cool          Normal     Strong   No
7     Overcast   Cool          Normal     Strong   Yes
8     Sunny      Mild          High       Light    No
9     Sunny      Cool          Normal     Light    Yes
10    Rain       Mild          Normal     Light    Yes
11    Sunny      Mild          Normal     Strong   Yes
12    Overcast   Mild          High       Strong   Yes
13    Overcast   Hot           Normal     Light    Yes
14    Rain       Mild          High       Strong   No
Constructing a Decision Tree for PlayTennis
The Initial Decision Tree with One Leaf
[9+, 5-]
E(D) = min(9/14, 5/14) = 5/14 = 36%
Goal: maximize error reduction E, where the error reduction relative to attribute A is the expected reduction in error due to splitting on A:
Question: What attribute A and what value of A should we split on?
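The choice of splitting attribute can be explored numerically on the 14 PlayTennis training examples. The sketch below assumes that error reduction for an attribute A is E(D) minus the size-weighted sum of E(D_v) over the subsets D_v produced by each value v of A, with E the misclassification error used for the one-leaf tree. That formula is an assumption filled in from the verbal definition, and the names error and error_reduction are illustrative.

```python
from collections import Counter

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")

# Rows: (Outlook, Temperature, Humidity, Wind, PlayTennis) for days 1 to 14.
ROWS = [
    ("Sunny", "Hot", "High", "Light", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Light", "Yes"),
    ("Rain", "Mild", "High", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Light", "No"),
    ("Sunny", "Cool", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "Normal", "Light", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Light", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def error(labels):
    """Misclassification error: fraction of labels outside the majority class."""
    if not labels:
        return 0.0
    return 1 - max(Counter(labels).values()) / len(labels)

def error_reduction(rows, attr_index):
    """E(D) minus the size-weighted error of the subsets after splitting."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    weighted = sum(len(g) / len(rows) * error(g) for g in groups.values())
    return error(labels) - weighted

reductions = {name: error_reduction(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, value in reductions.items():
    print(f"{name}: {value:.4f}")
```

With two classes, the majority-based error equals min(p+, p-), so error on the full set reproduces the E(D) of the one-leaf tree.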
Strengths of Decision Trees
Decision trees are able to generate understandable results
Decision trees perform classification without requiring much computation
Decision trees can handle both continuous and categorical variables
Decision trees provide a clear indication of which attributes are most important for prediction or classification
Weaknesses of Decision Trees
Error-prone with too many classes
Quick partitioning of data results in fast deterioration in attribute selection quality
Trouble with non-rectangular regions
Visualization
Visualization Example: Naïve Bayesian Three Flower Types; Petal and Sepal Based Classification
Naïve Bayesian Visualization
The right hand pane shows the distribution of the classes.
The left hand pane shows the attributes and each of their values. They are listed by order of significance.
The message box shows details about each pie chart when brushed.
Clicking on a pie chart shows how knowing this information can change the overall class prediction.
Clicking on multiple pie charts calculates conditional probabilities.
Zoom in and out using the right mouse button.
Notice Iris-versicolor has a 33% likelihood
Rule Association Visualization
Read rules down the column
Example - the rule in the column labeled as 2 is
if petal-width Binned=(…, 2.) then flower-type=Iris-setosa
Support = 25%
Confidence = 100%
Discovery Using Rule Association
What services are purchased together?
What products or transactions are executed by customers on a single visit to your website?
What are the relationships in the data?
Parallel Coordinates - Visualization
Each vertical line represents a field with the minimum and maximum values represented at bottom and top.
Each record has a line that connects its value at each field
Lines are colored based on the output field
Clicking on the label boxes allows the lines to be rearranged
Zooming is accomplished by dragging a box over the desired area. Clicking returns to the original view.
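The layout described above amounts to rescaling each field so that its minimum sits at the bottom of its axis and its maximum at the top. The sketch below computes those positions only and draws nothing. The function name, the field names a and b, and the sample records are all made up for illustration.

```python
def parallel_coordinates(records, fields):
    """Return, per record, one (axis index, height) point per field.

    Heights are rescaled to [0, 1]: 0 is the field minimum (bottom of the
    axis) and 1 is the field maximum (top of the axis).
    """
    lows = {f: min(r[f] for r in records) for f in fields}
    highs = {f: max(r[f] for r in records) for f in fields}
    lines = []
    for record in records:
        line = []
        for x, f in enumerate(fields):
            span = highs[f] - lows[f]
            # A field with a single value has no range; place it mid-axis.
            y = (record[f] - lows[f]) / span if span else 0.5
            line.append((x, y))
        lines.append(line)
    return lines

# Hypothetical records with two numeric fields.
records = [{"a": 1, "b": 10}, {"a": 2, "b": 30}, {"a": 3, "b": 20}]
lines = parallel_coordinates(records, ["a", "b"])
for line in lines:
    print(line)
```

Rearranging the axes, as the label boxes allow, corresponds to passing the fields in a different order.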
Scatterplots - Visualization
Image To Knowledge (I2K): Data Visualization
Hyperspectral image with 120 bands
Image To Knowledge (I2K): Visualization of Results
Classification Results
Class labels per pixel
Class labels per geographical entity
Class labels of aggregations
Alignment Results
Overlays
Summary Charts
Image Operations
Enhancements
Image Restoration
Filtering
T2K - Text to Knowledge: Topic Evolution
Any chronologically ordered text
News feeds
Email
Protein Consumption Dynamics
Objective
To understand, through database visualization, global protein consumption patterns by providing a means to directly compare historical and simulated data.
Presented at the Global Soy Forum - 1999
Data Comparison, Reduction & Synthesis
Goal
Development of a 3D visualization tool for multi-channel on-board sensor data. This tool allows for multiple time series comparison, reduction and synthesis.