Lecture 4
    Lecture 4 Presentation Transcript

    • Data Mining UMUC CSMN 667 Lecture #4
    • Case Analysis Paper - Reminder
      • Please e-mail me your suggested topic (the application area to be researched) so that I may verify that it is okay.
    • Clarification on Case Analysis Term Paper
      • Your term paper should be a research paper on your selected topic: how has data mining been used in that topic area? etc.
      • You should not carry out the data mining exercise yourself with those data. But, you should report on the findings of other organizations who have worked in that area.
    • Math, what math?
      • Should you pay detailed attention to the math in the textbook? … Yes and No ...
        • Yes, if I have discussed it in these Lectures, or in the WebTycho discussions, or in the Lab Exercises … but you do not need to know the details of the equation derivations or proofs.
        • No, if you have not seen it in these Lectures or in WebTycho discussions or in lab exercises.
    • Lecture 4 “Classification - Part 1”
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • Data Mining: From Applications to Algorithms (perhaps useful in preparing your Case Analysis Term Paper)
    • Introduction to Classification Applications
      • Classification = to learn a function that classifies the data into a set of predefined classes.
        • predicts categorical class labels (i.e., discrete labels)
        • classifies data (constructs a model) based on the training set and on the values ( class labels ) in a classifying attribute; and then uses the model to classify new database entries.
        • Example: A bank might want to learn a function that determines whether a customer should get a loan or not. Decision trees and Bayesian classifiers are examples of classification algorithms. This is called Credit Scoring.
        • Other applications: Credit approval; Target marketing; Medical diagnosis; Outcome (e.g., Treatment) analysis.
    • Classification - a 2-Step Process
      • Model Construction ( Description ) : describing a set of predetermined classes = Build the Model.
        • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
        • The set of tuples used for model construction = the training set
        • The model is represented by classification rules, decision trees, or mathematical formulae
      • Model Usage ( Prediction ) : for classifying future or unknown objects, or for predicting missing values = Apply the Model.
        • It is important to estimate the accuracy of the model:
          • The known label of test sample is compared with the classified result from the model
          • Accuracy rate is the percentage of test set samples that are correctly classified by the model
          • Test set is chosen completely independent of the training set, otherwise over-fitting will occur
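The accuracy estimate described above can be sketched in a few lines of Python. This is a hypothetical toy illustration: the "model" is a single hand-written rule (the tenure rule used later in this lecture), standing in for a real learned classifier, and the test tuples are made up.

```python
# A minimal sketch of estimating model accuracy on an independent test set.
# "toy_model" is a hypothetical stand-in rule, not a real learned classifier.

def toy_model(rank, years):
    """Hypothetical classifier: IF rank = 'professor' OR years > 6 THEN 'yes'."""
    return "yes" if rank == "professor" or years > 6 else "no"

# Test set chosen independently of the (imagined) training set.
test_set = [
    ("professor", 3, "yes"),
    ("assistant", 7, "yes"),
    ("assistant", 2, "no"),
    ("professor", 2, "no"),   # the rule will get this one wrong
]

correct = sum(1 for rank, years, label in test_set
              if toy_model(rank, years) == label)
accuracy = correct / len(test_set)
print(accuracy)  # 0.75 -- fraction of test samples classified correctly
```

Because the test tuples were never used to build the rule, this accuracy is an honest (if tiny-sample) estimate rather than an overfit one.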
    • When to use Classification Applications?
      • If you do not know the types of objects stored in your database, then you should begin with a Clustering algorithm, to find the various clusters (classes) of objects within the DB. This is Unsupervised Learning.
      • If you already know the classes of objects in your database, then you should apply Classification algorithms, to classify all remaining (or newly added) objects in the database using the known objects as a training set. This is Supervised Learning.
      • If you are still learning about the properties of known objects in the database, then this is Semi-Supervised Learning, which may involve Neural Network techniques.
    • Supervised vs. Unsupervised Learning
      • Unsupervised learning (clustering)
        • The class labels of training data are unknown.
        • Start with a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters within the data .
      • Supervised learning (classification)
        • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class for each of the observations.
        • New data are classified based on the training set.
    • Sample Classification Problem: (1) Build the Model (Descriptive)
      • Training Data → Classification Algorithm → Classifier (Model = rules)
      • Example rule learned: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
    • Sample Classification Problem: (2) Apply the Model (Predictive)
      • Test Data and Unseen Data → Classifier (rules) → Prediction
      • Example: (Jeff, Professor, 4) → Tenured? (The classifier is not perfect.)
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • True Funny Story from the Workplace *
      • "Discrepancy!" --
        • Database administrator (DBA) runs the quarterly sales report as usual. But the assistant to the CEO demands that the DBA run it again. Why? Because … “The actual sales numbers don't match the projected numbers!”
      • (*copyright: 2003 Computerworld, Inc.)
    • Issues in Classification - 1
      • Data Preparation:
        • Data cleaning
          • Preprocess data in order to reduce noise and handle missing values
        • Relevance analysis (feature selection)
          • The “interestingness problem”
          • Remove the irrelevant or redundant attributes
        • Data transformation
          • Generalize and/or normalize data (set values on a 0 to 1 scale)
          • Data are made categorical (sometimes called “nominal”, i.e., discrete)
          • If the data are continuous-valued, they should be discretized
      (continued)
    • Issues in Classification - 2
      • Handling different data types:
        • Continuous:
          • Numeric (e.g., salaries, ages, temperatures, rainfall, sales)
        • Discrete:
          • Binary (0 or 1; Yes/No; Male/Female)
          • Boolean (True/False)
          • Specific list of allowed values (e.g., zip codes; country names)
        • Categorical:
          • Non-numeric (character/text data) (e.g., people’s names)
          • Can be Ordinal (ordered) or Nominal (not ordered)
          • Reference: http://www.twocrows.com/glossary.htm#anchor311516
      • Examples of Classification Techniques :
        • Regression for continuous numeric data
        • Logistic Regression for discrete data
        • Bayesian Classification for categorical data
    • Issues in Classification - 3
      • Robustness:
        • Handling noise and missing values
      • Speed and scalability of model
        • time to construct the model
        • time to use the model
      • Scalability of implementation
        • ability to handle ever-growing databases
      • Interpretability:
        • understanding and insight provided by the model
      • Goodness of rules
        • decision tree size
        • compactness of classification rules
      • Predictive accuracy
    • Issues in Classification - 4
      • Overfitting
        • Definition: If your classifier (machine learning model) fits noise (i.e., pays attention to parts of the data that are irrelevant), then it is overfitting .
      (Diagram contrasting a GOOD fit with a BAD, overfitted one; it will be explained in next week’s lecture slides.)
    • How good is a classification algorithm?
      • Different schemes for characterizing performance:
      Confusion matrix for a two-class ASSESSMENT (taken from Figure 4.3 in the Dunham textbook):

                      Assigned Class A        Assigned Class B
        Is Class A    True Positive (Good)    False Negative (Bad)
        Is Class B    False Positive (Bad)    True Negative (Good)

      The same matrix as it applies to Information Retrieval algorithms:

                      Retrieved               Not Retrieved
        Relevant      True Positive           False Negative
        Not Relevant  False Positive          True Negative
    • Classification Performance: Person Height Database Example
      • True Positive: 20%; True Negative: 25%; False Positive: 45%; False Negative: 10%
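Treating the four percentages in the person-height example as confusion-matrix rates, the usual summary metrics follow directly. Precision and recall (the Information Retrieval view of the same matrix) are an addition here, not from the slide itself; a minimal sketch:

```python
# Sketch: summary metrics from confusion-matrix rates
# (TP 20%, TN 25%, FP 45%, FN 10%, as in the person-height example).
tp, tn, fp, fn = 0.20, 0.25, 0.45, 0.10

accuracy = tp + tn           # fraction of all samples classified correctly
precision = tp / (tp + fp)   # of samples assigned the class, fraction truly in it
recall = tp / (tp + fn)      # of true class members, fraction actually found

print(accuracy, round(precision, 2), round(recall, 2))  # 0.45 0.31 0.67
```

Note how a classifier can find most true positives (recall 0.67) and still be poor overall: here the many false positives drag accuracy below 50%.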
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • Statistical Algorithms
      • Regression -- a classifier (such as a decision tree) that predicts values of continuous variables.
      • Logistic Regression -- a classifier that predicts boolean (Yes/No, True/False, 0/1) values ( discrete estimators).
      • Bayesian Classification -- a probabilistic classification approach that is based on prior assumptions about the distribution of model parameters; predicts categorical values
    • (from last week): Regression
      • Regression is a predictive technique that discovers relationships between input and output patterns, where the values are continuous or real valued.
      • Many traditional statistical regression models are linear.
      • Neural networks, though biologically inspired, are in fact non-linear regression models.
      • Non-linear relationships occur in many multi-dimensional data mining applications.
    • An Example of a Regression Model
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • Bayesian Classification
      • We “discussed” Bayes Theorem last time…
    • Bayesian Classifiers
      • Bayes Theorem: P(C|X) = P(X|C) P(C) / P(X) which states …
      • posterior = (likelihood x prior) / evidence
      • P(C) = prior probability = probability that any given sample data is in class C, estimated before we have measured the sample data.
      • We wish to determine the posterior probability P(C|X) that estimates whether C is the correct class for a given set of sample data X.
    • Estimating Bayesian Classifiers
      • P(C|X) = P(X|C) P(C) / P(X) …
        • Estimate P(C_j) by counting the frequency of occurrence of each class C_j in the training data set.*
        • Estimate P(X_k) by counting the frequency of occurrence of each attribute value X_k in the data.*
        • Estimate P(X_k | C_j) by counting how often the attribute value X_k occurs in class C_j in the training data set.*
        • Calculate the desired end-result P(C_j | X_k), which is the classification = the probability that C_j is the correct class for a data item having attribute X_k.
      • (*Estimating these probabilities can be computationally very expensive for very large data sets.)
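The counting estimates listed above can be sketched directly in Python. The five (attribute value, class) training pairs below are a made-up toy set, not the lecture's car database:

```python
# Sketch of the Bayes counting estimates: P(C), P(X), P(X|C) by frequency,
# then P(C|X) via Bayes Theorem. Toy (hypothetical) training pairs.
from collections import Counter

data = [("red", "chevy"), ("blue", "chevy"), ("red", "honda"),
        ("white", "honda"), ("red", "chevy")]

n = len(data)
class_counts = Counter(c for _, c in data)   # counts for P(C_j)
attr_counts = Counter(x for x, _ in data)    # counts for P(X_k)
joint_counts = Counter(data)                 # counts for P(X_k | C_j)

def p_class(c): return class_counts[c] / n
def p_attr(x): return attr_counts[x] / n
def p_attr_given_class(x, c): return joint_counts[(x, c)] / class_counts[c]

def p_class_given_attr(c, x):
    # Bayes Theorem: P(C|X) = P(X|C) P(C) / P(X)
    return p_attr_given_class(x, c) * p_class(c) / p_attr(x)

print(p_class_given_attr("chevy", "red"))  # 2 of the 3 red samples are chevys
```

Each pass over the data is a single count, which is why these estimates scale linearly with the data set size yet can still be expensive for very large databases.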
    • Example of Bayes Classification
      • Show sample database
      • Show application of Bayes theorem:
        • Use sample database as the “set of priors”
        • Use Bayes results to classify new data
    • Example of Bayesian Classification :
      • Suppose that you have a database D that contains characteristics of a large number of different kinds of cars that are sorted according to each car’s manufacturer = the car’s classification C.
      • Suppose one of the attributes X in D is the car’s “color”.
      • Measure P(C) from the frequency of different manufacturers in D.
      • Measure P(X) from the frequency of different colors among the cars in D. (This estimate is made independent of manufacturer.)
      • Measure P(X|C) from frequency of cars with color X made by manufacturer C.
      • Okay, now you see a red car flying down the beltway. What is the car’s make (manufacturer)? You can estimate the likelihood that the car is from a given manufacturer C by calculating P(C|X) via Bayes Theorem:
        • P(C|X) = P(X|C) P(C) / P(X) (Class is “C” when P(C|X) is a maximum.)
      • With only one attribute, this is a trivial result, and not very informative. However, using a larger set of attributes (e.g., two-door, with sun roof ) leads to a much better classification estimator : example of a Bayes Belief Network .
    • Sample Database for Bayes Classification Example
      • x = car color; C = class of car (manufacturer). Car Database:

        Tuple   x       C
        1       red     honda
        2       blue    honda
        3       white   honda
        4       red     chevy
        5       blue    chevy
        6       white   chevy
        7       red     toyota
        8       white   toyota
        9       white   toyota
        10      red     chevy
        11      white   ford
        12      white   ford
        13      blue    ford
        14      red     chevy
        15      red     dodge

      • Some statistical results:
        x1 = red     P(x1) = 6/15        C1 = chevy    P(C1) = 5/15
        x2 = white   P(x2) = 6/15        C2 = honda    P(C2) = 3/15
        x3 = blue    P(x3) = 3/15        C3 = toyota   P(C3) = 3/15
                                         C4 = ford     P(C4) = 3/15
                                         C5 = dodge    P(C5) = 1/15
    • Application #1 of Bayes Theorem
      • Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)
      • From last slide, we know P(C) and P(X). Calculate P(X|C) and then we can perform the classification.
      Example #1: We see a red car. What type of car is it?

        P(C | red) = P(red | C) · P(C) / P(red)

        P(red | chevy)  = 3/5        P(red | ford)  = 0/3
        P(red | honda)  = 1/3        P(red | dodge) = 1/1
        P(red | toyota) = 1/3

      Therefore:
        P(chevy | red)  = 3/5 · 5/15 · 15/6 = 3/6 = 50%
        P(honda | red)  = 1/3 · 3/15 · 15/6 = 1/6 = 17%
        P(toyota | red) = 1/3 · 3/15 · 15/6 = 1/6 = 17%
        P(ford | red)   = 0
        P(dodge | red)  = 1/1 · 1/15 · 15/6 = 1/6 = 17%
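A short Python check of the red-car calculation, using the 15-tuple car database from the previous slide:

```python
# Reproducing the red-car Bayes calculation from the 15-tuple car database.
from collections import Counter

cars = [("red", "honda"), ("blue", "honda"), ("white", "honda"),
        ("red", "chevy"), ("blue", "chevy"), ("white", "chevy"),
        ("red", "toyota"), ("white", "toyota"), ("white", "toyota"),
        ("red", "chevy"), ("white", "ford"), ("white", "ford"),
        ("blue", "ford"), ("red", "chevy"), ("red", "dodge")]

n = len(cars)
color_counts = Counter(x for x, _ in cars)   # for P(X)
make_counts = Counter(c for _, c in cars)    # for P(C)
joint = Counter(cars)                        # for P(X | C)

def posterior(make, color):
    """P(C|X) = P(X|C) P(C) / P(X), all estimated by counting."""
    p_x_given_c = joint[(color, make)] / make_counts[make]
    return p_x_given_c * (make_counts[make] / n) / (color_counts[color] / n)

for make in ("chevy", "honda", "toyota", "ford", "dodge"):
    print(make, round(posterior(make, "red"), 2))
# chevy 0.5, honda 0.17, toyota 0.17, ford 0.0, dodge 0.17 -- matching the slide
```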
    • Results from Bayes Example #1
      • Therefore, the red car is most likely a Chevy (maybe a Camaro or Corvette?).
      • The red car is unlikely to be a Ford.
      • We choose the most probable class as the Classification of the new data item (red car): therefore, Classification = C1 (Chevy).
    • Application #2 of Bayes Theorem
      • Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)
      Example #2: We see a white car. What type of car is it?

        P(C | white) = P(white | C) · P(C) / P(white)

        P(white | chevy)  = 1/5      P(white | ford)  = 2/3
        P(white | honda)  = 1/3      P(white | dodge) = 0/1
        P(white | toyota) = 2/3

      Therefore:
        P(chevy | white)  = 1/5 · 5/15 · 15/6 = 1/6 = 17%
        P(honda | white)  = 1/3 · 3/15 · 15/6 = 1/6 = 17%
        P(toyota | white) = 2/3 · 3/15 · 15/6 = 2/6 = 33%
        P(ford | white)   = 2/3 · 3/15 · 15/6 = 2/6 = 33%
        P(dodge | white)  = 0
    • Results from Bayes Example #2
      • Therefore, the white car is equally likely to be a Ford or a Toyota.
      • The white car is unlikely to be a Dodge.
      • If we choose the most probable class as the Classification, we have a tie. You can either pick one of the two classes randomly (if you must pick), or else weight each class 0.50 in the output classification (C3, C4), if a probabilistic classification is permitted.
    • Intuitive Interpretation of Bayes Theorem
      • Recall the theorem, for a given attribute X_k and class C_j:
            • P(C_j | X_k) = P(X_k | C_j) P(C_j) / P(X_k)
      • If P(X_k | C_j) is small, then it is very unlikely that class C_j will ever produce an attribute value X_k, and so P(C_j | X_k) must likewise be small, since the class is unlikely to be C_j when we have X_k.
      • Similarly, if P(C_j) is small, then P(C_j | X_k) must be small, since the class is unlikely to be C_j under any circumstance.
      • Finally, if a given attribute value X_k is very rare in the training database, then P(X_k) will be very small, and therefore the value of P(C_j | X_k) will be very large [as long as the values of P(C_j) and P(X_k | C_j) are not small]. This makes sense, because if the rare value X_k is seen at all, then it must imply that the class is very likely to be C_j.
    • Why Use Bayesian Classification?
      • Probabilistic Learning : Allows you to calculate explicit probabilities for a hypothesis -- “learn as you go”. This is among the most practical approaches to certain types of learning problems (e.g., e-mail Spam detection).
      • Incremental : Each training example can incrementally increase/decrease the probability that a hypothesis is correct.
      • Data-Driven : Prior knowledge can be combined with observed data.
      • Probabilistic Prediction : Allows you to predict multiple hypotheses, each weighted by their own probabilities.
      • The Standard : Bayesian methods provide a standard of optimal decision-making against which other methods can be compared.
    • Naïve Bayesian Classification
      • Naïve Bayesian Classification assumes that, within each class C, the attributes x_1,…,x_k are independent of one another.
      • Naïve Bayes assumption: attribute independence
      • P(x_1,…,x_k | C) = P(x_1|C) · … · P(x_k|C)
      • (= a simple product of probabilities)
      • P(x_i | C) is estimated as the relative frequency of samples in class C for which attribute “i” has the value “x_i”.
      • This assumes that there are no correlations among the attribute values x_1,…,x_k (attribute independence).
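A minimal sketch of the naive product rule. The two-attribute rows below (color plus a made-up door-count attribute) are hypothetical, not the lecture's car database:

```python
# Naive Bayes sketch: P(C) * product_i P(x_i | C) per class, pick the max.
# Toy (hypothetical) training data: (color, doors) -> make.
from collections import Counter, defaultdict

rows = [(("red", "two-door"), "chevy"),
        (("red", "four-door"), "chevy"),
        (("white", "four-door"), "ford"),
        (("white", "two-door"), "ford"),
        (("red", "two-door"), "chevy")]

class_counts = Counter(c for _, c in rows)
attr_counts = defaultdict(Counter)   # (class, attribute position) -> value counts
for attrs, c in rows:
    for i, v in enumerate(attrs):
        attr_counts[(c, i)][v] += 1

def naive_score(c, attrs):
    # P(C) * product_i P(x_i | C). The evidence P(x_1,...,x_k) is the same
    # for every class, so it can be dropped when comparing classes.
    score = class_counts[c] / len(rows)
    for i, v in enumerate(attrs):
        score *= attr_counts[(c, i)][v] / class_counts[c]
    return score

best = max(class_counts, key=lambda c: naive_score(c, ("red", "two-door")))
print(best)  # chevy
```

The per-attribute counts sidestep estimating the full joint distribution, which is exactly what makes the independence assumption computationally attractive.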
    • The Independence Hypothesis…
      • … makes the computation possible (tractable)
      • … yields optimal classifiers when satisfied
      • … but is seldom satisfied in practice, as attributes (variables) are often correlated.
      • Some approaches to overcome this limitation:
        • Bayesian networks , that combine Bayesian reasoning with causal relationships between attributes
        • Decision trees , that reason on one attribute at a time, considering most important attributes first
    • Another Bayes Classification Example: SPAM
    • E-mail volume explosion parallels the general data volume explosion! http://www.computerworld.com/printthis/2003/0,4814,86632,00.html
      • Quotes from the article:
        • “ In 2002, people around the globe created enough new information to fill 500,000 U.S. Libraries of Congress.
        • The 5 billion GB of new data works out to about 800MB per person -- the equivalent of a stack of books 9m high.
        • In addition to looking at stored data, UC Berkeley measured electronic flows of new information at 18 billion GBytes in 2002.
        • Whether that information has any value is another question.”
      • Therefore, we need intelligent tools to filter the noise (spam).
    • Bayes Classification for Spam Detection and Removal: This Email Classified as Spam – Example #1. Here are the results:
      ---- Start SpamAssassin results: 5.60 points, 5 required ----
        * 0.2 -- BODY: Offers a limited time offer
        * 1.9 -- BODY: Save big money
        * 0.6 -- BODY: No such thing as a free lunch (1)
        * 0.8 -- BODY: Stop with the offers, coupons, discounts etc!
        * 0.1 -- BODY: Tells you how to stop further spam
        * 0.1 -- BODY: HTML font color is red
        * 0.1 -- BODY: Image tag with an ID code to identify you
        * 0.1 -- BODY: HTML font color is gray
        * 0.1 -- BODY: HTML included in message
        * 0.5 -- BODY: Message is 50% to 60% HTML
        * 0.7 -- URI: Uses a dotted-decimal IP address in URL
        * 0.3 -- URI: 'remove' URL contains an email address
        * 0.1 -- Headers include an "opt"ed phrase
      ---- End of SpamAssassin results ----
    • Bayes Classification for Spam Detection and Removal: This Email Classified as Spam – Example #2. Here are the results:
      ---- Start SpamAssassin results: 13.10 points, 5 required ----
        * 0.8 -- From: does not include a real name
        * 1.7 -- Subject contains lots of white space
        * 0.0 -- Subject talks about savings
        * 2.9 -- BODY: Message seems to contain obscured email address (rot13)
        * 2.1 -- BODY: Claims you registered with some kind of partner
        * 0.8 -- BODY: Stop with the offers, coupons, discounts etc!
        * 1.7 -- BODY: Contains "Toner Cartridge"
        * 0.1 -- BODY: HTML font color is gray
        * 0.2 -- BODY: HTML has unbalanced "body" tags
        * 0.1 -- BODY: HTML included in message
        * 1.4 -- BODY: Message is 10% to 20% HTML
        * 1.3 -- Subject contains a unique ID
      ---- End of SpamAssassin results ----
    • Classification of E-mail by another method: Support Vector Machines (SVM)
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • Decision Trees 101
      • We discussed Decision Trees earlier -- Summary:
        • Has a flow-chart-like tree structure
        • Each internal node denotes a test on an attribute
        • Each branch represents an outcome of a test
        • Leaf nodes represent class labels or class distribution
      • Using a decision tree:
        • Classify an unknown sample
        • Test attribute values of the sample against the decision tree
      • Issues in building decision trees:
        • How do you start?
        • When do you stop?
      • Specific Algorithms: ID3, C4.5, C5.0, CART
    • Decision Tree Terminology (diagram): the Root Node, its branches, the internal nodes, and the Leaf Nodes.
    • Building a Decision Tree
      • Decision tree generation consists of two phases:
        • Tree construction
          • At start, all the training examples are at the root.
          • Partition the examples top-down recursively based on selected attributes, using a “divide and conquer” approach.
        • Tree pruning
          • Identify and remove branches that reflect noise or “uninteresting” (insignificant) subclasses.
    • Terminating the Tree
      • Conditions for stopping the tree partitioning:
        • All samples at a given node already belong to the same class.
        • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf (= terminating node = the classification!).
        • There are no samples left.
        • The goodness measure drops below a preset threshold (this measure is determined according to different algorithms in different decision tree implementations).
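The stopping conditions above can be sketched as a recursive builder. This is a hedged toy illustration: split selection here is deliberately naive (first remaining attribute), standing in for a real goodness measure, and the three sample rows are hypothetical:

```python
# Sketch of top-down, divide-and-conquer tree construction with the three
# stopping conditions above: no samples left; node already pure; no
# attributes left (majority vote at the leaf).
from collections import Counter

def build_tree(rows, attrs):
    """rows: list of (attribute_dict, class_label); attrs: names still unused."""
    if not rows:
        return None                                   # no samples left
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:
        return labels[0]                              # all samples in one class
    if not attrs:
        return Counter(labels).most_common(1)[0][0]   # majority vote at the leaf
    attr = attrs[0]                                   # naive split choice
    return {(attr, value): build_tree(
                [(r, l) for r, l in rows if r[attr] == value], attrs[1:])
            for value in {r[attr] for r, _ in rows}}

rows = [({"rank": "professor", "years": "long"}, "yes"),
        ({"rank": "assistant", "years": "long"}, "yes"),
        ({"rank": "assistant", "years": "short"}, "no")]
tree = build_tree(rows, ["rank", "years"])
# The professor branch is already pure; the assistant branch splits on years.
```

Real algorithms such as ID3 differ only in how `attr` is chosen: they score each candidate split (e.g., by information gain) instead of taking the first attribute.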
    • How to Avoid Overfitting (“fitting the noise”)
      • The generated tree may overfit the training data:
        • Too many branches -- some may reflect anomalies due to noise or outliers
        • Results in poor classification accuracy for unseen samples
      • Two approaches to avoid overfitting:
        • Prepruning -- Halt tree construction early -- do not split a node if this would result in the “goodness measure” falling below a threshold (i.e., insufficient training examples)
          • Problem: Difficult to choose an appropriate threshold
        • Postpruning -- Remove branches from a “fully grown” tree -- get a sequence of progressively pruned trees
          • Use a set of data different from the training data to decide which is the “best pruned tree”
    • Why Use Decision Trees for Data Mining?
      • can use SQL queries for accessing the databases
      • can be constructed relatively faster than other methods
      • relatively faster learning speed (compared to other classification methods)
      • the tree is ultimately convertible to a set of simple and easy to understand classification rules
      • due to their intuitive graphical representation, they are easy to assimilate by humans
      • accuracy of decision tree classifiers is comparable or superior to other models
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • Decision Tree Algorithms - 1
      • Two steps:
        • Decision tree induction (construct tree from training data)
        • Application of DT to each tuple in database to find class
      • Refer to Definition 4.3 in Dunham text (page 93) -- be sure to understand this.
        • Note that the branches of the tree are sometimes referred to as arcs .
        • Each arc is labeled with a predicate (an action) that is applied to the attribute associated with the parent node.
        • Each internal node is a decision point , labeled by the attribute being tested at that point.
    • Decision Tree Algorithms - 2
      • ID3 approach:
        • “Iterative Dichotomiser” (R. Quinlan, 1979)
        • Picks predictors (nodes) and their splitting values (predicates) on the basis of an information gain metric:
          • The difference between the amount of information that is needed to make the correct prediction both before and after the split has been made.
          • If the amount of information required is much lower after the split is made, then the split is said to have “decreased the disorder of the original data”. This is good (i.e., the more ordered the data, then the more certain is our final classification.)
        • Refer to the “20 Questions” example in the textbook (page 97) -- good questions provide good DT splits: adult asks “Is it alive?”, but child asks “Is it my daddy?”.
    • Decision Tree Algorithms - 3
      • C4.5 and C5.0 approaches -- introduce numerous improvements and extensions to ID3:
        • Handles missing data in training set
        • Handles continuous data (not just categorical data)
        • Improved tree pruning (subtree replacement and subtree raising; depending on acceptable error rates)
        • Automated rule generation (allows for classification by the DT or by the rules alone)
        • ID3 tends to overfit (“a split for every attribute, an attribute for every split”), while C4.5 and C5.0 improve the information gain at each split.
        • C5.0 is the commercial version of C4.5, with proprietary rule generation algorithms. Targeted to large datasets.
    • Decision Tree Algorithms - 4
      • CART approach ( C lassification A nd R egression T rees):
        • Generates binary decision tree: only 2 children created at each node (whereas ID3 creates a child for each subcategory).
        • Time-consuming: a search is made at each split to find the best binary split. Uses entropy from Shannon’s Information Theory:
          • The measure of disorder, also called the Entropy: H = − Σ_i p_i log2(p_i),
          • where p_i is the probability of the i-th value occurring at a node.
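A small Python illustration of entropy and of the ID3-style information-gain idea from the previous slides (entropy before the split minus the weighted entropy after it). The four-row dataset is hypothetical:

```python
# Entropy H = -sum_i p_i * log2(p_i), and information gain for a split.
from math import log2
from collections import Counter

def entropy(labels):
    """Disorder of a node, in bits; 0 for a pure node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    after = 0.0
    for v in {r[attr_index] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == v]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after

labels = ["yes", "yes", "no", "no"]
rows = [("professor", "long"), ("professor", "short"),
        ("assistant", "long"), ("assistant", "short")]

print(entropy(labels))             # 1.0 -- maximal disorder for a 50/50 node
print(info_gain(rows, labels, 0))  # 1.0 -- splitting on attribute 0 gives pure nodes
print(info_gain(rows, labels, 1))  # 0.0 -- attribute 1 tells us nothing
```

This is the "20 Questions" point in numbers: a good question (attribute 0) removes all the disorder, while a useless one (attribute 1) removes none.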
    • Decision Tree Algorithms - 5
      • CHAID approach (Chi-squared Automatic Interaction Detector):
        • distributed in the popular SAS and SPSS statistical packages
        • similar to CART, except that CHAID uses chi-squared test instead of entropy test to identify split points, to find best independent variables
        • all predictors must be categorical, or put into categorical form through binning the data (i.e., no continuous data)
        • accuracy of CART and CHAID are similar
        • CHAID is part of large family: AID, THAID, MAID, XAID
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • Other Classification Methods
      • K-nearest neighbor (KNN) classifier
      • Case-based reasoning (CBR)
      • Artificial Neural Networks (ANN)
      • Genetic Algorithms (GA)
      • Rough set approach
      • Fuzzy set approaches
      • … more about some of these next time …
    • Outline
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example
    • A Classification Example
      • “Sorting incoming Fish on a conveyor according to species using optical sensing”
      • Sea bass
      • Species
      • Salmon
      • Problem Analysis
      • Set up a camera and take some sample images to extract features. Possible features may include:
          • Length
          • Lightness
          • Width
          • Number and shape of fins
          • Position of the mouth, etc…
          • This is the set of all suggested features that we will need to explore for possible use in our classifier!
      • Preprocessing
        • Use a segmentation operation on the images to isolate fish images from one another and from the background
      • Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features
      • The features are passed to a classifier
    • Classification
      • Select the length of the fish as a possible feature for discrimination
    • What do we learn?
      • The length of the fish is a poor classification feature by itself! (previous slide)
      • Select the lightness of the fish’s color as a possible classification feature. (next slide)
    • This attribute still does not cleanly separate the two classes
      • Threshold decision boundary and cost relationship
        • Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are misclassified as salmon!)
        • Task of decision theory
      • Adopt the lightness and add the width of the fish in some combined (transformed) variable:
      • Fish feature vector: x^T = [x_1, x_2], where x_1 = Lightness and x_2 = Width
    • This sloped line represents a reasonable Classifier
      • We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such “noisy features” (i.e., we must avoid overfitting the data).
      • Ideally, the best decision boundary should be the one which provides an optimal performance such as in the following figure:
    • This curve is a classic example of overfitting (bad!): It might be the result of applying the SVM algorithm (SVM = Support Vector Machine)
      • However, our satisfaction is premature because the central aim of designing a classifier is to correctly classify novel input
      • Issue of generalization!
      • (avoiding overfitting!)
    • This curved dividing line is now a good classifier
    • What about Misclassifications? (false positives and false negatives) Sea Bass misclassified as Salmon Salmon misclassified as Sea Bass
    • Cost-sensitive Classification
      • Penalize misclassifications of one class more than the other
      • Changes decision boundaries
    • What if Salmon is more expensive than Bass? What if Bass is more expensive than Salmon? Either way, the change in costs moves the decision boundary x* to a new location.
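The cost-driven boundary shift can be sketched as a minimum-expected-cost decision rule. The posterior value and the costs below are hypothetical, chosen only to show the boundary moving:

```python
# Cost-sensitive classification sketch: pick the class whose expected
# misclassification cost is lower. Posterior and costs are hypothetical.

def decide(p_salmon, cost_bass_as_salmon, cost_salmon_as_bass):
    """p_salmon: posterior probability that the fish is a salmon."""
    expected_cost_say_salmon = (1 - p_salmon) * cost_bass_as_salmon
    expected_cost_say_bass = p_salmon * cost_salmon_as_bass
    return ("salmon" if expected_cost_say_salmon < expected_cost_say_bass
            else "sea bass")

print(decide(0.6, 1.0, 1.0))  # salmon   -- equal costs: boundary at p = 0.5
print(decide(0.6, 5.0, 1.0))  # sea bass -- penalizing bass-as-salmon 5x
                              #             moves the boundary
```

With equal costs the rule reduces to "pick the most probable class"; raising one misclassification cost shifts the effective threshold, which is the boundary movement shown in the figure.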
    • Summary
    • Summary of Topics Covered - Week 4
      • Introduction to Classification Applications
      • Issues in Classification
      • Statistical Algorithms
        • Regression
        • Bayesian classification
      • Decision Tree Classification
      • Decision Tree Algorithms
      • Other Classification Methods
      • A Classification Example