CSCI101
Algorithms III
(Searching, Clustering, Classification)
Overview
•Searching Algorithms
•Linear Search
•Binary Search
•Clustering Algorithms
•K-Means
•Classification Algorithms
•K-NN
Searching
● The process used to find the location of a target among a list of
objects
● Searching an array finds the index of the first element that contains the target value
Linear Search (Sequential)
● Uses a loop to step through the array, starting with the first element
● Compares each element with the value being searched for (the key), and stops either when the value is found or when the end of the array is reached (the element is not found)
● Since the array elements are stored in linear order, searching them in that order is straightforward to implement
Linear Search (Sequential)
● Advantages:
– Simple: easy to understand and implement
– Doesn't require the data in the array to be sorted
● Disadvantages:
– Poor efficiency: it can take many comparisons to find a key in a large array
– The performance of the algorithm scales linearly with the size of the input array
Linear Search (Example)
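The worked example on the original slide is a figure; as a stand-in, here is a minimal Python sketch of the steps described above (the function name, array values and key are purely illustrative):

```python
def linear_search(arr, key):
    """Return the index of the first element equal to key, or -1 if not found."""
    for i in range(len(arr)):    # step through the array from the first element
        if arr[i] == key:        # compare each element with the key
            return i             # stop as soon as the key is found
    return -1                    # reached the end of the array: key not present

numbers = [7, 3, 9, 14, 3, 25]
print(linear_search(numbers, 14))   # prints 3
print(linear_search(numbers, 8))    # prints -1
```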
Binary Search
● A searching algorithm that requires the array to be sorted
● Algorithm
1. The initial search region is the whole array.
2. Look at the data value in the middle of the search region.
3. If you’ve found your target, stop.
4. If your target is less than the middle data value, the new search region is the lower half of the current region.
5. If your target is greater than the middle data value, the new search region is the upper half of the current region.
6. Continue from Step 2 (if the search region becomes empty, the target is not in the array).
Binary Search (Example)
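The original example slide is a figure; a minimal Python sketch of the six steps above, assuming the array is already sorted in ascending order (the names and values are illustrative):

```python
def binary_search(sorted_arr, target):
    """Return the index of target in sorted_arr, or -1 if it is not present."""
    low, high = 0, len(sorted_arr) - 1           # step 1: the whole array
    while low <= high:                           # a non-empty search region remains
        mid = (low + high) // 2
        if sorted_arr[mid] == target:            # step 3: found the target, stop
            return mid
        elif target < sorted_arr[mid]:           # step 4: keep the lower half
            high = mid - 1
        else:                                    # step 5: keep the upper half
            low = mid + 1
    return -1                                    # search region is empty: not found

values = [2, 5, 8, 12, 16, 23, 38, 56, 72, 91]
print(binary_search(values, 23))   # prints 5
print(binary_search(values, 40))   # prints -1
```

Each pass halves the search region, so the number of comparisons grows with the logarithm of the array size rather than linearly.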
Clustering
● Clustering is concerned with grouping together
objects that are similar to each other and
dissimilar to the objects belonging to other
clusters.
● Examples:
– In a medical application we might wish to find clusters of patients
with similar symptoms.
– In a document retrieval application we might wish to find clusters
of documents with related content.
– In an economics application we might be interested in finding
countries whose economies are similar.
Clustering Example
K-Means Clustering
● k-means clustering is an exclusive clustering
algorithm. Each object is assigned to precisely
one of a set of clusters. (There are other methods
that allow objects to be in more than one
cluster.)
● For this method of clustering we start by deciding
how many clusters k we would like to form from
our data.
● The value of k is generally a small integer, such as
2, 3, 4 or 5, but may be larger.
The k-Means Clustering Algorithm
1. Choose a value of k.
2. Select k objects in an arbitrary fashion. Use
these as the initial set of k centroids.
3. Assign each object to the cluster whose centroid is nearest to it.
4. Recalculate the centroids of the k clusters.
5. Repeat steps 3 and 4 until the centroids no
longer move.
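A minimal pure-Python sketch of these five steps for two-dimensional points, using Euclidean distance as the closeness measure; the data points and names below are illustrative and are not the example from the following slides:

```python
import random

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def k_means(points, k, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)            # steps 1-2: arbitrary initial centroids
    while True:
        # Step 3: assign each point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recalculate each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append((sum(p[0] for p in cluster) / len(cluster),
                                      sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the old centroid for an empty cluster
        # Step 5: stop when the centroids no longer move.
        if new_centroids == centroids:
            return clusters, centroids
        centroids = new_centroids

data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
clusters, centroids = k_means(data, k=2)
print(centroids)
```

In practice a library implementation such as scikit-learn's KMeans would normally be used; the sketch above is only meant to mirror the five steps.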
Example (k=3)
Initial Clusters
Revised Clusters
Third Set of Clusters
These are the same clusters as before. Their centroids will be the
same as those from which the clusters were generated. Hence the
termination condition of the k-means algorithm has been met and
these are the final clusters produced by the algorithm for the initial
choice of centroids made.
Other points to consider
● The initial selection of centroids affects the k-means results
● Outliers should be removed first
● Normalize the data so that no single attribute dominates the distance (a small sketch follows below)
● Euclidean distance does not make sense in some cases, so select the proper closeness measure.
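As an illustration of the normalization point above, here is a small sketch of min-max scaling, which maps every attribute into [0, 1] so that no single attribute dominates the distance calculation (the helper name and sample values are made up):

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                            # constant attribute: map everything to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

ages    = [25, 40, 31, 58]                  # small numeric range
incomes = [30000, 90000, 45000, 120000]     # large range: would dominate raw Euclidean distances
print(min_max_normalize(ages))
print(min_max_normalize(incomes))
```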
Classification
● Classification is dividing up objects so that each is
assigned to one of a number of mutually
exhaustive and exclusive categories known as
classes.
● Examples:
– customers who are likely to buy or not buy a particular
product in a supermarket
– people who are at high, medium or low risk of acquiring a
certain illness
– people who closely resemble, slightly resemble or do not
resemble someone seen committing a crime
Classification Example
Training Set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class to be predicted):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

A classifier is learned from the Training Set to produce a Model, which is then applied to the Test Set to predict the unknown Cheat values.
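A minimal sketch of this learn-then-predict workflow, assuming pandas and scikit-learn are available; the slides do not say which classifier is learned, so a decision tree is used here purely as a placeholder:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Refund":        ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Cheat":         ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Refund":        ["No", "Yes", "No", "Yes", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
    "TaxableIncome": [75, 50, 150, 90, 40, 80],
})

# One-hot encode the categorical attributes so the model sees numeric input.
X_train = pd.get_dummies(train.drop(columns="Cheat"))
y_train = train["Cheat"]
X_test  = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # "Learn Classifier" -> "Model"
print(model.predict(X_test))                             # predicted Cheat values for the Test Set
```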
K-Nearest Neighbor (K-NN) Classifier
The algorithm can be summarized as follows (a sketch in code follows the list):
● A positive integer k is specified, along with a new sample (e.g. k = 1, 3, 5)
● We select the k entries in our training data set which are closest to the new sample
● We find the most common classification among these k entries
● This is the classification we give to the new sample
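A minimal pure-Python sketch of these steps, using Euclidean distance and simple majority voting; the toy training points below are made up for illustration and are not the data set plotted on the following slide:

```python
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn_classify(training_data, new_sample, k):
    """training_data is a list of (point, label) pairs."""
    # Select the k training entries closest to the new sample.
    neighbours = sorted(training_data,
                        key=lambda item: euclidean(item[0], new_sample))[:k]
    # Return the most common classification among those k entries.
    labels = [label for _, label in neighbours]
    return Counter(labels).most_common(1)[0][0]

# Toy two-attribute, two-class training set (illustrative values only).
train = [((1.0, 1.5), '-'), ((2.0, 2.0), '-'), ((2.5, 1.0), '-'),
         ((8.5, 10.0), '+'), ((9.0, 11.5), '+'), ((10.0, 10.5), '+')]
print(knn_classify(train, (9.1, 11.0), k=5))   # '+' by majority vote (3 of the 5 neighbours)
```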
Training Data Set
● Two classes
● Two attributes
● How do we classify the point (9.1, 11)?
5-NN Classifier
● The five nearest neighbours are labelled with three + signs and two − signs
● So a basic 5-NN classifier would classify the unseen instance as ‘positive’ by a form of majority voting.
Effect of K
Small values of k make the classifier follow the training data very closely (a jagged, easily overfit decision boundary), while larger values of k give smoother boundaries; see:
https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7
