1) Sets are collections of symbols whose order does not matter, while strings are defined by the sequence or arrangement of symbols.
2) Key set problems discussed include set cover, set packing, and their applications to graph problems. Greedy algorithms provide effective heuristics for set cover.
3) String matching finds instances of a pattern string within a text string. Boyer-Moore and Knuth-Morris-Pratt provide efficient algorithms, while approximate string matching allows matches with errors, scored by insertion, deletion, and substitution costs.
1. Set and String Problems
• Sets and strings both represent collections of objects.
• The key difference is whether order matters.
• Sets are collections of symbols whose order is assumed to carry no significance.
• Strings are defined by the sequence or arrangement of their symbols.
2. Set and String Problems
• I will discuss four subjects:
1- Set Cover
2- Set Packing
3- String Matching
4- Approximate String Matching
3. Set Cover
• Input description: A collection of subsets S = {S1, . . . , Sm} of the universal set U = {1, . . . , n}.
• Problem description: What is the smallest subset T of S whose union equals the universal set, i.e., T1 ∪ T2 ∪ · · · ∪ T|T| = U?
5. Set Cover
Are you allowed to cover elements more than once?
• The distinction here is between set cover and set packing.
– Set cover: elements may be covered more than once.
– Set packing: elements may not be covered more than once.
6. Set Cover
Are your sets derived from the edges or vertices of a graph?
– Set cover is a very general problem, and includes several useful graph problems as special cases.
» Vertex cover.
7. Set Cover & Vertex Cover
– U = {a, b, c, d, e}
– S1 = {a, b}
– S2 = {a}
– S3 = {d, e}
– S4 = {c, e}
– S5 = {b, c, d}
[Figure: the subsets S1–S5 drawn as regions over the elements a–e]
– The greedy heuristic finds a cover within an O(log n) factor of optimal.
8. Set Cover &Greedy
• Greedy is the most natural and effective heuristic for set cover.
1. Begin by selecting the largest subset for the cover, and delete all its elements from the universal set.
2. Repeatedly add the subset containing the largest number of remaining uncovered elements, until every element is covered.
3. This heuristic always produces a set cover, within a factor of O(ln n) of the optimal cover.
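The greedy heuristic above can be sketched in Python. The function name and the use of Python sets are illustrative choices, not from the slides:

```python
def greedy_set_cover(universe, subsets):
    """Greedy heuristic for set cover: repeatedly pick the subset that
    covers the most still-uncovered elements. Achieves an O(ln n)
    approximation factor."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Choose the subset with the largest intersection with the
        # uncovered elements (the "largest remaining" rule above).
        best = max(subsets, key=lambda s: len(uncovered & set(s)))
        if not (uncovered & set(best)):
            raise ValueError("some elements cannot be covered")
        cover.append(set(best))
        uncovered -= set(best)
    return cover
```

On the slide-7 instance (U = {a, . . . , e} with S1–S5), greedy first takes S5 = {b, c, d}, then covers the leftover a and e, producing a cover of three subsets.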
9. Set Packing
• Input description: A set of subsets S =
{S1, . . . , Sm} of the universal set U = {1, . . . ,
n}.
• Problem description: Select (an ideally small)
collection of mutually disjoint subsets from S
whose union is the universal set.
10. Set Packing
Must every element appear in exactly one selected subset?
• In exact cover, we seek a collection of subsets such that each element is covered exactly once. The airplane scheduling problem above has the flavor of exact covering, since every plane and crew has to be employed.
12. String Matching
• Input description: A text string t of length
n. A pattern string p of length m.
• Problem description: Find the first (or all)
instances of pattern p in the text.
13. String Matching
• The difference:
– String matching: matching without errors.
– Approximate string matching: matching with errors.
Spelling checkers scan an input text, look each word up in the dictionary, and reject any strings that do not match.
14. String Matching
• Applications:
– Searching for keywords in a file.
– Search engines (like Google and Openfind).
– Database searching (GenBank).
• History of string search:
– The brute-force algorithm:
• Invented at the dawn of computer history.
• Re-invented many times; still common.
• Worst case: O(m·n).
– KMP algorithm:
• Proposed by Knuth, Morris, and Pratt in 1977.
• O(m + n).
– Boyer-Moore algorithm:
• Proposed by Boyer and Moore in 1977.
• Best case: O(n/m); sublinear on typical inputs.
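The brute-force algorithm listed above is short enough to sketch directly; a minimal version (name is my own):

```python
def brute_force_search(text, pattern):
    """Naive string matching: try every alignment of the pattern
    against the text. Worst case O(m*n) comparisons."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        # Compare the pattern against the text window starting at i.
        if text[i:i + m] == pattern:
            return i  # index of the first occurrence
    return -1  # no match
```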
15. Boyer-Moore
• Compares right to left.
• Boyer-Moore (Example)
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
A B C E F G A B C D E
p[0] p[1] p[2] p[3]
A B C D
N
There is no E in the pattern, so no alignment that places a pattern character under t[3] can match. Move the pattern four boxes to the right.
16. Boyer-Moore
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
A B C E F G A B C D E
p[0] p[1] p[2] p[3]
A B C D
N
Again, no match. But there is a B in the pattern, so move the pattern two boxes to the right.
17. Boyer-Moore
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10]
A B C E F G A B C D E
p[0] p[1] p[2] p[3]
A B C D
Y Y Y Y
All four characters match: the pattern is found starting at t[6].
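The example above uses Boyer-Moore's bad-character rule. A sketch of that simplified variant (the full algorithm also uses a good-suffix rule, omitted here; the function name is my own):

```python
def boyer_moore_search(text, pattern):
    """Boyer-Moore matching with the bad-character rule only:
    compare right to left, and on a mismatch shift the pattern so
    the mismatched text character aligns with its last occurrence
    in the pattern (or past it entirely)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Last index at which each character occurs in the pattern.
    last = {c: i for i, c in enumerate(pattern)}
    i = m - 1  # text index aligned with the pattern's last character
    while i < n:
        j, k = m - 1, i
        # Scan right to left while characters match.
        while j >= 0 and text[k] == pattern[j]:
            j -= 1
            k -= 1
        if j < 0:
            return k + 1  # full match found
        # Bad-character shift (at least one position).
        i += max(1, j - last.get(text[k], -1))
    return -1
```

On the slide example ("ABCEFGABCDE" vs. "ABCD"), the first mismatch on E shifts the pattern four boxes, the B mismatch shifts it two, and the match is found at t[6], exactly as traced above.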
18. Knuth-Morris-Pratt
• Searches for occurrences of a "word" W within a main "text string" S.
• Bypasses re-examination of previously matched characters.
19. Knuth-Morris-Pratt
(Example)
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10] t[11] t[12] t[13]
A B C ␣ A B C D A B ␣ A B C
(␣ denotes a space character in the text)
p[0] p[1] p[2] p[3] p[4] p[5] p[6]
A B C D A B D
Y Y Y N
m=0
20. Knuth-Morris-Pratt
(Example)
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10] t[11] t[12] t[13]
A B C ␣ A B C D A B ␣ A B C
p[0] p[1] p[2] p[3] p[4] p[5] p[6]
A B C D A B D
Y Y Y Y Y Y N
m=4
21. Knuth-Morris-Pratt
(Example)
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10] t[11] t[12] t[13]
A B C ␣ A B C D A B ␣ A B C
p[0] p[1] p[2] p[3] p[4] p[5] p[6]
A B C D A B D
N
m = 10
22. Knuth-Morris-Pratt
(Example)
t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t[10] t[11] t[12] t[13]
A B C ␣ A B C D A B ␣ A B C
p[0] p[1] p[2] ..
A B C ..
Y Y Y
m = 11
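The trace above follows KMP's shifts; a compact sketch of the algorithm (function and variable names are my own):

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt matching: precompute a failure function so
    that previously matched text characters are never re-examined.
    O(m + n) overall."""
    m = len(pattern)
    if m == 0:
        return 0
    # fail[j] = length of the longest proper prefix of pattern[:j+1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    # Scan the text once, falling back via fail[] on mismatches.
    k = 0  # number of pattern characters currently matched
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == m:
            return i - m + 1  # first occurrence
    return -1
```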
23. Approximate String Matching
• Input description: A text string t and a pattern string p.
• Problem description: What is the minimum-cost way to
transform t to p using insertions, deletions, and
substitutions?
25. Approximate String Matching
• Dynamic programming provides the basic approach to approximate string matching. Let D[i, j] denote the cost of editing the first i characters of the pattern string p into the first j characters of the text t. The recurrence follows because we must have done something with the tail characters pi and tj. Our only options are matching / substituting one for the other, deleting pi, or inserting a match for tj. Thus, D[i, j] is the minimum of the costs of these possibilities:
1. If pi = tj then D[i − 1, j − 1], else D[i − 1, j − 1] + substitution cost.
2. D[i − 1, j] + deletion cost of pi.
3. D[i, j − 1] + insertion cost of tj.
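The recurrence translates directly into a dynamic programming table; a sketch assuming unit costs by default (the parameter names are illustrative):

```python
def edit_distance(p, t, sub_cost=1, ins_cost=1, del_cost=1):
    """Approximate string matching by dynamic programming:
    D[i][j] = minimum cost to edit the first i characters of p
    into the first j characters of t."""
    m, n = len(p), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * del_cost   # delete all of p's first i characters
    for j in range(1, n + 1):
        D[0][j] = j * ins_cost   # insert all of t's first j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # The three options from the recurrence above:
            match = D[i - 1][j - 1] + (0 if p[i - 1] == t[j - 1] else sub_cost)
            delete = D[i - 1][j] + del_cost   # delete p's character
            insert = D[i][j - 1] + ins_cost   # insert t's character
            D[i][j] = min(match, delete, insert)
    return D[m][n]
```

With unit costs this computes the classic edit (Levenshtein) distance; for example, editing "thou" into "you" costs 2 (one substitution and one deletion).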