SlideShare a Scribd company logo
Lecture 17:
         Edit Distance
         Steven Skiena

Department of Computer Science
 State University of New York
 Stony Brook, NY 11794–4400

http://www.cs.sunysb.edu/∼skiena
Problem of the Day
Suppose you are given three strings of characters: X, Y , and
Z, where |X| = n, |Y | = m, and |Z| = n + m. Z is said to
be a shuffle of X and Y iff Z can be formed by interleaving
the characters from X and Y in a way that maintains the left-
to-right ordering of the characters from each string.
1. Show that cchocohilaptes is a shuffle of chocolate and
   chips, but chocochilatspe is not.
2. Give an efficient dynamic-programming algorithm that
   determines whether Z is a shuffle of X and Y . Hint: The
   values of the dynamic programming matrix you construct
   should be Boolean, not numeric.
Edit Distance
Misspellings make approximate pattern matching an impor-
tant problem
If we are to deal with inexact string matching, we must first
define a cost function telling us how far apart two strings are,
i.e., a distance measure between pairs of strings.
A reasonable distance measure minimizes the cost of the
changes which have to be made to convert one string to
another.
String Edit Operations
There are three natural types of changes:
 • Substitution – Change a single character from pattern s
   to a different character in text t, such as changing “shot”
   to “spot”.
 • Insertion – Insert a single character into pattern s to help
   it match text t, such as changing “ago” to “agog”.
 • Deletion – Delete a single character from pattern s to
   help it match text t, such as changing “hour” to “our”.
Recursive Algorithm
We can compute the edit distance with recursive algorithm
using the observation that the last character in the string must
either be matched, substituted, inserted, or deleted.
If we knew the cost of editing the three pairs of smaller
strings, we could decide which option leads to the best
solution and choose that option accordingly.
We can learn this cost, through the magic of recursion:
Recursive Edit Distance Code
#define MATCH 0 (* enumerated type symbol for match *)
#define INSERT 1 (* enumerated type symbol for insert *)
#define DELETE 2 (* enumerated type symbol for delete *)

int string compare(char *s, char *t, int i, int j)
{
        int k; (* counter *)
        int opt[3]; (* cost of the three options *)
        int lowest cost; (* lowest cost *)

      if (i == 0) return(j * indel(’ ’));
      if (j == 0) return(i * indel(’ ’));

      opt[MATCH] = string compare(s,t,i-1,j-1) + match(s[i],t[j]);
      opt[INSERT] = string compare(s,t,i,j-1) + indel(t[j]);
      opt[DELETE] = string compare(s,t,i-1,j) + indel(s[i]);

      lowest cost = opt[MATCH];
      for (k=INSERT; k<=DELETE; k++)
            if (opt[k] < lowest cost) lowest cost = opt[k];

      return( lowest cost );
}
Speeding it Up
This program is absolutely correct but takes exponential time
because it recomputes values again and again and again!
But there can only be |s| · |t| possible unique recursive calls,
since there are only that many distinct (i, j) pairs to serve as
the parameters of recursive calls.
By storing the values for each of these (i, j) pairs in a table,
we can avoid recomputing them and just look them up as
needed.
The Dynamic Programming Table
The table is a two-dimensional matrix m where each of the
|s| · |t| cells contains the cost of the optimal solution of this
subproblem, as well as a parent pointer explaining how we
got to this location:
typedef struct {
       int cost; (* cost of reaching this cell *)
       int parent; (* parent cell *)
} cell;

cell m[MAXLEN+1][MAXLEN+1]; (* dynamic programming table *)
Differences with Dynamic Programming
The dynamic programming version has three differences
from the recursive version:
 • First, it gets its intermediate values using table lookup
   instead of recursive calls.
 • Second, it updates the parent field of each cell, which
   will enable us to reconstruct the edit-sequence later.
 • Third, it is instrumented using a more general
   goal cell() function instead of just returning
   m[|s|][|t|].cost. This will enable us to apply this
   routine to a wider class of problems.
We assume that each string has been padded with an initial
blank character, so the first real character of string s sits in
s[1].
Evaluation Order
To determine the value of cell (i, j) we need three values
sitting and waiting for us, namely, the cells (i − 1, j − 1),
(i, j − 1), and (i − 1, j). Any evaluation order with this
property will do, including the row-major order used in this
program.
Think of the cells as vertices, where there is an edge (i, j) if
cell i’s value is needed to compute cell j. Any topological
sort of this DAG provides a proper evaluation order.
Dynamic Programming Edit Distance

int string compare(char *s, char *t)
{
        int i,j,k; (* counters *)
        int opt[3]; (* cost of the three options *)

      for (i=0; i<MAXLEN; i++) {
             row init(i);
             column init(i);
      }

      for (i=1; i<strlen(s); i++)
             for (j=1; j<strlen(t); j++) {
                    opt[MATCH] = m[i-1][j-1].cost + match(s[i],t[j]);
                    opt[INSERT] = m[i][j-1].cost + indel(t[j]);
                    opt[DELETE] = m[i-1][j].cost + indel(s[i]);

                    m[i][j].cost = opt[MATCH];
                    m[i][j].parent = MATCH;
                    for (k=INSERT; k<=DELETE; k++)
                          if (opt[k] < m[i][j].cost) {
                                 m[i][j].cost = opt[k];
                                 m[i][j].parent = k;
                          }
                    }
Example
Below is an example run, showing the cost and parent values
turning “thou shalt not” to “you should not” in five moves:
                  P         y    o    u    -    s   h   o   u   l    d    -    n    o    t
           T    pos    0    1    2    3    4    5   6   7   8   9   10   11   12   13   14
            :          0    1    2    3    4    5   6   7   8   9   10   11   12   13   14
           t:    1     1    1    2    3    4    5   6   7   8   9   10   11   12   13   13
           h:    2     2    2    2    3    4    5   5   6   7   8    9   10   11   12   13
           o:    3     3    3    2    3    4    5   6   5   6   7    8    9   10   11   12
           u:    4     4    4    3    2    3    4   5   6   5   6    7    8    9   10   11
           -:    5     5    5    4    3    2    3   4   5   6   6    7    7    8    9   10
           s:    6     6    6    5    4    3    2   3   4   5   6    7    8    8    9   10
           h:    7     7    7    6    5    4    3   2   3   4   5    6    7    8    9   10
           a:    8     8    8    7    6    5    4   3   3   4   5    6    7    8    9   10
           l:    9     9    9    8    7    6    5   4   4   4   4    5    6    7    8    9
           t:   10    10   10    9    8    7    6   5   5   5   5    5    6    7    8    8
           -:   11    11   11   10    9    8    7   6   6   6   6    6    5    6    7    8
           n:   12    12   12   11   10    9    8   7   7   7   7    7    6    5    6    7
           o:   13    13   13   12   11   10    9   8   7   8   8    8    7    6    5    6
           t:   14    14   14   13   12   11   10   9   8   8   9    9    8    7    6    5


The edit sequence from “thou-shalt-not” to “you-should-not”
is DSMMMMMISMSMMMM
Reconstructing the Path
Solutions to a given dynamic programming problem are
described by paths through the dynamic programming matrix,
starting from the initial configuration (the pair of empty
strings (0, 0)) down to the final goal state (the pair of full
strings (|s|, |t|)).
Reconstructing these decisions is done by walking backward
from the goal state, following the parent pointer to an
earlier cell. The parent field for m[i,j] tells us whether
the transform at (i, j) was MATCH, INSERT, or DELETE.
Walking backward reconstructs the solution in reverse order.
However, clever use of recursion can do the reversing for us:
Reconstruct Path Code
reconstruct path(char *s, char *t, int i, int j)
{
      if (m[i][j].parent == -1) return;

      if (m[i][j].parent == MATCH) {
             reconstruct path(s,t,i-1,j-1);
             match out(s, t, i, j);
             return;
      }
      if (m[i][j].parent == INSERT) {
             reconstruct path(s,t,i,j-1);
             insert out(t,j);
             return;
      }
      if (m[i][j].parent == DELETE) {
             reconstruct path(s,t,i-1,j);
             delete out(s,i);
             return;
      }
}
Customizing Edit Distance

 • Table Initialization – The functions row init() and
   column init() initialize the zeroth row and column
   of the dynamic programming table, respectively.
 • Penalty Costs – The functions match(c,d) and
   indel(c) present the costs for transforming character
   c to d and inserting/deleting character c. For edit distance,
   match costs nothing if the characters are identical, and 1
   otherwise, while indel always returns 1.
 • Goal Cell Identification – The function goal cell
   returns the indices of the cell marking the endpoint of the
solution. For edit distance, this is defined by the length of
 the two input strings.
• Traceback Actions – The functions match out,
  insert out, and delete out perform the appropri-
  ate actions for each edit-operation during traceback. For
  edit distance, this might mean printing out the name of
  the operation or character involved, as determined by the
  needs of the application.
Substring Matching
Suppose that we want to find where a short pattern s best
occurs within a long text t, say, searching for “Skiena” in all
its misspellings (Skienna, Skena, Skina, . . . ).
Plugging this search into our original edit distance function
will achieve little sensitivity, since the vast majority of any
edit cost will be that of deleting the body of the text.
We want an edit distance search where the cost of starting
the match is independent of the position in the text, so that a
match in the middle is not prejudiced against.
Likewise, the goal state is not necessarily at the end of both
strings, but the cheapest place to match the entire pattern
somewhere in the text.
Customizations
row init(int i)
{
      m[0][i].cost = 0; (* note change *)
      m[0][i].parent = -1; (* note change *)
}

goal cell(char *s, char *t, int *i, int *j)
{
      int k; (* counter *)

      *i = strlen(s) - 1;
      *j = 0;
      for (k=1; k<strlen(t); k++)
             if (m[*i][k].cost < m[*i][*j].cost) *j = k;
}
Longest Common Subsequence
The longest common subsequence (not substring) between
“democrat” and “republican” is eca.
A common subsequence is defined by all the identical-
character matches in an edit trace. To maximize the number
of such traces, we must prevent substitution of non-identical
characters.
int match(char c, char d)
{
      if (c == d) return(0);
      else return(MAXLEN);
}
Maximum Monotone Subsequence
A numerical sequence is monotonically increasing if the ith
element is at least as big as the (i − 1)st element.
The maximum monotone subsequence problem seeks to
delete the fewest number of elements from an input string
S to leave a monotonically increasing subsequence.
Thus a longest increasing subsequence of “243517698” is
“23568.”
Reduction to LCS
In fact, this is just a longest common subsequence problem,
where the second string is the elements of S sorted in
increasing order.
Any common sequence of these two must (a) represent
characters in proper order in S, and (b) use only characters
with increasing position in the collating sequence, so the
longest one does the job.

More Related Content

What's hot

6.2 the indefinite integral
6.2 the indefinite integral 6.2 the indefinite integral
6.2 the indefinite integral
dicosmo178
 
Insersion & Bubble Sort in Algoritm
Insersion & Bubble Sort in AlgoritmInsersion & Bubble Sort in Algoritm
Insersion & Bubble Sort in Algoritm
Ehsan Ehrari
 
Lecture 18 antiderivatives - section 4.8
Lecture 18   antiderivatives - section 4.8Lecture 18   antiderivatives - section 4.8
Lecture 18 antiderivatives - section 4.8
njit-ronbrown
 
Folding Unfolded - Polyglot FP for Fun and Profit - Haskell and Scala
Folding Unfolded - Polyglot FP for Fun and Profit - Haskell and ScalaFolding Unfolded - Polyglot FP for Fun and Profit - Haskell and Scala
Folding Unfolded - Polyglot FP for Fun and Profit - Haskell and Scala
Philip Schwarz
 
Operation on functions
Operation on functionsOperation on functions
Operation on functions
Jeralyn Obsina
 
Indefinite Integral
Indefinite IntegralIndefinite Integral
Indefinite Integral
JelaiAujero
 
Recursion
RecursionRecursion
Recursion
James Wong
 
Functions
FunctionsFunctions
Decreasing and increasing functions by arun umrao
Decreasing and increasing functions by arun umraoDecreasing and increasing functions by arun umrao
Decreasing and increasing functions by arun umrao
ssuserd6b1fd
 
Karnaugh
KarnaughKarnaugh
Karnaugh
camilosena
 
Ee693 questionshomework
Ee693 questionshomeworkEe693 questionshomework
Ee693 questionshomework
Gopi Saiteja
 
Chemistry Assignment Help
Chemistry Assignment Help Chemistry Assignment Help
Chemistry Assignment Help
Edu Assignment Help
 
Problem Solving by Computer Finite Element Method
Problem Solving by Computer Finite Element MethodProblem Solving by Computer Finite Element Method
Problem Solving by Computer Finite Element Method
Peter Herbert
 
Recursion
RecursionRecursion
Recursion
Ssankett Negi
 
Lesson 21: Antiderivatives (slides)
Lesson 21: Antiderivatives (slides)Lesson 21: Antiderivatives (slides)
Lesson 21: Antiderivatives (slides)
Matthew Leingang
 
What is meaning of epsilon and delta in limits of a function by Arun Umrao
What is meaning of epsilon and delta in limits of a function by Arun UmraoWhat is meaning of epsilon and delta in limits of a function by Arun Umrao
What is meaning of epsilon and delta in limits of a function by Arun Umrao
ssuserd6b1fd
 
Ch7
Ch7Ch7
Backtracking
BacktrackingBacktracking
Backtracking
Sally Salem
 

What's hot (18)

6.2 the indefinite integral
6.2 the indefinite integral 6.2 the indefinite integral
6.2 the indefinite integral
 
Insersion & Bubble Sort in Algoritm
Insersion & Bubble Sort in AlgoritmInsersion & Bubble Sort in Algoritm
Insersion & Bubble Sort in Algoritm
 
Lecture 18 antiderivatives - section 4.8
Lecture 18   antiderivatives - section 4.8Lecture 18   antiderivatives - section 4.8
Lecture 18 antiderivatives - section 4.8
 
Folding Unfolded - Polyglot FP for Fun and Profit - Haskell and Scala
Folding Unfolded - Polyglot FP for Fun and Profit - Haskell and ScalaFolding Unfolded - Polyglot FP for Fun and Profit - Haskell and Scala
Folding Unfolded - Polyglot FP for Fun and Profit - Haskell and Scala
 
Operation on functions
Operation on functionsOperation on functions
Operation on functions
 
Indefinite Integral
Indefinite IntegralIndefinite Integral
Indefinite Integral
 
Recursion
RecursionRecursion
Recursion
 
Functions
FunctionsFunctions
Functions
 
Decreasing and increasing functions by arun umrao
Decreasing and increasing functions by arun umraoDecreasing and increasing functions by arun umrao
Decreasing and increasing functions by arun umrao
 
Karnaugh
KarnaughKarnaugh
Karnaugh
 
Ee693 questionshomework
Ee693 questionshomeworkEe693 questionshomework
Ee693 questionshomework
 
Chemistry Assignment Help
Chemistry Assignment Help Chemistry Assignment Help
Chemistry Assignment Help
 
Problem Solving by Computer Finite Element Method
Problem Solving by Computer Finite Element MethodProblem Solving by Computer Finite Element Method
Problem Solving by Computer Finite Element Method
 
Recursion
RecursionRecursion
Recursion
 
Lesson 21: Antiderivatives (slides)
Lesson 21: Antiderivatives (slides)Lesson 21: Antiderivatives (slides)
Lesson 21: Antiderivatives (slides)
 
What is meaning of epsilon and delta in limits of a function by Arun Umrao
What is meaning of epsilon and delta in limits of a function by Arun UmraoWhat is meaning of epsilon and delta in limits of a function by Arun Umrao
What is meaning of epsilon and delta in limits of a function by Arun Umrao
 
Ch7
Ch7Ch7
Ch7
 
Backtracking
BacktrackingBacktracking
Backtracking
 

Similar to Skiena algorithm 2007 lecture17 edit distance

C Programming Interview Questions
C Programming Interview QuestionsC Programming Interview Questions
C Programming Interview Questions
Gradeup
 
Algorithm Assignment Help
Algorithm Assignment HelpAlgorithm Assignment Help
Algorithm Assignment Help
Programming Homework Help
 
Skiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracingSkiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracing
zukun
 
Project2
Project2Project2
Project2
Linjun Li
 
Unit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptxUnit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptx
ssuser01e301
 
Mit6 006 f11_quiz1
Mit6 006 f11_quiz1Mit6 006 f11_quiz1
Mit6 006 f11_quiz1
Sandeep Jindal
 
PCB_Lect02_Pairwise_allign (1).pdf
PCB_Lect02_Pairwise_allign (1).pdfPCB_Lect02_Pairwise_allign (1).pdf
PCB_Lect02_Pairwise_allign (1).pdf
ssusera1eccd
 
Fst ch3 notes
Fst ch3 notesFst ch3 notes
Fst ch3 notes
David Tchozewski
 
Ip 5 discrete mathematics
Ip 5 discrete mathematicsIp 5 discrete mathematics
Ip 5 discrete mathematics
Mark Simon
 
algorithm Unit 3
algorithm Unit 3algorithm Unit 3
algorithm Unit 3
Monika Choudhery
 
C1 - Insertion Sort
C1 - Insertion SortC1 - Insertion Sort
C1 - Insertion Sort
Juan Zamora, MSc. MBA
 
Discrete Math IP4 - Automata Theory
Discrete Math IP4 - Automata TheoryDiscrete Math IP4 - Automata Theory
Discrete Math IP4 - Automata Theory
Mark Simon
 
Ee693 sept2014quizgt2
Ee693 sept2014quizgt2Ee693 sept2014quizgt2
Ee693 sept2014quizgt2
Gopi Saiteja
 
Matlab1
Matlab1Matlab1
Matlab1
guest8ba004
 
Skiena algorithm 2007 lecture18 application of dynamic programming
Skiena algorithm 2007 lecture18 application of dynamic programmingSkiena algorithm 2007 lecture18 application of dynamic programming
Skiena algorithm 2007 lecture18 application of dynamic programming
zukun
 
Amcat automata questions
Amcat   automata questionsAmcat   automata questions
Amcat automata questions
ESWARANM92
 
Amcat automata questions
Amcat   automata questionsAmcat   automata questions
Amcat automata questions
ESWARANM92
 
Design and Analysis of algorithms
Design and Analysis of algorithmsDesign and Analysis of algorithms
Design and Analysis of algorithms
Dr. Rupa Ch
 
Tutorial 2
Tutorial     2Tutorial     2
Tutorial 2
Mohamed Yaser
 
Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...
Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...
Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...
Seokhwan Kim
 

Similar to Skiena algorithm 2007 lecture17 edit distance (20)

C Programming Interview Questions
C Programming Interview QuestionsC Programming Interview Questions
C Programming Interview Questions
 
Algorithm Assignment Help
Algorithm Assignment HelpAlgorithm Assignment Help
Algorithm Assignment Help
 
Skiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracingSkiena algorithm 2007 lecture15 backtracing
Skiena algorithm 2007 lecture15 backtracing
 
Project2
Project2Project2
Project2
 
Unit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptxUnit-1 Basic Concept of Algorithm.pptx
Unit-1 Basic Concept of Algorithm.pptx
 
Mit6 006 f11_quiz1
Mit6 006 f11_quiz1Mit6 006 f11_quiz1
Mit6 006 f11_quiz1
 
PCB_Lect02_Pairwise_allign (1).pdf
PCB_Lect02_Pairwise_allign (1).pdfPCB_Lect02_Pairwise_allign (1).pdf
PCB_Lect02_Pairwise_allign (1).pdf
 
Fst ch3 notes
Fst ch3 notesFst ch3 notes
Fst ch3 notes
 
Ip 5 discrete mathematics
Ip 5 discrete mathematicsIp 5 discrete mathematics
Ip 5 discrete mathematics
 
algorithm Unit 3
algorithm Unit 3algorithm Unit 3
algorithm Unit 3
 
C1 - Insertion Sort
C1 - Insertion SortC1 - Insertion Sort
C1 - Insertion Sort
 
Discrete Math IP4 - Automata Theory
Discrete Math IP4 - Automata TheoryDiscrete Math IP4 - Automata Theory
Discrete Math IP4 - Automata Theory
 
Ee693 sept2014quizgt2
Ee693 sept2014quizgt2Ee693 sept2014quizgt2
Ee693 sept2014quizgt2
 
Matlab1
Matlab1Matlab1
Matlab1
 
Skiena algorithm 2007 lecture18 application of dynamic programming
Skiena algorithm 2007 lecture18 application of dynamic programmingSkiena algorithm 2007 lecture18 application of dynamic programming
Skiena algorithm 2007 lecture18 application of dynamic programming
 
Amcat automata questions
Amcat   automata questionsAmcat   automata questions
Amcat automata questions
 
Amcat automata questions
Amcat   automata questionsAmcat   automata questions
Amcat automata questions
 
Design and Analysis of algorithms
Design and Analysis of algorithmsDesign and Analysis of algorithms
Design and Analysis of algorithms
 
Tutorial 2
Tutorial     2Tutorial     2
Tutorial 2
 
Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...
Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...
Deep Recurrent Neural Networks with Layer-wise Multi-head Attentions for Punc...
 

More from zukun

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009
zukun
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
zukun
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
zukun
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
zukun
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
zukun
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
zukun
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
zukun
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
zukun
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
zukun
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
zukun
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
zukun
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
zukun
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
zukun
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
zukun
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
zukun
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
zukun
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
zukun
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
zukun
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
zukun
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
zukun
 

More from zukun (20)

My lyn tutorial 2009
My lyn tutorial 2009My lyn tutorial 2009
My lyn tutorial 2009
 
ETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCVETHZ CV2012: Tutorial openCV
ETHZ CV2012: Tutorial openCV
 
ETHZ CV2012: Information
ETHZ CV2012: InformationETHZ CV2012: Information
ETHZ CV2012: Information
 
Siwei lyu: natural image statistics
Siwei lyu: natural image statisticsSiwei lyu: natural image statistics
Siwei lyu: natural image statistics
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Brunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer visionBrunelli 2008: template matching techniques in computer vision
Brunelli 2008: template matching techniques in computer vision
 
Modern features-part-4-evaluation
Modern features-part-4-evaluationModern features-part-4-evaluation
Modern features-part-4-evaluation
 
Modern features-part-3-software
Modern features-part-3-softwareModern features-part-3-software
Modern features-part-3-software
 
Modern features-part-2-descriptors
Modern features-part-2-descriptorsModern features-part-2-descriptors
Modern features-part-2-descriptors
 
Modern features-part-1-detectors
Modern features-part-1-detectorsModern features-part-1-detectors
Modern features-part-1-detectors
 
Modern features-part-0-intro
Modern features-part-0-introModern features-part-0-intro
Modern features-part-0-intro
 
Lecture 02 internet video search
Lecture 02 internet video searchLecture 02 internet video search
Lecture 02 internet video search
 
Lecture 01 internet video search
Lecture 01 internet video searchLecture 01 internet video search
Lecture 01 internet video search
 
Lecture 03 internet video search
Lecture 03 internet video searchLecture 03 internet video search
Lecture 03 internet video search
 
Icml2012 tutorial representation_learning
Icml2012 tutorial representation_learningIcml2012 tutorial representation_learning
Icml2012 tutorial representation_learning
 
Advances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer visionAdvances in discrete energy minimisation for computer vision
Advances in discrete energy minimisation for computer vision
 
Gephi tutorial: quick start
Gephi tutorial: quick startGephi tutorial: quick start
Gephi tutorial: quick start
 
EM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysisEM algorithm and its application in probabilistic latent semantic analysis
EM algorithm and its application in probabilistic latent semantic analysis
 
Object recognition with pictorial structures
Object recognition with pictorial structuresObject recognition with pictorial structures
Object recognition with pictorial structures
 
Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities Iccv2011 learning spatiotemporal graphs of human activities
Iccv2011 learning spatiotemporal graphs of human activities
 

Recently uploaded

Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
christinelarrosa
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
saastr
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
christinelarrosa
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
DianaGray10
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
c5vrf27qcz
 

Recently uploaded (20)

Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptxPRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
PRODUCT LISTING OPTIMIZATION PRESENTATION.pptx
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
9 CEO's who hit $100m ARR Share Their Top Growth Tactics Nathan Latka, Founde...
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 

Skiena algorithm 2007 lecture17 edit distance

  • 1. Lecture 17: Edit Distance Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794–4400 http://www.cs.sunysb.edu/∼skiena
  • 2. Problem of the Day Suppose you are given three strings of characters: X, Y , and Z, where |X| = n, |Y | = m, and |Z| = n + m. Z is said to be a shuffle of X and Y iff Z can be formed by interleaving the characters from X and Y in a way that maintains the left- to-right ordering of the characters from each string. 1. Show that cchocohilaptes is a shuffle of chocolate and chips, but chocochilatspe is not.
  • 3. 2. Give an efficient dynamic-programming algorithm that determines whether Z is a shuffle of X and Y . Hint: The values of the dynamic programming matrix you construct should be Boolean, not numeric.
  • 4. Edit Distance Misspellings make approximate pattern matching an impor- tant problem If we are to deal with inexact string matching, we must first define a cost function telling us how far apart two strings are, i.e., a distance measure between pairs of strings. A reasonable distance measure minimizes the cost of the changes which have to be made to convert one string to another.
  • 5. String Edit Operations There are three natural types of changes: • Substitution – Change a single character from pattern s to a different character in text t, such as changing “shot” to “spot”. • Insertion – Insert a single character into pattern s to help it match text t, such as changing “ago” to “agog”. • Deletion – Delete a single character from pattern s to help it match text t, such as changing “hour” to “our”.
  • 6. Recursive Algorithm We can compute the edit distance with recursive algorithm using the observation that the last character in the string must either be matched, substituted, inserted, or deleted. If we knew the cost of editing the three pairs of smaller strings, we could decide which option leads to the best solution and choose that option accordingly. We can learn this cost, through the magic of recursion:
  • 7. Recursive Edit Distance Code #define MATCH 0 (* enumerated type symbol for match *) #define INSERT 1 (* enumerated type symbol for insert *) #define DELETE 2 (* enumerated type symbol for delete *) int string compare(char *s, char *t, int i, int j) { int k; (* counter *) int opt[3]; (* cost of the three options *) int lowest cost; (* lowest cost *) if (i == 0) return(j * indel(’ ’)); if (j == 0) return(i * indel(’ ’)); opt[MATCH] = string compare(s,t,i-1,j-1) + match(s[i],t[j]); opt[INSERT] = string compare(s,t,i,j-1) + indel(t[j]); opt[DELETE] = string compare(s,t,i-1,j) + indel(s[i]); lowest cost = opt[MATCH]; for (k=INSERT; k<=DELETE; k++) if (opt[k] < lowest cost) lowest cost = opt[k]; return( lowest cost ); }
  • 8. Speeding it Up This program is absolutely correct but takes exponential time because it recomputes values again and again and again! But there can only be |s| · |t| possible unique recursive calls, since there are only that many distinct (i, j) pairs to serve as the parameters of recursive calls. By storing the values for each of these (i, j) pairs in a table, we can avoid recomputing them and just look them up as needed.
  • 9. The Dynamic Programming Table The table is a two-dimensional matrix m where each of the |s| · |t| cells contains the cost of the optimal solution of this subproblem, as well as a parent pointer explaining how we got to this location: typedef struct { int cost; (* cost of reaching this cell *) int parent; (* parent cell *) } cell; cell m[MAXLEN+1][MAXLEN+1]; (* dynamic programming table *)
  • 10. Differences with Dynamic Programming The dynamic programming version has three differences from the recursive version: • First, it gets its intermediate values using table lookup instead of recursive calls. • Second, it updates the parent field of each cell, which will enable us to reconstruct the edit-sequence later. • Third, it is instrumented using a more general goal cell() function instead of just returning m[|s|][|t|].cost. This will enable us to apply this routine to a wider class of problems.
  • 11. We assume that each string has been padded with an initial blank character, so the first real character of string s sits in s[1].
  • 12. Evaluation Order To determine the value of cell (i, j) we need three values sitting and waiting for us, namely, the cells (i − 1, j − 1), (i, j − 1), and (i − 1, j). Any evaluation order with this property will do, including the row-major order used in this program. Think of the cells as vertices, where there is an edge (i, j) if cell i’s value is needed to compute cell j. Any topological sort of this DAG provides a proper evaluation order.
  • 13. Dynamic Programming Edit Distance int string compare(char *s, char *t) { int i,j,k; (* counters *) int opt[3]; (* cost of the three options *) for (i=0; i<MAXLEN; i++) { row init(i); column init(i); } for (i=1; i<strlen(s); i++) for (j=1; j<strlen(t); j++) { opt[MATCH] = m[i-1][j-1].cost + match(s[i],t[j]); opt[INSERT] = m[i][j-1].cost + indel(t[j]); opt[DELETE] = m[i-1][j].cost + indel(s[i]); m[i][j].cost = opt[MATCH]; m[i][j].parent = MATCH; for (k=INSERT; k<=DELETE; k++) if (opt[k] < m[i][j].cost) { m[i][j].cost = opt[k]; m[i][j].parent = k; } }
  • 14. Example Below is an example run, showing the cost and parent values turning “thou shalt not” to “you should not” in five moves: P y o u - s h o u l d - n o t T pos 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 t: 1 1 1 2 3 4 5 6 7 8 9 10 11 12 13 13 h: 2 2 2 2 3 4 5 5 6 7 8 9 10 11 12 13 o: 3 3 3 2 3 4 5 6 5 6 7 8 9 10 11 12 u: 4 4 4 3 2 3 4 5 6 5 6 7 8 9 10 11 -: 5 5 5 4 3 2 3 4 5 6 6 7 7 8 9 10 s: 6 6 6 5 4 3 2 3 4 5 6 7 8 8 9 10 h: 7 7 7 6 5 4 3 2 3 4 5 6 7 8 9 10 a: 8 8 8 7 6 5 4 3 3 4 5 6 7 8 9 10 l: 9 9 9 8 7 6 5 4 4 4 4 5 6 7 8 9 t: 10 10 10 9 8 7 6 5 5 5 5 5 6 7 8 8 -: 11 11 11 10 9 8 7 6 6 6 6 6 5 6 7 8 n: 12 12 12 11 10 9 8 7 7 7 7 7 6 5 6 7 o: 13 13 13 12 11 10 9 8 7 8 8 8 7 6 5 6 t: 14 14 14 13 12 11 10 9 8 8 9 9 8 7 6 5 The edit sequence from “thou-shalt-not” to “you-should-not” is DSMMMMMISMSMMMM
  • 15. Reconstructing the Path Solutions to a given dynamic programming problem are described by paths through the dynamic programming matrix, starting from the initial configuration (the pair of empty strings (0, 0)) down to the final goal state (the pair of full strings (|s|, |t|)). Reconstructing these decisions is done by walking backward from the goal state, following the parent pointer to an earlier cell. The parent field for m[i,j] tells us whether the transform at (i, j) was MATCH, INSERT, or DELETE. Walking backward reconstructs the solution in reverse order. However, clever use of recursion can do the reversing for us:
  • 16. Reconstruct Path Code reconstruct path(char *s, char *t, int i, int j) { if (m[i][j].parent == -1) return; if (m[i][j].parent == MATCH) { reconstruct path(s,t,i-1,j-1); match out(s, t, i, j); return; } if (m[i][j].parent == INSERT) { reconstruct path(s,t,i,j-1); insert out(t,j); return; } if (m[i][j].parent == DELETE) { reconstruct path(s,t,i-1,j); delete out(s,i); return; } }
  • 17. Customizing Edit Distance • Table Initialization – The functions row init() and column init() initialize the zeroth row and column of the dynamic programming table, respectively. • Penalty Costs – The functions match(c,d) and indel(c) present the costs for transforming character c to d and inserting/deleting character c. For edit distance, match costs nothing if the characters are identical, and 1 otherwise, while indel always returns 1. • Goal Cell Identification – The function goal cell returns the indices of the cell marking the endpoint of the
  • 18. solution. For edit distance, this is defined by the length of the two input strings. • Traceback Actions – The functions match out, insert out, and delete out perform the appropri- ate actions for each edit-operation during traceback. For edit distance, this might mean printing out the name of the operation or character involved, as determined by the needs of the application.
  • 19. Substring Matching Suppose that we want to find where a short pattern s best occurs within a long text t, say, searching for “Skiena” in all its misspellings (Skienna, Skena, Skina, . . . ). Plugging this search into our original edit distance function will achieve little sensitivity, since the vast majority of any edit cost will be that of deleting the body of the text. We want an edit distance search where the cost of starting the match is independent of the position in the text, so that a match in the middle is not prejudiced against. Likewise, the goal state is not necessarily at the end of both strings, but the cheapest place to match the entire pattern somewhere in the text.
  • 20. Customizations row init(int i) { m[0][i].cost = 0; (* note change *) m[0][i].parent = -1; (* note change *) } goal cell(char *s, char *t, int *i, int *j) { int k; (* counter *) *i = strlen(s) - 1; *j = 0; for (k=1; k<strlen(t); k++) if (m[*i][k].cost < m[*i][*j].cost) *j = k; }
  • 21. Longest Common Subsequence The longest common subsequence (not substring) between “democrat” and “republican” is eca. A common subsequence is defined by all the identical- character matches in an edit trace. To maximize the number of such traces, we must prevent substitution of non-identical characters. int match(char c, char d) { if (c == d) return(0); else return(MAXLEN); }
  • 22. Maximum Monotone Subsequence A numerical sequence is monotonically increasing if the ith element is at least as big as the (i − 1)st element. The maximum monotone subsequence problem seeks to delete the fewest number of elements from an input string S to leave a monotonically increasing subsequence. Thus a longest increasing subsequence of “243517698” is “23568.”
  • 23. Reduction to LCS In fact, this is just a longest common subsequence problem, where the second string is the elements of S sorted in increasing order. Any common sequence of these two must (a) represent characters in proper order in S, and (b) use only characters with increasing position in the collating sequence, so the longest one does the job.