final report.doc



  1. 1. APPLICATION OF GENETIC PROGRAMMING TOWARDS WORD ALIGNERS
     BENJAMIN HEILERS
     Department of Electrical Engineering and Computer Science
     University of California, Berkeley
     December 2004
     CS 294-5
     keywords: Genetic Programming, Word Aligners, Machine Translation, Machine Learning, Genetic Algorithms, Natural Language Processing, Artificial Intelligence
  2. 2. 1. Introduction

     This paper details the (as yet unfruitful, and thus determinedly ongoing) research into the application of genetic programming towards optimizing word aligners. Popular belief holds the use of genetic programming in machine translation to be infeasible. Regardless, it is the goal of the author, admittedly due to an infatuation with machine learning in general, to convince himself personally whether such a widespread sentiment is well-founded or wholly amiss.

     A word aligner is a program that couples words in sentence pairs, in effect constructing a bilingual dictionary [1:484, 2]. Genetic Programming, a specific branch of Genetic Algorithms, searches across a function space by mimicking the natural reproductive methods (selection, mutation, crossover) in the hope that the great successes of evolution on living creatures may be repeated on programs [3:47-56, 4]. Genetic Algorithms deal more broadly with evolving all types of functions. Here, Genetic Programming takes a set of programs and selects those most resembling word aligners, subjecting them to alterations in the hope of finding yet better candidates. The selection process tests each program on a subset of the sentence-pair corpus, thus qualifying this approach as supervised learning.

     Another concept appearing in this paper and warranting a definition is that of the Abstract Syntax Tree (AST), a representation of computer code in a format particularly useful for genetically reproductive processes [5:9]. Abstract Syntax Trees are preferable for two reasons:

     • It is conceptually many times simpler to apply crossover and mutation to a tree representation than to program code in string form.
  3. 3. • A design pattern, the Visitor Pattern, suggests an easily implementable approach to traversing this representation of program code [9, 10].

     Figure 1: An Eclipse Abstract Syntax Tree in Graphical and Textual Forms.

     Note in Figure 1 that the Eclipse AST, the package used in this research, maintains some information within nodes (such as the operator in infix expressions), whereas some AST representations place this information in a child node.

     2. Literature Review
  4. 4. There is little literature on the application of genetic algorithms to word aligners. Instead, we turn to the literature on genetic programming, where suggestions to counteract various results-limiting phenomena abound. There are myriad decisions to make in implementing a genetic algorithm. Fortunately, the literature provides enough detailed discussion to allow for preparations against the most common problems with genetic programming.

     Franz Rothlauf is the first to write a book on the pros and cons of various representations in genetic algorithms [7]. Like many of his colleagues, he strongly recommends tree representations for genetic programming. This representation eases the implementation of mutation and crossover tremendously compared with the traditional representation as bit strings, where the chances that a mutated string still resembles working code are slim.
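To illustrate why a tree representation eases crossover, here is a minimal sketch in Java. The `Node` class and the `swapSubtrees` helper are hypothetical names chosen for illustration; they are not the Eclipse AST API used in this research, and the sketch assumes binary operators only. The point is merely that exchanging two subtrees always leaves two well-formed trees, which an arbitrary cut in a bit string or source string does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical expression-tree sketch; not the Eclipse AST used in the paper. */
class TreeCrossoverSketch {

    /** A node is either a binary operator with two children, or a leaf. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                children.add(k);
            }
        }

        /** Renders the tree as a fully parenthesised infix expression. */
        String render() {
            if (children.isEmpty()) {
                return label;
            }
            return "(" + children.get(0).render() + " " + label + " "
                    + children.get(1).render() + ")";
        }
    }

    /** Swaps child i of parentA with child j of parentB; both trees stay well-formed. */
    static void swapSubtrees(Node parentA, int i, Node parentB, int j) {
        Node tmp = parentA.children.get(i);
        parentA.children.set(i, parentB.children.get(j));
        parentB.children.set(j, tmp);
    }

    public static void main(String[] args) {
        // a is (x + 5); b is (3 * (y - 1))
        Node a = new Node("+", new Node("x"), new Node("5"));
        Node b = new Node("*", new Node("3"),
                new Node("-", new Node("y"), new Node("1")));
        swapSubtrees(a, 1, b, 1);
        System.out.println(a.render()); // (x + (y - 1))
        System.out.println(b.render()); // (3 * 5)
    }
}
```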
  5. 5.
         public Alignment alignSentencePair (SentencePair sentencePair){
           MISSING=2092010418 <= -1198423683;
           alignment=new Alignment();
           I4=addAlignment(alignment,I4,I4,B3);
           B4=false;
           I2=numEnglishWordsInSentence(sentencePair);
           if (I2 < -594586326){
             D2=getDouble(L3,I1);
           } else {
             addInt(L2,I1,I3);
             getInt(L5,I2);
             addBoolean(L3,I1,B2);
             while (I2 < 1564864814){
               addDouble(L5,I5,D1);
               MISSING=664939021;
               alignment=getString(L1,I2);
               MISSING=311599999 * 1197784289;
               D1=-287916828;
             }
           }
           I2=numFrenchWordsInSentence(sentencePair);
           for (I3=0;I3 < I1;I3++){
             I4=-1;
             D1=0;
             for (I5=0;I5 < I2;I5++){
               D2=50 / (1 + abs(I3 - I5));
               if (D2 >= D1){
                 D1=D2;
                 I4=I5;
               }
             }
             addAlignment(alignment,I4,I3,true);
           }
           return alignment;
         }

     Figure 2: Example of Bloat. Lines resembling the original file are in bold. Lines colored red are added as a result of the bloat phenomenon.

     The phenomenon of bloat is widely mentioned, whereby each successive generation displays a much larger file size than the previous, yet most of the added code contributes little to no added functionality. With high rates of mutation, I have seen 450 lines (nine pages) of code introduced to an initially twenty-line file after fewer than ten generations. There are several mechanisms in place to cope with bloat, as discussed later.

     Another commonly observed problem to cope with is over-fitting. This occurs when the genetic process is allowed to run for too long. For example, the corpus used in this research consists of 447 sentence pairs, pairing English and French sentences. If we are
  6. 6. to choose the first ten and evolve randomly generated programs to return alignments of these, then at some point we may theoretically find a reasonable solution which achieves superb results not only on the ten training sentence pairs, but on the 447 total sentence pairs as well. However, if we continue to evolve past this point, chances are that our population will become over-fit to these ten sentences. This is similar, for example, to hoping to find the equation y = x^2, but instead arriving at y = 1, with training data of only (-1, 1) and (1, 1).

     Figure 3: After fitness is reached, over-fitting to the training data may occur.
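The over-fitting example above can be checked numerically. The following is a minimal sketch, assuming a sum-of-squared-errors measure and two extra held-out points that are not in the original text; the class and method names are hypothetical. It compares y = x^2 with y = |x| and y = 1 (the other two curves drawn in Figure 4): all three fit the two training points exactly, so the training data alone cannot tell them apart.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch: several functions fit the training data, only one generalises. */
class OverfitSketch {

    /** Sum of squared errors of f over an array of (x, y) pairs. */
    static double squaredError(DoubleUnaryOperator f, double[][] data) {
        double sum = 0.0;
        for (double[] pair : data) {
            double diff = f.applyAsDouble(pair[0]) - pair[1];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] training = { {-1, 1}, {1, 1} };   // training data from the text
        double[][] heldOut = { {2, 4}, {-3, 9} };    // assumed extra samples of y = x^2

        DoubleUnaryOperator square = x -> x * x;
        DoubleUnaryOperator abs = Math::abs;
        DoubleUnaryOperator one = x -> 1.0;

        String[] names = { "y = x^2", "y = |x|", "y = 1" };
        DoubleUnaryOperator[] candidates = { square, abs, one };
        for (int i = 0; i < candidates.length; i++) {
            System.out.println(names[i]
                    + "  training error: " + squaredError(candidates[i], training)
                    + "  held-out error: " + squaredError(candidates[i], heldOut));
        }
    }
}
```

On the training pairs every candidate has error 0; on the assumed held-out pairs the errors are 0, 40, and 73 respectively, which is the gap that evaluating only on the training set hides.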
  7. 7. Figure 4: Example of Over-Fitting. The solid black line is y = x^2, the red dotted line is y = |x|, and the blue dashed line is y = 1. The training data is { (1, 1), (-1, 1) }, but the desired function is { (x, y) : y = x^2 }.

     The literature is also helpful in suggesting approximate values for the frequencies at which to apply mutation and crossover to members of the population, though the best values are apparently learned only by trial and error.

     3. General Overview of Algorithm

     In general, instead of searching across the solution space, we use GP to aid in the search across the function space. As the function space is of immense proportions, we randomly sample it, and then search through not only the sampled functions but others similar to them. In the graph above, we may have a function y = x + 5. This would lead us to search similar functions such as y = x + 6, y = 3x + 5, y = x^2 + 5, etc. Since it is infeasible to evaluate every possible function of similar form to y = x + 5, we must again find a method with which to decide which functions to search. This is the basic concept of genetic programming: a desired set of (input, output) pairs is known, and we search for the function (or possibly one of many functions) which produces these outputs. Doing this search across program code is many times more complex than across math equations.
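The idea of searching functions similar to y = x + 5 can be sketched as follows. This is only an illustration under strong assumptions: the function space is restricted to the hypothetical form y = a * x^p + b, the notion of "similar" is a hand-picked set of four small changes, and the (input, output) pairs are invented samples of y = x^2 + 5. None of these names or choices come from the original implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of one step of search over functions near y = x + 5. */
class NeighbourSearchSketch {

    /** Represents y = a * x^p + b, a deliberately tiny "function space". */
    static final class Fn {
        final double a;
        final int p;
        final double b;

        Fn(double a, int p, double b) {
            this.a = a;
            this.p = p;
            this.b = b;
        }

        double apply(double x) {
            return a * Math.pow(x, p) + b;
        }
    }

    /** Functions one small change away from f. */
    static List<Fn> neighbours(Fn f) {
        List<Fn> out = new ArrayList<>();
        out.add(new Fn(f.a, f.p, f.b + 1));   // e.g. y = x + 6
        out.add(new Fn(f.a, f.p, f.b - 1));   // e.g. y = x + 4
        out.add(new Fn(f.a * 3, f.p, f.b));   // e.g. y = 3x + 5
        out.add(new Fn(f.a, f.p + 1, f.b));   // e.g. y = x^2 + 5
        return out;
    }

    /** Sum of squared errors of f over the known (input, output) pairs. */
    static double error(Fn f, double[][] pairs) {
        double sum = 0.0;
        for (double[] pair : pairs) {
            double diff = f.apply(pair[0]) - pair[1];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns f or whichever neighbour has the lowest error on the pairs. */
    static Fn bestOf(Fn f, double[][] pairs) {
        Fn best = f;
        double bestError = error(f, pairs);
        for (Fn candidate : neighbours(f)) {
            double e = error(candidate, pairs);
            if (e < bestError) {
                best = candidate;
                bestError = e;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] pairs = { {-2, 9}, {0, 5}, {3, 14} };  // assumed samples of y = x^2 + 5
        Fn best = bestOf(new Fn(1, 1, 5), pairs);
        System.out.println("a=" + best.a + " p=" + best.p + " b=" + best.b);
    }
}
```

Genetic programming replaces the exhaustive neighbour list with randomly applied mutation and crossover, precisely because enumerating every similar function is infeasible once the candidates are programs rather than three-parameter equations.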
  8. 8. The flow chart shown here gives exactly the order in which genetic programming is implemented in this research.

     Figure 5: Flow Chart of GP

     An initial population is created by taking files such as the one in the Appendix and sending them through several generations of high mutation. Since the current version of this GP process is still prone to producing erroneous code, many more programs are generated than asked for. Each is then evaluated according to the fitness function, and those which have compile or runtime errors (at this point mostly due to invalid arguments, incorrect casting, and undeclared variable names; see Results) are filtered out and thrown away. Thus the GP process begins with only valid programs in its initial population.

     From here, the population undergoes a number of iterations wherein each member is evaluated, then the next generation is selected, then crossover and mutation are allowed to occur. The rates for these are currently an 80% chance for 3-point crossover to occur (see Section 4) and 0.05% for mutation, as suggested by most literature.
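The iteration described above (evaluate, select, then crossover and mutation at the stated rates) can be sketched generically. This is a minimal sketch rather than the implementation used in this research: the `Selector` interface and all method names are hypothetical, the selection step is left pluggable, and crossover is simplified to produce a single child from two parents instead of swapping code between two members.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.BinaryOperator;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of one step of a generational GP loop. */
class GenerationalLoopSketch {
    static final double CROSSOVER_RATE = 0.80;    // 80%, as stated in the text
    static final double MUTATION_RATE = 0.0005;   // 0.05%, as stated in the text

    /** Picks one parent given the population and its fitness values (assumed interface). */
    interface Selector<T> {
        T pick(List<T> population, double[] fitness, Random rng);
    }

    /** Produces the next generation; the population size stays constant. */
    static <T> List<T> nextGeneration(List<T> population,
                                      ToDoubleFunction<T> fitnessFn,
                                      Selector<T> selector,
                                      BinaryOperator<T> crossover,
                                      UnaryOperator<T> mutate,
                                      Random rng) {
        // Evaluate every member once.
        double[] fitness = new double[population.size()];
        for (int i = 0; i < population.size(); i++) {
            fitness[i] = fitnessFn.applyAsDouble(population.get(i));
        }
        // Select, then apply crossover and mutation with their probabilities.
        List<T> next = new ArrayList<>(population.size());
        while (next.size() < population.size()) {
            T child = selector.pick(population, fitness, rng);
            if (rng.nextDouble() < CROSSOVER_RATE) {
                child = crossover.apply(child, selector.pick(population, fitness, rng));
            }
            if (rng.nextDouble() < MUTATION_RATE) {
                child = mutate.apply(child);
            }
            next.add(child);
        }
        return next;
    }

    public static void main(String[] args) {
        // Toy demo on integers; the operators below are placeholders, not word aligners.
        List<Integer> population = Arrays.asList(1, 2, 3, 4);
        List<Integer> next = nextGeneration(
                population,
                x -> x.doubleValue(),
                (pop, fit, rng) -> pop.get(rng.nextInt(pop.size())),  // uniform placeholder pick
                (a, b) -> Math.max(a, b),
                x -> x + 1,
                new Random(42));
        System.out.println(next);
    }
}
```

In the actual process the members are compilation units, the fitness function runs each candidate aligner on sentence pairs, and invalid programs are discarded before the loop starts.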
  9. 9. 4. Details of Implementation Decisions

     Since the design decisions are independent of one another, and many need to be presented simultaneously, I have chosen to format this section by discussing each one on its own, in as orderly a fashion as possible.

     AST representation: The members of the population could be represented in myriad ways. Many people implementing genetic programming choose to create an original representation of their function. This is unfortunately due to the newness of the field, and it hinders the progress of future work by allowing researchers to get bogged down in minor details which could potentially be settled already. I admire Franz Rothlauf's efforts to correct this problem, and agree with him that the best representation for my purposes is an AST. This allows for an easy crossover implementation, and necessitates only moderate work to implement mutation.

     Generational model: In the generational model, the lifespan of each population member is only a single generation, as opposed to the Steady-State model, which not only selects members for reproduction but also selects which members they will replace [8:134]. Both models use a constant-sized population. I have chosen the generational model here for its simplicity of design.

     Initial Population: The usual approach is to start with a completely random set of programs. This seems unreasonable. Why create programs which construct strings and draw websites when we are looking for a program to add two numbers together? I have taken an approach for which I could not find any literature: to start with a population of programs similar to that in the Appendix. In some cases, I have even
  10. 10. placed some initial code within the for-loop, to make alignments based on superficial traits. I do not rule out the possibility that this second strategy in effect may steer me into the wrong direction, by intending to run the program both with and with the other versions (entitled It is my hope that by providing some base code, completely random results will be avoided and a better chance of finding an optimal solution will result.

     Fitness function: The fitness function, the measure by which we decide which members yield the most desirable results and thus have the most potential for being prototypes of our desired word aligner, seems obvious. The goal is to maximize the precision and recall while minimizing the AER, as defined in [11:1]. Thus the fitness function calls alignSentencePair of each population member on a small subset of the full corpus, and returns the weighted sum of these numbers (where w1, w2, w3 are the weights):

         10 * [ w1 * P + w2 * R + w3 * (1 - AER) ]

     Countering Over-Fitting: An easy way to counter over-fitting to the training data used in the fitness function is to keep the training set dynamic. I have implemented this by choosing a random set each time. Thus there is no worry of over-fitting to the specific set of sentence pairs being learned on, since there is no specific set of sentence pairs.

     Fitness-Proportionate Selection: There are two methods of selection in widespread use: tournament selection and fitness-proportionate selection [3:37]. In tournament selection, several tournaments are held in which the fitness is calculated, and the winner of each tournament is selected for reproduction. As the fitness function here requires no
  11. 11. small amount of time, for matters of efficiency I chose the less computationally expensive selection process, fitness-proportionate selection. Here each member of the population is evaluated once, and then the new generation is randomly selected, with probability proportional to the fitness of each member [see Figure 6]. The general risk is that with a wide variety of fitness values, members with lower fitness values will be excluded from selection and the diversity of the population will disappear, leading to premature convergence. It is my hope that with a non-random initial population, the disparity in the fitness values will not be as dangerous as if the population had been initialized truly randomly.

     Figure 6: The Proportionate Fitness Selection Process is Akin to a Game of Darts.

     Elitist Strategy: This is the decision to leave the best-fit member of the population in the next generation, unmodified, although this does not rule out the possibility of including genetic reproductions of this member as well.

     Copies to file best of each generation: Looking back to the chart in Section 2, Literature Review (Figure 3), we acknowledge that there is a reduction in quality when one allows the GP process to run for too many generations. To alleviate this, a copy of the member with the highest fitness value in each generation is written to a file in a separate folder. The end process, where we test the evolved word aligners, tests the best-fit member of each generation, not solely that of the final generation.
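The fitness function, fitness-proportionate selection, and the elitist strategy described above can be sketched together. This is a minimal sketch with hypothetical class and method names, not the code used in this research. It assumes non-negative fitness values with a positive sum and a non-empty population. The text does not state the weights; the values in the results table (for example P = 0.3658, R = 0.2258, AER = 0.6864, fitness 9.0520) are consistent with w1 = w2 = w3 = 1, so the demo uses unit weights as an assumption.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of the fitness formula, roulette-wheel selection and elitism. */
class SelectionSketch {

    /** 10 * [ w1 * P + w2 * R + w3 * (1 - AER) ], the weighted sum given in the text. */
    static double fitness(double p, double r, double aer,
                          double w1, double w2, double w3) {
        return 10.0 * (w1 * p + w2 * r + w3 * (1.0 - aer));
    }

    /** Returns index i with probability fitness[i] / sum(fitness). */
    static int rouletteSelect(double[] fitness, Random rng) {
        double total = 0.0;
        for (double f : fitness) {
            total += f;
        }
        double dart = rng.nextDouble() * total;   // where the dart lands on the board
        double cumulative = 0.0;
        for (int i = 0; i < fitness.length; i++) {
            cumulative += fitness[i];
            if (dart < cumulative) {
                return i;
            }
        }
        return fitness.length - 1;                // guards against rounding at the upper edge
    }

    /** Index of the fittest member (first one on ties). */
    static int bestIndex(double[] fitness) {
        int best = 0;
        for (int i = 1; i < fitness.length; i++) {
            if (fitness[i] > fitness[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Indices chosen for the next generation: slot 0 is the elite, the rest are roulette picks. */
    static int[] selectGeneration(double[] fitness, Random rng) {
        int[] chosen = new int[fitness.length];
        chosen[0] = bestIndex(fitness);           // elitist strategy: keep the best unmodified
        for (int i = 1; i < chosen.length; i++) {
            chosen[i] = rouletteSelect(fitness, rng);
        }
        return chosen;
    }

    public static void main(String[] args) {
        double[] scores = {
            fitness(0.3658, 0.2258, 0.6864, 1, 1, 1),
            fitness(0.3535, 0.2909, 0.6678, 1, 1, 1),
            fitness(0.3935, 0.1889, 0.6966, 1, 1, 1)
        };
        System.out.println(Arrays.toString(scores));
        System.out.println(Arrays.toString(selectGeneration(scores, new Random(7))));
    }
}
```

Because every member is evaluated exactly once per generation, the cost of selection is dominated by the fitness evaluations themselves, which is the efficiency argument made above for preferring this scheme over tournaments.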
  12. 12. N-point crossover: The two common methods of crossover are single-point and n-point. In single-point crossover, a single point in the genome is chosen and two members have their code swapped at this point. N-point crossover allows for this to happen at multiple points, and is much more suitable to crossover on trees, where we are not dealing with the traditional fixed-length representation as in bit strings.

     Protected functions and variable types: To alleviate casting problems and protect against null pointer exceptions, it is easier to Refer to WordAligner class in Appendix, which is the super class of all other word aligner classes.

     Confine alterations: To minimize the number of off-track members of the population, alterations to the code are kept in the area where they matter the most. The basic information needed by all word aligners is as is shown in the appendix for Every word aligner should align each French word to some word in English. Thus, only the body of the for loop is altered.

     Halting problem: Mutation or crossover may create infinite loops [8:293-294]. To counter this, the fitness function makes use of threads and halts after a reasonable amount of time has elapsed. This also serves as an additional measure against excessive bloat.

     5. Results

     As can be seen from the example code, mutation never worked as desired. Nearly every mutation results in erroneous code. It seems the large majority are casting issues (in the figure here, D3 is a double and S1 is a string, while H5 is a HashMap). Those mutated programs which do compile include useless code, such as incrementing variables
  13. 13. used nowhere else in the program. Without mutation, the rest of the genetic programming process is little more than searching over the different orderings of the statements within the for loop, which holds no possible word aligners not already visible at first glance.

     Figure 7: Some Mutation Errors Still Mishandled.

     Figure 8: Initial Code and Code with "Better" Result.
  14. 14. Multiple runs each proved futile. As can be seen in the figure above, the best word aligner returned is only slightly improved in terms of performance. Looking at the code, the improvement appears to be a fluke: using the number of English words in place of the number of French words tends to give better recall results (because English sentences are in general shorter than their French versions, fewer proposed alignments allow for fewer erroneous guesses).

     Indeed, running further generations proved of no use. In effect, the code as is merely stares at effectively the same program each generation, since crossover alone is not enough to introduce variety, and the initial population is not truly random (not with the mutation process in its current status). The table below shows the evaluation results of the member with the highest fitness in each generation.

         Gen.  Precision  Recall  AER     Fitness
         0     0.3658     0.2258  0.6864  9.0520
         1     0.3658     0.2258  0.6864  9.0520
         2     0.3535     0.2909  0.6678  9.7660
         3     0.3658     0.2258  0.6864  9.0520
         4     0.3658     0.2258  0.6864  9.0520
         5     0.3658     0.2258  0.6864  9.0520
         6     0.3935     0.1889  0.6966  8.8580
         7     0.3658     0.2258  0.6864  9.0520
         8     0.3658     0.2258  0.6864  9.0520
         9     0.3658     0.2258  0.6864  9.0520
         10    0.3658     0.2250  0.6864  9.0520
         11    0.3658     0.2250  0.6864  9.0520
         12    0.3658     0.2250  0.6864  9.0520
         13    0.3658     0.2250  0.6864  9.0520

     Figure 9: Table of Fitness Function Results on a GP run

     6. Conclusion
  15. 15. In addition to deciphering the process of mutating code in a meaningful way, there are a few other tricks which I did not have time to experiment with, but which I think may prove useful. With regard to premature convergence, it would be interesting to add a feature whereby the mutation rate is raised greatly for one generation to promote an increase in diversity, triggered by a low standard deviation in the fitness values.

     Since I was not able to get mutation working, I never actually started off with I instead started off with the likeness of the file seen in Figure 8. From what I have seen, it seems that initializing from this file may yield a population lacking in diversity, even with higher mutation rates in the initial generations. It would be interesting to run a comparison between initializing from this file and from It is widely acknowledged that finding correct values for the mutation and crossover rates, the number of generations, and the population size, and choosing a heuristic function, are all decisions which are still made by trial and error. Studying the exact effects of raising and lowering each of these values will consume a considerable amount of time, but is vital before much more work can be done in the area of genetic programming in general.
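The proposed diversity trigger could look roughly like the following. This is a sketch of an untested idea, not something that was implemented in this research: the threshold, the boosted rate, the use of the population standard deviation, and all names are assumptions chosen for illustration.

```java
/** Hypothetical sketch: raise the mutation rate when fitness values stop varying. */
class AdaptiveMutationSketch {

    /** Population standard deviation of the fitness values. */
    static double standardDeviation(double[] fitness) {
        double mean = 0.0;
        for (double f : fitness) {
            mean += f;
        }
        mean /= fitness.length;
        double variance = 0.0;
        for (double f : fitness) {
            variance += (f - mean) * (f - mean);
        }
        return Math.sqrt(variance / fitness.length);
    }

    /** Returns boostedRate for this generation if diversity is low, otherwise baseRate. */
    static double mutationRate(double[] fitness, double baseRate,
                               double boostedRate, double threshold) {
        return standardDeviation(fitness) < threshold ? boostedRate : baseRate;
    }

    public static void main(String[] args) {
        double[] converged = { 9.052, 9.052, 9.052, 9.052 };
        double[] diverse = { 8.858, 9.052, 9.766, 7.5 };
        // 0.0005 is the 0.05% rate from the text; 0.05 and 0.1 are assumed values.
        System.out.println(mutationRate(converged, 0.0005, 0.05, 0.1));
        System.out.println(mutationRate(diverse, 0.0005, 0.05, 0.1));
    }
}
```

Recomputing the rate every generation means the boost lasts only as long as the fitness values remain tightly clustered, matching the "for one generation" intent when the boost succeeds in restoring diversity.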
  16. 16. Bibliography

     1. Manning and Schütze. Foundations of Statistical Natural Language Processing, pg. 484
     2. Automatic Construction of a Bilingual Lexicon:
     3. Ghanea-Hercock. Applied Evolutionary Algorithms in Java
     4. Genetic-Programming.Org:
     5. Grune, Bal, Jacobs, and Langendoen. Modern Compiler Design, pg. 9, 22, 52-55
     6. Langdon and Poli. Foundations of Genetic Programming
     7. Rothlauf. Representations for Genetic and Evolutionary Algorithms
     8. Banzhaf, Nordin, Keller, and Francone. Genetic Programming: An Introduction
     9. Gamma, Helm, Johnson, and Vlissides. Design Patterns
     10. Visitor Pattern:
     11. Assignment 4: Word Alignment Models:

     Figures (original unless otherwise noted)

     1. Abstract Syntax Tree in graphical and textual forms.
     2. Example of Bloat.
     3. After fitness is reached, overfitting to the training data may occur. [source: Schmiedle F, Drechsler N, Grosse D, Drechsler R. "Heuristic learning based on genetic programming." Genetic Programming & Evolvable Machines, Vol. 3, Dec. 2002, pg 376]
     4. Example of Over-Fitting
     5. Flow Chart of GP Approach. [chart source: Sette S, Boullart L. "Genetic programming: principles and applications." Engineering Applications of Artificial Intelligence, Vol. 14, Dec. 2001, pg 728]
  17. 17.
     6. Proportionate Fitness Selection
     7. Some Mutation Errors Still Mishandled
     8. Initial Code and Code with "Better" Result.
     9. Table of Fitness Function Results

     Appended Code

     Crossover: Takes two statements with parents of the same type (for/for, while/while, etc.). The Eclipse AST toolkit requires that each node belong to a certain tree, so simply switching trees is not possible. Instead, we clone the subtrees under their new owner with the static copySubtree(targetAST, sourceNode) method.

         private void crossover (int index1, Statement switch1, int index2, Statement switch2) {
             CompilationUnit cu1 = newPop[index1];
             CompilationUnit cu2 = newPop[index2];
             AST ast1 = cu1.getAST();
             AST ast2 = cu2.getAST();
             ASTNode p1 = switch1.getParent();
             ASTNode p2 = switch2.getParent();
             // Clone each statement under the AST that will own it after the swap.
             Statement switch1_under_ast2 = (Statement) ASTNode.copySubtree(ast2, switch1);
             Statement switch2_under_ast1 = (Statement) ASTNode.copySubtree(ast1, switch2);
             // How the swap is done depends on the type of the parent node.
             switch (p1.getNodeType()) {
             case ASTNode.BLOCK:
                 List m1 = ((Block) p1).statements();
                 List m2 = ((Block) p2).statements();
                 m1.set(m1.indexOf(switch1), switch2_under_ast1);
                 m2.set(m2.indexOf(switch2), switch1_under_ast2);
                 break;
             case ASTNode.IF_STATEMENT:
                 if (switch1.getLocationInParent().getId().equals("elseStatement")) {
                     ((IfStatement) p2).setElseStatement(switch1_under_ast2);
                     ((IfStatement) p1).setElseStatement(switch2_under_ast1);
                 } else {
                     ((IfStatement) p2).setThenStatement(switch1_under_ast2);
                     ((IfStatement) p1).setThenStatement(switch2_under_ast1);
                 }
                 break;
             case ASTNode.WHILE_STATEMENT:
                 ((WhileStatement) p2).setBody(switch1_under_ast2);
                 ((WhileStatement) p1).setBody(switch2_under_ast1);
                 break;
  18. 18.
             case ASTNode.FOR_STATEMENT:
                 ((ForStatement) p2).setBody(switch1_under_ast2);
                 ((ForStatement) p1).setBody(switch2_under_ast1);
                 break;
             default:
                 throw new RuntimeException("unhandled crossover for nodeType: " + p1.getNodeType());
             }
         }

     Mutation: Uses the Visitor Pattern and extends org.eclipse.jdt.internal.corext.dom.GenericVisitor with Mutator to implement mutation. Mutator is a file much too long to display here. The essentials are that it randomly changes register names and values in the code, as well as occasionally inserting newly generated lines of code and making calls to safely defined methods (that is to say, a divide that checks for division by zero, etc.).

         public void mutate(int index) {
             random.nextFloat()*interchangeableTable.size();
             Mutator mutator = new Mutator(seed, numRegisters);
             CompilationUnit cu = newPop[index];
             AST ast = cu.getAST();
             cu.accept(mutator);
         }

     WordAligner parent class: The following is edited due to length; redundant and obvious methods have been abbreviated. The Statistics object contains data gathered beforehand in an initial pass over the corpus, such as is used in unsupervised learning: Pr(f), Pr(e), Pr(f, e).

         public class WordAligner {
             protected WordAligner (Statistics s) {
                 statistics = s;
             }
             public Alignment alignSentencePair(SentencePair s) {
                 return null;
             }
             public float prob_f(String f) {
                 return (float) statistics.prob_f(f);
             }
             public float prob_e(String e) {
                 return (float) statistics.prob_e(e);
             }
             public float prob_e_and_f(SentencePair s, String f, String e) {
  19. 19.
                 return (float) statistics.prob_f_and_e(s,f,e);
             }
             public List getFrenchWords (SentencePair s) {
                 return s.getFrenchWords();
             }
             public List getEnglishWords (SentencePair s) {
                 return s.getEnglishWords();
             }
             public float abs (float i) {
                 return Math.abs(i);
             }
             public float numFrenchWordsInSentence (SentencePair s) {
                 return s.getFrenchWords().size();
             }
             public float numEnglishWordsInSentence (SentencePair s) {
                 return s.getEnglishWords().size();
             }
             public float getSentenceID (SentencePair s) {
                 return s.getSentenceID();
             }
             public boolean addAlignment(float englishPosition, float frenchPosition, boolean sure) {
                 int e = Math.round(englishPosition);
                 int f = Math.round(frenchPosition);
                 alignment.addAlignment(e, f, sure);
                 return true;
             }

             /** GET methods **/
             public String getString(List L, float i) {
                 if (L == null || L.size() == 0) return "";
                 if (i >= L.size()) i = L.size()-1;
                 if (i < 0) i = 0;
                 return (String) L.get(Math.round(i));
             }

     Also: getBoolean, getNumber

             /** ADD methods **/
             public boolean addString (List L, float i, String o) {
                 if (L == null) L = new ArrayList();
                 if (i >= L.size()) i = L.size()-1;
                 if (i < 0) i = 0;
                 L.add(Math.round(i), o);
                 return true;
             }

     Also: addBoolean, addNumber
  20. 20.
             /** FIELDS **/
             public LinkedList L1 = new LinkedList();
             public LinkedList L2 = new LinkedList();
             public LinkedList L3 = new LinkedList();
             public LinkedList L4 = new LinkedList();
             public LinkedList L5 = new LinkedList();
             public float N1 = 0; public float N2 = 0; public float N3 = 0; public float N4 = 0; public float N5 = 0;
             public float N6 = 0; public float N7 = 0; public float N8 = 0; public float N9 = 0; public float N0 = 0;
             public boolean B1 = true; public boolean B2 = true; public boolean B3 = true; public boolean B4 = true; public boolean B5 = true;
             public String S1 = ""; public String S2 = ""; public String S3 = ""; public String S4 = ""; public String S5 = "";
             public Alignment alignment = new Alignment();
             public static Statistics statistics;
         }

     An extension of WordAligner class: This is the base class for the random initialization. Several instances of this class are made, then subjected to many generations at a higher than normal mutation rate. Mutation occurs within the for-loop.

         public class random extends WordAligner {
             public Alignment alignSentencePair(SentencePair sentencePair) {
                 alignment = new Alignment();
                 N1 = numEnglishWordsInSentence(sentencePair);
                 N2 = numFrenchWordsInSentence(sentencePair);
                 for (N3 = 0; N3 < N2; N3++) {
                     B5 = addAlignment(N4, N3, true);
                 }
                 return alignment;
             }
             public random(Statistics s) {
                 super(s);
             }
         }
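The time limit described under "Halting problem" in Section 4 is not part of the listings above. A minimal sketch of how such a limit might be imposed with the standard java.util.concurrent classes follows; the class name, the `callWithTimeout` helper, and the choice of a fallback value for candidates that time out or throw are assumptions, not the code used in this research.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: evaluate a candidate on a worker thread with a time limit. */
class TimeoutSketch {

    /** Runs task; returns fallback if it fails or does not finish within timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable);
            worker.setDaemon(true);   // a runaway candidate must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a loop that never checks for interruption keeps spinning.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;          // the candidate threw at runtime; treat it as unfit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        double ok = TimeoutSketch.<Double>callWithTimeout(() -> 9.052, 1000, 0.0);
        double looped = TimeoutSketch.<Double>callWithTimeout(() -> {
            while (true) {
                Thread.sleep(10);     // stands in for an evolved infinite loop
            }
        }, 100, 0.0);
        System.out.println(ok + " " + looped);
    }
}
```

Returning a fallback fitness of zero is one way to make non-terminating candidates unattractive to selection; discarding them outright, as is done for programs with compile errors, would be an alternative.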