
A Survey of Learning Methods in Learning Finite Automata (FA)

Submitted by: Priscilla Chia
Supervised by: Dr. Suresh K. Manandhar

Final Year Project Report, 1998
Computer Science Department, University of York
Abstract

This report is a survey of learning methods used in learning finite automata (FA). The issues that arise in machine learning are highlighted, and the surveyed methods are analysed according to how they deal with these issues. The report also looks at how additional information is learnt from the information given by the teacher. We survey six algorithms with respect to the learning methods employed in the two parts of the learning process: building a hypothesis and evaluating the hypothesis. The methods are categorised into probabilistic and non-probabilistic. We conclude with a discussion of the ability of a hypothesis to rectify errors from past experience rather than only learn from new experience.
Acknowledgement

I am very grateful to my supervisor, Dr. Suresh Manandhar, for his invaluable help and advice throughout this project. I would also like to thank my friends and family for their support, especially Mum and Dad at home.
CONTENTS

1. INTRODUCTION ............................................................. 1
2. LEARNING ................................................................. 2
   2.1 Learning in General .................................................. 2
   2.2 The Issues in Learning ............................................... 5
   2.3 Learning Finite Automata (FA) ........................................ 6
   2.4 Learning Framework ................................................... 7
       2.4.1 Identification in the Limit .................................... 7
       2.4.2 PAC View ....................................................... 8
       2.4.3 Comparison ..................................................... 8
       2.4.4 Other Variations of Learning Framework ......................... 9
   2.5 Results on Learning Finite Automata ................................. 10
3. NON-PROBABILISTIC LEARNING FOR FA ....................................... 11
   3.1 Learning with Queries ............................................... 13
       3.1.1 L1: by Dana Angluin [Angluin 87] .............................. 14
       3.1.2 L2: by Kearns and Vazirani [Kearns et al 94] .................. 21
       3.1.3 Discussion .................................................... 27
   3.2 Learning without Queries ............................................ 29
       3.2.1 L3: by Porat and Feldman [Porat and Feldman 91] ............... 30
       3.2.2 Running L3 on Worked Examples ................................. 35
       3.2.3 Discussion .................................................... 39
   3.3 Homing Sequences in Learning FA ..................................... 42
       3.3.1 Homing Sequence ............................................... 43
       3.3.2 L4: No-Reset Learning Using Homing Sequences .................. 43
       3.3.3 Discussion .................................................... 46
   3.4 Summary (Motivation Forward) ........................................ 47
4. PROBABILISTIC LEARNING .................................................. 50
   4.1 PAC Learning Using Membership Queries Only .......................... 50
       4.1.1 L5: A Variation of the Algorithm L1 [Angluin 87; Natarajan 90]  50
       4.1.2 Discussion .................................................... 51
   4.2 Learning through Model Merging ...................................... 54
       4.2.1 Hidden Markov Model (HMM) ..................................... 54
       4.2.2 Learning FA: Revisited ........................................ 55
       4.2.3 Bayesian Model Merging ........................................ 57
       4.2.4 L6: by Stolcke and Omohundro [Stolcke et al 94] ............... 60
       4.2.5 Running L6 on Worked Examples ................................. 64
       4.2.6 Discussion .................................................... 67
   4.3 Summary ............................................................. 68
   4.4 Chapter Appendix .................................................... 70
5. CONCLUSION AND RELATED WORK ............................................. 71
REFERENCES ................................................................. 74
1. Introduction

The class of finite state automata (FA) is studied from a machine learning perspective, which involves both the issues of learning and the properties of that particular class. This report is a survey of the learning methods studied and employed in learning FA. We give an overview of learning in general in Section 2.1 and of the issues in learning in Section 2.2, with their application to learning FA in Section 2.3. The two frameworks employed extensively in machine learning, learning in the limit and PAC learning, are explained in Section 2.4. The complexity of learning FA itself has been studied, and the results are given in Section 2.5. The learning methods surveyed are divided into two main chapters, in which the various learning algorithms are studied and compared. Non-probabilistic learning is discussed in Chapter 3, with the motivation towards probabilistic learning given in Section 3.4, before probabilistic learning is discussed in Chapter 4. The results of the survey appear in each chapter, and the conclusion, together with related work in machine learning, is given in Chapter 5.
There are six algorithms discussed, and each is referred to as L1-L6 throughout this report, corresponding to the following:
• L1: [Angluin 87]
• L2: [Kearns et al 94]
• L3: [Porat and Feldman 91]
• L4: [Rivest et al 87]
• L5: [Angluin 87; Natarajan 90]
• L6: [Stolcke et al 94]

We follow the standard definition of FA as studied in automata theory [Hopcroft et al 79; Trakhtenbrot 73] and use the following terminology and notation for any FA M:
set of states, Q : the finite set of states q of M
initial state, q0 : the start state for all input strings
final state : the state reached after M has processed an input string
accepting state : a final state reached by a string that is recognised by M
rejecting state : a final state reached by a string that is not recognised by M
transition, δ(q,a) : the move from a state q on a symbol a from the alphabet set
alphabet set, A : the finite set of symbols; the binary set {0,1} is used throughout
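As an illustrative sketch (not taken from the report), the terminology above can be written out as a concrete deterministic FA over the binary alphabet. The example language, strings with an even number of 1s, is a hypothetical choice made here for illustration:

```python
# A minimal sketch of a deterministic FA M = (Q, A, delta, q0, F).
# The example language -- binary strings with an even number of 1s --
# is a hypothetical choice for illustration only.

Q = {"q0", "q1"}              # set of states
A = {"0", "1"}                # alphabet set: the binary set {0, 1}
q0 = "q0"                     # initial state
accepting = {"q0"}            # accepting states; q1 is a rejecting state
delta = {                     # transition delta(q, a) -> next state
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}

def recognises(s):
    """Run M on string s; the final state reached decides acceptance."""
    q = q0
    for a in s:
        q = delta[(q, a)]     # follow one transition per input symbol
    return q in accepting     # accepting final state => s is recognised

print(recognises("0110"))     # two 1s -> True
print(recognises("010"))      # one 1 -> False
```

The hypothesis spaces considered later in the report are built from exactly such machines, with the minimum DFA as the preferred representative.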
2. Learning

2.1 Learning in General

Learning in general means the ability to carry out a task with improvement from previous experience. It involves a teacher and a learner. The learning process usually takes place in an environment which constrains (i) the communication between the learner and the teacher: how the teacher is to teach or train the learner, and how the learner is to receive input from the teacher; and (ii) the elements or tokens of information communicated between the teacher and learner: a class of objects (i.e. a concept) and a description of a subclass (i.e. an object).

Example 1(a): Environment for learning a class of vehicles

The environment in which the learning process takes place involves a teacher giving descriptions of ships and a learner drawing a conclusion about what a ship looks like from the descriptions received. The teacher describes ships (i.e. the subclass of vehicles) by providing pictures (i.e. using pictorial means) of ships, and the learner responds (i.e. communicates with the teacher) through some form of visual mechanism (i.e. by detecting shapes or colours of objects in pictures) to analyse the pictures received from the teacher. This environment only allows the teacher and learner to communicate through pictures, whereas in another environment other forms of encoding of descriptions may be used (e.g. tables of attributes: width, length, windows, engine capacity, etc.).

[Figure omitted.] Figure 2.1: Finite representation of a possibly infinite class A of elements cs for 1 ≤ s ≤ m, where m may be finite or infinite, by another finite class B with elements ri for 1 ≤ i ≤ p, where p is some finite number.

The learner is to learn an unknown subclass from the class with the help (i.e. some form of training) of the teacher, who provides descriptions of the unknown subclass.
Since the subclass to be learnt may be infinite in size, a finite representation is needed for the probable subclasses hypothesised during learning. The task of the learner is to hypothesise a (finite) representation of the unknown subclass, as shown in Figure 2.1, where a possibly infinite class A is represented by a finite class B. Thus learning the class A is to learn its class-B representation. In Example 1(a) above, the learner is to produce a hypothesis of the ship subclass. A finite representation for the hypothesis is necessary, as not every subclass can be finitely presented or described (i.e. by presenting all elements of the unknown subclass to the learner) by the teacher, as shown in the class of vehicles above, where the subclass (i.e. ships) is infinite. Note that there are finitely many
different 'types' of ships, just as there are finitely many different 'types' of vehicles: in both cases the ships and vehicles are infinite in number, but the 'types' are finite.

Instances from the class used to describe a particular subclass are called examples. These examples are usually classified by a teacher (i.e. a human, or some operation or program available in the environment) with respect to the particular subclass being learnt, as positive (a member of the subclass) or negative (a non-member of the subclass) examples. A set of classified or labelled examples is called the example space. Only a subset of the example space is used by the learner each time an unknown subclass is to be learnt. This subset, used in training the learner, is known as the training set. Each example space contains information (implicit properties or rules, which may be infinite) relevant to distinguishing one subclass from another in the given class. The constraints in the environment also determine which type of examples (i.e. positive only, negative only, or both) can be provided by the teacher to form the example space. For instance, it may not be possible to collect negative examples, so the teacher is restricted to positive examples only, which may be only a partial set (i.e. not all members of the unknown subclass are known, even to the teacher).

The learner, or learning algorithm, is therefore required to learn the implicit properties or rules from the information given (built into what is called experience) about a particular subclass. The properties learnt are stored in the learner's hypothesis (i.e. the conclusion or explanation) drawn about the subclass. An infinite number of hypotheses, in any form of representation (i.e. decision tree, propositional logic expression, finite automaton, etc.), could be produced that hold the properties obtained from the information received. This results in searching a large hypothesis space.
It should be noted that the hypothesis space could be expressed in the same descriptive language used to describe the unknown subclass. In Example 1(a), if the class of vehicles is represented in the form of propositional logic expressions, then the hypothesis may be the exact propositional logic expression that represents the chosen unknown subclass (i.e. ships), or it may be in some other form of representation that is equivalent to the propositional logic expressions used. A set of criteria is necessary to limit (reduce) the size of the search space. Given a reduced hypothesis space that satisfies the set of criteria, the learning goal is then needed for selecting and justifying a hypothesis from the hypothesis space as the finite representation of the unknown subclass. Together with other knowledge about the rules for manipulating the descriptive language, the set of criteria and the learning goal form what is called background knowledge, which guides the learner in the learning process.

Example 1(b): Learning process of Example 1(a)

Suppose that a hypothesis for ships takes the form of a collection of a finite number of attributes for a ship (i.e. size, engine capacity, shape, weight, anchor and other properties of a ship). The criteria for the hypothesis space could admit hypotheses that fulfil, say, five out of six attributes used, and the learning goal could be to select a hypothesis that satisfies the criteria with the simplest data structure of some form and that correctly identifies, say, the next ten examples. There could be an infinite number of attributes, but the criteria in the background knowledge reduce the hypothesis space.
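Example 1(b) can be sketched in code. The attribute names, values and the five-out-of-six threshold below are hypothetical choices standing in for the report's informal description:

```python
# A sketch of Example 1(b): a hypothesis is a collection of ship
# attributes, and the criteria admit hypotheses that agree with the
# target on at least 5 of 6 attributes. All names and values here are
# hypothetical illustrations.

TARGET = {"size": "large", "engine": "diesel", "shape": "hull",
          "weight": "heavy", "anchor": "yes", "sail": "no"}

def meets_criteria(hypothesis, target=TARGET, required=5):
    """Count the attributes on which the hypothesis agrees with the target."""
    matches = sum(1 for k, v in target.items() if hypothesis.get(k) == v)
    return matches >= required

h1 = dict(TARGET, sail="yes")                   # disagrees on 1 of 6
h2 = dict(TARGET, sail="yes", weight="light")   # disagrees on 2 of 6

print(meets_criteria(h1))  # True: 5 of 6 attributes match
print(meets_criteria(h2))  # False: only 4 of 6 match
```

Even with infinitely many conceivable attributes, a threshold criterion of this kind prunes the hypothesis space to those candidates worth justifying against the learning goal.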
Thus, the learning scenario (Figure 2.2) consists of a given class, C, of subclasses and an example space, T, from which the training set, t, is drawn. Examples in T are used to describe an unknown subclass, c, in C. The aim of the learning algorithm, L, is to produce a hypothesis, h, from a hypothesis space, H, using information from t and satisfying the conditions set out in the background knowledge. L is to build an h that is equivalent to c. Ideally, h is exactly the same as c, i.e. h is the exact representation of c. Due to the incompleteness of the t received (the teacher usually does not have complete information regarding c), h is usually taken to be equivalent to c only to some extent expressed in the background knowledge. In both cases, learning relies on the information contained in t and given by the teacher.

L : T → H, where T is the set of training sets t for a subclass, c, in C (also called the example space).
L(t) = h (≡R c), with t ∈ T, h ∈ H, c ∈ C,

where ≡R is the equivalence relation specified by the learning goal used in selecting the hypothesis, h, using t. The selected h contains the learnt properties or rules of c that are obtained through the information from t.

[Figure omitted.] Figure 2.2: The learning scenario of a learner or learning algorithm, L, with a given environment. The teacher draws t from the example space T to describe c in the class C; L maps t to a hypothesis h in the hypothesis space H, guided by the background knowledge: the set of criteria, the learning goal, and the type of representation (the descriptive language for H).
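The mapping L : T → H can be sketched as a minimal learner interface. The trivial memorising learner below is a hypothetical placeholder, not one of the surveyed algorithms L1-L6:

```python
# A sketch of the learning scenario L : T -> H. The training set t is a
# set of labelled examples (x, label); L returns a hypothesis h, here
# represented as a callable. The memorising learner is a hypothetical
# stand-in for the surveyed algorithms.

def L(t):
    """Build a hypothesis h from a training set t of (example, label) pairs."""
    positives = {x for x, label in t if label}   # learnt 'properties' of c
    def h(x):
        return x in positives                    # h classifies examples
    return h

t = {("ship", True), ("car", False), ("boat", True)}
h = L(t)
print(h("ship"))   # True: h is consistent with the training set
print(h("car"))    # False
```

A memoriser is consistent with t but generalises poorly; the algorithms surveyed later differ precisely in how they choose an h that satisfies the learning goal ≡R beyond the training set.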
2.2 The Issues in Learning

The algorithms used in learning are 'ways' of achieving the learning goal under the set of criteria in the background knowledge. 'Ways' here are methods of constructing a hypothesis from the information in the set of examples. As shown in Figure 2.2, the learning algorithm, L, has two distinct phases in the learning process:

Phase 1: forming a hypothesis, h, from the set of examples, t. [shown as the arrow from T to H in Figure 2.2]
Phase 2: selecting and justifying h as a finite representation of the unknown subclass, c. [shown as the arrow from H to C in Figure 2.2]

The nature (or design) of L, and the feasibility of the learning problem itself, is determined by the following factors:

1. The example space, T. Usually considered arbitrary, in that various kinds of information (training sets) can be used to describe c.
2. The classification of the training set, t, usually by a teacher or an operation carried out in the environment with respect to a particular subclass, c.
   • Noisy examples are considered where the teacher may classify instances wrongly.
   • The type of examples to be presented (i.e. positive only, negative only, or both).
3. The presentation of t to L. Whether elements from t are fed into L one by one, in small groups, or as a whole batch, and whether the elements are presented in any particular order (i.e. in lexicographic order or shortest length first).
4. The size of t. Intuitively, a small t should suffice for an efficient and 'intelligent' learner or learning algorithm. In machine learning, the size of t contributes to the computational complexity of a learning algorithm: the larger t is, the longer (or more complicated) the computation.
5. The choice of representation for the hypothesis space, H. This involves the issues of how much information should be, and can be, captured by a particular choice of representation.
A rich descriptive language, though ideally required as the representation, means more complex computation and larger resource requirements (i.e. memory storage), whereas a simple form of representation may not capture sufficient information for learning.

6. The selection criteria for a hypothesis, h, and its justification as an equivalent of c.

All of the above factors except the last constitute a major part of the design of an algorithm in machine learning, and are exhibited in Phase 1 of L. The last factor, and also the choice of representation for H, are usually vital in Phase 2 of the learning process, where evaluation is carried out by human experts or by some known mechanism such as statistical confirmation or analysis. The learner, L, is said to be able to learn a class in the given environment if it can learn (i.e. by producing a hypothesis that satisfies both the criteria and the learning goal in the background knowledge provided a priori) any subclass chosen from the class.
2.3 Learning Finite Automata (FA)

This report investigates the learning process in a particular environment setting (Figure 2.3):
- Teacher: the source of the example space, T, where the description of the unknown subclass, c, takes the form of labelled strings.
- Learner: learns by receiving information in the form of labelled strings drawn from T, following the rules set out in the environment constraints.
- c: the unknown regular language or FA.

Two almost identical environments for learning are shown in Figure 2.3, differing in the class contents. The first environment (Figure 2.3(a)) consists of:
- C1: the class of all languages.
- H1: the hypothesis space of finite automata (FAs), as the finite representation for regular languages (i.e. a subclass of C1).
- T: the examples, which are labelled strings of the languages.
- Criteria: the FA accepts all examples (i.e. strings, which may or may not be only positive strings) received from the training set, t.
- Goal: to produce an FA (i.e. the selected hypothesis) that is equivalent to (i.e. that accepts) c.

The other learning environment (Figure 2.3(b)) is obtained by refining C1 to the class of regular languages only, with the hypothesis space, H2, being the minimum deterministic finite automata (DFAs). The environment shown in Figure 2.3(b) has more constraints, as the teacher is to provide descriptions using only regular languages, whereas with C1 the teacher is able to provide descriptions using other languages as well (i.e. context-free languages). This report is concerned with the learnability of finite automata (FA) using minimum DFAs as the hypothesis space. Both environments, with C1 and C2 as classes, use the same set of examples, T, which is a set of strings; the training set, t, is a set of classified strings with respect to a particular subclass of languages, c. For consistency throughout the report, the alphabet, A, for FAs will be set to the binary set {0,1}.
[Figure omitted.] Figure 2.3: (a) C1 is the class of all languages; c' is the subclass of regular languages, and H1 is the class of FAs, with the criteria for H1 being deterministic and minimal in size (number of states). (b) C2 is the class of regular languages; c is a particular subset of regular languages, and H2 is the class of minimum DFAs itself, where no criteria are needed.
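In this setting a training set t is simply a set of binary strings labelled with respect to the unknown regular language c. As a sketch (the target language, strings ending in 1, is a hypothetical choice, not from the report):

```python
# A sketch of a training set t for learning an unknown regular language c
# over the binary alphabet {0, 1}. The target language -- strings ending
# in '1' -- is a hypothetical choice for illustration.

def in_c(s):
    """Teacher's classification of string s with respect to c."""
    return s.endswith("1")

# The training set t: strings drawn from the example space T, labelled
# positive (True) or negative (False) by the teacher.
t = {(s, in_c(s)) for s in ["", "0", "1", "01", "10", "011"]}

positives = {s for s, label in t if label}
negatives = {s for s, label in t if not label}
print(sorted(positives))
print(sorted(negatives))
```

The learner's task is then to produce a (minimum) DFA consistent with these labels, which is exactly the goal pursued by the algorithms L1-L6 surveyed in Chapters 3 and 4.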
2.4 Learning Framework

Given an environment with a class of objects describing 'what is to be learnt', the two phases, Phase 1 and Phase 2, of the learning process raise two fundamental questions:
- 'how do we learn?'
- 'when do we know we have learnt?'

The former is dealt with in Phase 1; the latter, in Phase 2, was studied by Gold [Gold 67] and Valiant [Valiant 84], resulting in two major learning frameworks: identification in the limit by Gold, and probably approximately correct (PAC) learning by Valiant.

2.4.1 Identification in the limit

[Gold 67] states that learning should be a continuous process, with the learner (or learning algorithm), L, having the possibility of changing or refining its guess (i.e. hypothesis) each time new information from the training set, t, is presented. L is only required, after some finite time, to make guesses that are all the same and correct with respect to the information seen so far. Hence, the hypothesis, h, obtained after a finite time will remain the same and correct on subsequent information. The hypothesis, h, is then said to represent the unknown sub-class, c, described by t, in the limit, completing Phase 2 of the learning process. This learning framework, identification in the limit, consists of three items as formulated by Gold:

1. A class of objects
A class, C, is specified (or given) to the learner in the environment, where the form of communication between the teacher and learner is also specified. An object, c, from C is chosen for the learner to identify. [In the context of this report, the unknown object (or sub-class), c, is an FA and the class C consists of FAs.]

2. A method of information presentation
Information about the unknown chosen object is presented to the learner. The training set, t, consists of positive-only, positive and negative, or noisy examples as information describing c.
[t is just a set of labelled strings drawn from the example space, T, provided by the teacher, and the type of t depends on T: all positive strings, all negative strings, or a combination of both.]

3. A naming relation
This enables the learner to identify the unknown object, c, by specifying its name1, h. There is a function, f, for L to map names to objects in C. Here, an object, c, can have several names (hypotheses), and guesses (or hypotheses) are made under f. [L is to build an FA as the hypothesis, h, for an unknown regular language, and h could be any of the several DFAs (or TMs) that accept the unknown regular language.]

1 In [Gold 67], a name is defined as a Turing Machine (TM). Since a language identified by an FA is also identifiable by a TM, it is sufficient to say that every FA has a TM.
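Identification in the limit can be sketched as a loop over the presentation. This is a minimal Python sketch for the one case Gold shows learnable from positive-only text, the class of finite languages (cf. Table 2.2): the learner's guess after each example is simply the set of strings seen so far, so once the whole (finite) language has appeared, every later guess is the same and correct.

```python
# A minimal sketch of identification in the limit for finite
# languages from positive-only text. Helper names are our own.

def identify_in_the_limit(presentation):
    """Yield the learner's hypothesis after each presented example."""
    hypothesis = set()
    for example in presentation:
        hypothesis = hypothesis | {example}   # refine the guess
        yield frozenset(hypothesis)

# c = {"0", "01", "1"}; a text for c may repeat examples forever.
text = ["0", "01", "0", "1", "01", "1"]
guesses = list(identify_in_the_limit(text))
```

After the fourth example, all subsequent guesses coincide with c, which is exactly the success criterion of the framework.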
2.4.2 PAC view

Another learning framework is Probably Approximately Correct (PAC) learning, first proposed by Valiant [Valiant 84], which uses a stochastic setting in the learning process. The learner is required to build an (approximately correct) hypothesis that has a minimal error probability after being trained on the training set, t, constituting Phase 1 of the learning process. Phase 2 under this framework requires the learner to have a high level of confidence that the hypothesis, h, is approximately correct as a representation of the sub-class, c. The training set, t, is considered 'good enough' with a high confidence level. This is appropriate because t generally does not consist of all the positive examples needed to learn c.

The PAC framework relies on two parameters, accuracy (ε) and confidence limit (δ). A fixed but unknown distribution is assumed over the example space, T, from which training sets, t, are drawn at random. Intuitively, PAC learning seems a passive type of learning, with the learner learning only through observation of given data. However, [Angluin 88] and [Natarajan 91] showed that the PAC learner can be made active by using queries: equivalence, membership, subset, superset, exhaustiveness and disjointness queries [Angluin 87].

Given real numbers δ and ε, each between 0 and 1, there is a minimum sample size (i.e. a size for the training set, t) such that, for any unknown sub-class, c, under a fixed but unknown distribution on the example space, T: with probability at least (1 - δ), the hypothesis h misclassifies at most a fraction ε of the test set, where the test set is another subset of T, different from t, used to test the validity of h.
PAC learning is desirable for obtaining a good approximation to c, as in most cases it is computationally difficult to build an accurate (exact) hypothesis, and [Angluin 88] and [Natarajan 91] have shown that PAC learning can be easily applied to other non-stochastic learning frameworks.

2.4.3 Comparison

The two frameworks have distinct criteria and goals for learning, which deal with Phase 2 of the learning process (Table 2.1). However, both suggest learning by building tentative hypotheses from pieces of information in the form of strings from the training set, t (Figure 2.4). Each tentative hypothesis is a new 'experience' (i.e. a modified hypothesis with slight changes, or a totally new hypothesis) as new information is received from t. The final hypothesis, h', taken to represent the sub-class, c, may be totally different from previous hypotheses.

              Identification in the limit            PAC learning
Goal          The same hypothesis (or guess),        P(error of h w.r.t. c > ε) < δ on a
              after a finite time, for all           sufficiently large sample, t, where
              subsequent information received.       h is the hypothesis, c the unknown
                                                     sub-class, and δ and ε the given
                                                     parameters.
Criteria      The hypothesis (guess) made must       The hypothesis, h, has error at
              be consistent (correct) with the       most ε with respect to T.
              information seen so far.

Table 2.1: Comparison between the identification in the limit and PAC learning frameworks.
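The PAC acceptance criterion in the right-hand column of Table 2.1 can be sketched as an executable check. This is a minimal Python sketch; the target c, the hypothesis h and the test set are illustrative, and the helper names are our own.

```python
# A sketch of the PAC acceptance test: estimate the error of a
# hypothesis h against the target c on a test set drawn from the
# fixed distribution over T, and accept h if the observed error
# stays within the accuracy parameter epsilon.

def observed_error(h, c, test_set):
    """Fraction of the test set on which h disagrees with c."""
    wrong = sum(1 for x in test_set if h(x) != c(x))
    return wrong / len(test_set)

def pac_accept(h, c, test_set, epsilon):
    return observed_error(h, c, test_set) <= epsilon

# Target: strings containing a '1'; hypothesis: non-empty strings.
c = lambda s: "1" in s
h = lambda s: len(s) > 0
test_set = ["", "0", "1", "01", "10", "11", "00", "010"]
err = observed_error(h, c, test_set)   # h is wrong on "0" and "00"
```

With ε = 0.3 the hypothesis is accepted as approximately correct; with ε = 0.2 it is not. The confidence parameter δ enters through the sample size, not through this check itself.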
[Figure 2.4 diagram: the teacher draws a training sequence t = {t1, t2, t3, ...} describing c from T; the learner L produces tentative hypotheses h1, h2, h3, ... in H, within an environment with oracles.]

Figure 2.4: A learning scenario with the learning algorithm, L, making several tentative hypotheses (i.e. h1, h2, h3) in H from a sequence of labelled examples (i.e. t1, t2, t3).

Recent studies [Kearns et al 94; Rivest et al 88; Porat 91] are carried out under Gold's proposed learning framework, as it is more natural with respect to human learning: we can always change our perception (hypothesis) each time new information is received while still being consistent with the previous information, and we never know (or predict) when we have finished learning (a perpetual process in humans).

2.4.4 Other variations of learning framework

There are two other learning frameworks mentioned by Gold in [Gold 67]:

1. Finite identification
The learner stops the presentation of information after a finite number of examples and identifies the sub-class, c. The learner is to know when it has acquired a sufficient number of examples and is therefore able to identify c.

2. Fixed-time identification
A fixed finite time2 is specified a priori (i.e. usually as background knowledge), independently of the unknown object presented, at which the learner stops learning and identifies the unknown object.

These two frameworks seem to ask too much of the learner, who is 'forced' to identify the sub-class, c, by outputting a hypothesis, h, once some predicted factor or condition is reached. In finite identification, the learner must be able to predict the number of examples needed to learn, and stop learning once the predicted number of examples has been presented. Fixed-time identification, on the other hand, requires the learner to know in some way 'when' it is able to stop learning. Learning, as mentioned earlier, is to identify or distinguish the ambiguous lines separating each sub-class in a learning environment. Being able to tell exactly when (i.e.
able to predict those lines) to stop learning means that there is no need for learning to start in the first place.

2 Time is taken, throughout the report, to correspond to the computational complexity and the termination of a successful learning algorithm.
2.5 Results on learning finite automata

The complexity and learnability of finite automaton identification have received extensive research [Gold 67; Angluin 87; Vazirani et al 88]. The computational complexity is considered here with respect to the size of the hypothesis space (minimum DFAs) searched and the size of the training set (examples) required. Complexity results that have dealt with computational efficiency are as follows:

1. Identification in the limit and the learnability model [Gold 67]: Gold classifies the classes of languages that are learnable in the limit under three categories of information presentation (Table 2.2). The class of regular languages is shown not to be learnable from positive-only examples.

2. Inferring a consistent DFA or NFA within a factor (1 + 1/8) of the size of the minimum consistent DFA is NP-complete, given positive and negative examples [Li et al 87].

3. There is an efficient learning algorithm that finds the minimum DFA consistent with given positive and negative data, with access to membership and equivalence queries [Angluin 87], using an observation table as a representation of the FA.

4. Learning FA by experimentation3 (as in 3 above) [Kearns et al 94], using a classification tree as a representation of the FA, in polynomial time.

5. State characterisation and data matrix agreement are introduced for the problem of automaton identification [Gold 78].

6. Inferring minimum DFAs and regular sets from positive and negative examples only is NP-complete [Gold 67, 78; Angluin 78].

Learnability model                                 Class of languages
Anomalous text                                     Recursively enumerable; Recursive
Informant                                          Primitive recursive; Context sensitive;
(using positive and negative examples/instances)   Context free; Regular; Superfinite
Text                                               Finite cardinality
(using positive-only examples/instances)

Table 2.2: Learnability and non-learnability of languages [Gold 67], where a superfinite class is one containing all finite languages and at least one infinite regular language.
These results show that inferring a DFA directly from examples alone is NP-hard, so other learning methods are employed to learn FAs successfully. The methods used in successfully learning FAs are surveyed in the following chapters.

3 Experimentation: a form of learning where the learner is able to experiment with chosen strings (i.e. selected by the learner, not drawn from the provided training set) during training.
Chapter 3: Non-Probabilistic Learning

3. Non-Probabilistic Learning for FA

In building a hypothesis, h, for an unknown FA, c, the learning algorithm, L, usually receives information (i.e. labelled strings) describing c from a training set, t. L is to build an h that is equivalent to c given the information received so far. Ideally, h is exactly the same as c. In practice, however, as c is unknown, the teacher may not have the complete information required to build the exact FA, and h is then taken to be an approximation to c, to an extent specified in the background knowledge (i.e. an approximately equivalent, or probably approximately correct, h rather than the usual exact h). Learning relies on L making several guesses based on information provided by the teacher in the following 'ways', discussed in this chapter:
a) learning with queries, section 3.1
b) learning without queries, section 3.2
c) learning with homing sequences, section 3.3

L makes guesses about c through a number of tentative hypotheses (i.e. tentative FAs), M', built from the information received. Each guess is a refinement or modification of the previous guess (hypothesis), in which new properties of the FA (i.e. the characteristics and elements of the FA) are discovered. A guess made by L is also called a conjecture. The learner produces several conjectures until the learning goal is achieved, that is, until a final conjecture is accepted as an FA equivalent to c. All information received and all properties learnt through the modifications are kept in a data structure. A modification to the data structure is called an update, and a new hypothesis is built based on the updated data structure.
Hence, the data structure has several roles:
a) a representation of the properties (to be learnt) of an FA:
   • the finite number of states
   • the transitions (representing the transition function)
   • the set of distinguishing strings
   • the accepting and rejecting states
b) a record of the modifications made (i.e. updates):
   • incorporating more information received: strings in t
   • recording more properties learnt
c) a reference from which to build the next tentative FA, M', after each update

The data structures used by the learner in this chapter are briefly explained below; a detailed explanation of the updates is given in the relevant section, indicated in brackets:

1. observation table (see section 3.1.1)
A two-dimensional table, as in Figure 3.5, where the rows correspond to the states and the columns correspond to the set of distinguishing strings of the FA. The entries in the table are the values '0' and '1', corresponding to the transition function of the FA reaching a rejecting or an accepting state respectively.
[Figure 3.5 diagram: an observation table with rows s1, s2, s3, ... (states), columns e1, e2, ... (distinguishing strings), and 0/1 entries giving the transition function δ(q,x): 0 if qx is a rejecting state, 1 if qx is an accepting state, for some string x from state q.]

Figure 3.5: Observation table representing the elements of an FA: states (rows), distinguishing strings (columns) and transition function (table entries).

2. classification tree (see section 3.1.2)
A binary classification tree, where the leaves correspond to the states of the FA and the distinguishing strings are represented by the internal nodes (and root) of the tree, as shown in Figure 3.6. The left and right paths from an internal node correspond to the transition function of the FA reaching a rejecting or an accepting state respectively.

[Figure 3.6 diagram: a classification tree with root d1, internal nodes d2, d3, ... (distinguishing strings) and leaves s1, s2, ... (states); the left path from a node means qx is a rejecting state and the right path means qx is an accepting state, for some string x from state q.]

Figure 3.6: Classification tree representing the elements of an FA: states (leaves), distinguishing strings (internal nodes including the root) and transition function (the left and right paths).

3. minword(q) (see section 3.2.1)
A string used to reach a state q of an FA from the initial state q0. The set of minword(q) strings thus corresponds to the states of the FA, as shown in Figure 3.7.

[Figure 3.7 diagram: (a) a four-state FA accepting all strings with an even number of 0's and an even number of 1's, with minword(q0) = λ, minword(q1) = 0, minword(q2) = 1, minword(q3) = 01. (b) a two-state FA accepting all non-empty strings, with minword(q0) = λ, minword(q1) = 0.]

Figure 3.7: The sets of minword(q) for two FAs. (a) four minword(q) strings representing the states of the FA that accepts all strings with an even number of 0's and 1's. (b) two minword(q) strings representing the states of the FA that accepts all non-empty strings.
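The minword(q) strings of Figure 3.7 are shortest access strings, so they can be computed by a breadth-first search from q0: the first string that reaches a state is a shortest one. A minimal Python sketch (state names and the delta encoding are our own), using the four-state FA of Figure 3.7(a):

```python
from collections import deque

# Compute minword(q) for every state q of a DFA by BFS from q0.
# The DFA below accepts strings with an even number of 0's and an
# even number of 1's (Figure 3.7(a)); names are illustrative.

delta = {("q0", "0"): "q1", ("q0", "1"): "q2",
         ("q1", "0"): "q0", ("q1", "1"): "q3",
         ("q2", "0"): "q3", ("q2", "1"): "q0",
         ("q3", "0"): "q2", ("q3", "1"): "q1"}

def minwords(delta, start="q0", alphabet=("0", "1")):
    found = {start: ""}                  # minword(q0) = the null string
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for a in alphabet:
            nxt = delta[(q, a)]
            if nxt not in found:         # first visit gives a shortest string
                found[nxt] = found[q] + a
                queue.append(nxt)
    return found

mw = minwords(delta)
```

The result matches the figure: minword(q3) = 01, reached via q1.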
3.1 Learning with queries

Additional information regarding the unknown c can be requested by L by asking queries [Angluin 88]. The queries can be equivalence, membership, subset, superset, disjointness and exhaustiveness queries. Two of the six queries are used by the following two algorithms, L1 and L2 (see sections 3.1.1 and 3.1.2), in learning c:

1. Membership queries
The teacher returns a yes/no answer when the learner presents an input string, x, of its choice in the query, depending on whether x is accepted by the unknown FA, c.

2. Equivalence queries
The teacher returns a yes answer if the conjecture, M', is equivalent to c; otherwise it returns a counterexample, y, which is a string in the symmetric difference of M' and c.

Hence, L has access to an oracle (which could be the teacher, or some operation available in the environment), creating an active interaction between the learner and teacher in the learning process (Figure 3.8). The two queries form a pair of oracles, each used at a separate stage of learning:
a) Phase 1 of learning: updating the data structure used to construct the conjecture, M'
b) Phase 2 of learning: confirming M' as a finite representation of c (i.e. deciding when to stop learning)

[Figure 3.8 diagram: the environment of Figure 2.4 with the learner L additionally given access to oracle(s).]

Figure 3.8: Learning with additional information obtained through access to an oracle in the environment.
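The two oracles can be sketched as follows. This is a minimal Python sketch with illustrative helper names; the equivalence oracle here simply searches all short strings for a disagreement and returns the first one found as the counterexample y, or None when the conjecture agrees on all of them (a stand-in for the teacher, not part of either algorithm).

```python
from itertools import product

# A teacher offering membership and equivalence queries for an
# unknown language c over A = {0, 1}. Names are illustrative.

def make_teacher(c_accepts, alphabet=("0", "1"), max_len=6):
    def membership(x):
        return c_accepts(x)                  # yes/no on a chosen string x

    def equivalence(m_accepts):
        for n in range(max_len + 1):         # search strings by length
            for chars in product(alphabet, repeat=n):
                y = "".join(chars)
                if m_accepts(y) != c_accepts(y):
                    return y                 # counterexample
        return None                          # conjecture accepted
    return membership, equivalence

# c: the non-empty strings; try a wrong conjecture rejecting everything.
membership, equivalence = make_teacher(lambda s: len(s) > 0)
```

A conjecture rejecting every string draws the counterexample "0", while the correct conjecture draws a yes answer (None).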
3.1.1 L1: by Dana Angluin [Angluin 87]

The observation table (e.g. Figure 3.5) is the data structure used to store the information and learnt properties of the unknown FA, c. The rows and columns are represented by strings, based on information from the training set, t, and the set of distinguishing strings learnt, respectively. Each row is viewed as a vector with attribute values '0' and '1' (i.e. the '0' and '1' table entries in each row, one per column) representing a state of c. Thus the string representing a row also represents a state of c. A string is said to represent a state q when it can be used to reach q from the initial state q0. The vectors are used to distinguish the rows, and hence the states of c.

Alternatively, each row can be viewed as a set of distinguishing strings e, where each e represents a column of the table. The table entry ('1' or '0') in a row depends on whether the e for the corresponding column is a distinguishing string for that row (the state represented) or not. There may be rows with the same vector (i.e. with the same set of distinguishing strings), and by the Myhill-Nerode theorem of equivalence classes these rows are said to be equivalent to each other, that is, to represent the same equivalence class x. Thus we use the alternative view of a row above in referring to the distinct states represented by these rows: the distinct state, that is, the equivalence class x, is represented by the distinct row vector. In Figure 3.9 there are only two distinct rows, s1 and s2, with vectors '0' and '1' and strings λ and 0 respectively. The remaining rows have the same vector '1' as row s2. Thus there are only two distinct states, represented by the sets of strings {λ} and {0, 1, 00, 01}. The sets of distinguishing strings are ∅ and {λ} for the two distinct states respectively.
          e1 = λ
s1 = λ      0        Rows: s1, ..., s5
s2 = 0      1        Columns: e1
s3 = 1      1        training set t = {-λ, +0, +1, +00, +01}
s4 = 00     1        distinguishing strings = {λ}
s5 = 01     1        States: s1, s2

Figure 3.9: Observation table with five rows, representing two distinct states, with strings from t.

We now specify the three main elements of the observation table O, as shown in Figure 3.10, used by the learner L1 during learning to represent the properties of and information about c:

1. A non-empty prefix-closed* set of strings, S.
This set starts with the null string, λ. The rows of the observation table are each represented by a string in S ∪ S.A. There are two distinct divisions of rows in O: the upper division (the shaded rows in Figure 3.10) is represented by the strings in S, and the lower division by the strings in S.A. Each row of the upper division is the particular state reachable through some s ∈ S from the initial state q0. The

* A prefix-closed set is one where every prefix of each member is also an element of the set.
rows in the lower division of O therefore represent the next-states reached through transitions a ∈ A from the rows of the upper division. Thus S represents the states discovered (learnt) by the learner in the course of learning.

2. A non-empty suffix-closed** set of strings, E.
This set also starts with the null string, λ. The columns of the observation table are represented by the strings in this set. The vector for each row is a collection of entries for the strings of E. Thus the distinct subsets of E represented by the distinct row vectors are used to identify the distinct states represented by the strings in S ∪ S.A. In Figure 3.10, each of ∅, {λ} ⊆ E (represented by the vectors '0' and '1' respectively) identifies one of the two distinct states, represented by {λ} and {0, 1, 00, 01} in S ∪ S.A. Thus E represents the characteristics of the states, learnt through subsets of strings in E.

3. A mapping function, T: (S ∪ S.A).E → {0,1}, where T(x.e) = '1' if the string x.e ∈ c and '0' otherwise, with x ∈ S ∪ S.A. This mapping function represents the transition function of the FA, δ(q0, x.e).

[Figure 3.10 diagram: the observation table of Figure 3.9 with S = {λ, 0}, E = {λ} and S ∪ S.A = {λ, 0, 1, 00, 01}; the upper division holds the rows for S and the lower division the rows for S.A, with table entries T(x.e) for x ∈ S ∪ S.A, e ∈ E.]

Figure 3.10: Observation table O with upper division (shaded section) and lower division of rows from the set S ∪ S.A.

Each of the following two properties of O, closed and consistent, is used by L1 as a guide to carry out updates (i.e. the extension of rows and columns) during learning:

a) closed
As the rows of the lower division of O are next-states of states in the upper division, reached by taking transitions on symbols of A, every row vector of the lower division must also exist in the upper division: the closed property of O. Thus, for every string s' in S.A there is an s in S such that both strings, s' and s, have the same vector.
As shown in Figure 3.10, the vectors of the lower division of O already exist in the upper division, so all next-states are existing states.

b) consistent
Each pair of rows with the same vector (i.e. the same subset of distinguishing strings) should represent the same state, and the next-state vectors reached from such a pair should likewise be the same vector, representing the same next-state: the consistent property of O. Thus, for any pair of strings s1, s2 in S with row(s1) = row(s2), we require row(s1.a) = row(s2.a) for all a in A. As shown in Figure 3.11, the rows represented by the strings λ and 11 are consistent when both strings, representing the same distinct state (vector (1 0)), move into the same next-state(s), i.e. into rows having the same vectors.

** A suffix-closed set is one where every suffix of each member is also an element of the set.

        λ  0
λ       1  0    ← previous state
0       0  1    ← next-state
1       0  0
11      1  0    ← previous state
00      1  0
01      0  0
10      0  0
110     0  1    ← next-state
111     0  0

Figure 3.11: A consistent observation table, in which the two rows λ and 11, represented by the same row vector (1 0), lead on transition '0' to the next-state rows 0 and 110, which share the same vector (0 1) and so represent the same next-state reached from both rows of the upper division.

The observation table O is updated by extending the rows and columns (discovering more states and the characteristics of each state) using membership queries and equivalence queries, as shown in Figures 3.12 and 3.13. An update is carried out in two circumstances:

T1      λ                          T2      λ
λ       0                          λ       0
0       1                          0       1
1       1                          1       1
                                   00      1
row vector (1) not in the          01      1
upper division; make closed:
S ∪ {0}                            S = {λ, 0}, E = {λ},
                                   S ∪ S.A = {λ, 0, 1, 00, 01}

Figure 3.12: (a) Observation table T1, not closed, with the lower-division vector (1) missing from the upper division. (b) The closed T2, extending T1 with a new row in the upper division representing the new state discovered.

a) when either one of the closed and consistent properties of O does not hold:
• O is not closed when a vector is not represented in the upper division. A new state is said to be discovered (learnt), as it is a non-existing next-state. In Figure 3.12(a) the lower-division row vector (1) is not represented in the upper division, indicating that the next-state is not an existing state. O is then updated by S ∪ {s'} where s' ∈ S.A. Thus Figure 3.12(b) shows the updated O with a new string (row) in S and a new row in the upper division representing the new state learnt.
[Adding s' to S still maintains the prefix-closed property of the set, as s' is an element of S appended with an input letter from the alphabet.]

Note: Membership queries are used to complete the table entries whenever E or S is extended. The queries are made on strings in (S ∪ S.A).E, where a yes answer from the teacher gives a '1' entry in O and a no answer gives a '0'.

• O is inconsistent when two rows with the same vector have a pair of different next-state vectors. This indicates that one of the pair of strings s, s' in S actually represents a different (newly discovered) state not among the existing states (rows). As in Figure 3.13(a), a pair of rows with the same vector leads to different next-states on transition '1' in O1. O is then updated by E ∪ {a.e}, where a is the transition symbol which brought the two states to different next-states and e is the element of E at which the next-state vectors differ (i.e. at one of the attributes). Thus Figure 3.13(b) shows the updated O2 with an extra column, represented by the string '1', the transition symbol which brought the pair of rows to different rows; the element e of the previous E at which the difference is seen is λ. The table entries for this additional column are filled in using membership queries on the new (S ∪ S.A).E.

[The suffix-closed property of E is also maintained when a.e is added to E, since e is a suffix element previously added to the set.]

O1      λ  0                           O2      λ  0  1
λ       1  0                           λ       1  0  0
0       0  1                           0       0  1  0
01      0  0   ← current state         01      0  0  0
010     0  0   ← current state         010     0  0  1   ← new row (new vector)
1       0  0                           1       0  0  1
00      1  0                           00      1  0  0
011     0  1   ← next-state            011     0  1  0
0100    0  0                           0100    0  0  0
0101    1  0   ← next-state            0101    1  0  0

The rows 01 and 010 share the same vector but differ at the shaded next-state entries, '0' and '1'; make consistent: E ∪ {1.λ}.

Figure 3.13: (a) O1 is inconsistent, with different next-state vectors for a pair of rows with the same vector, representing the same state.
(b) The updated O2, with the newly learnt state represented by the new row (new vector) in the upper division of the new table.

b) when a counterexample y is returned from an equivalence query
S is extended during learning to include all the prefixes of y. The upper division of the table is thus extended with new strings, and membership queries are used to fill in all the new entries.

We now come to the questions of 'when is a tentative M' built using the data structure?' and 'how is a tentative M' built from the data structure?'. A tentative M' (Figure 3.14(b)) is built only when the observation table O has both the closed and consistent properties, as in Figure 3.14(a), where all upper rows with the same vectors lead to rows with the same vectors and all vectors of the lower division are represented in the upper division.
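The two table properties just described can be sketched as executable checks. This is a minimal Python sketch; the encoding of the table as a dictionary and the helper names are our own. T1 and T2 are the not-closed and closed tables for the language of non-empty strings.

```python
# Closed and consistent tests over an observation table stored as a
# mapping {string: 0/1}, with S, E and the alphabet A given explicitly.

def row(T, E, s):
    """The row vector of access string s over the columns E."""
    return tuple(T[s + e] for e in E)

def is_closed(T, S, E, A):
    upper = {row(T, E, s) for s in S}
    return all(row(T, E, s + a) in upper for s in S for a in A)

def is_consistent(T, S, E, A):
    for s1 in S:
        for s2 in S:
            if row(T, E, s1) == row(T, E, s2):
                if any(row(T, E, s1 + a) != row(T, E, s2 + a) for a in A):
                    return False
    return True

A = ("0", "1")
T1 = {"": 0, "0": 1, "1": 1}                      # S = {λ}: not closed
T2 = {"": 0, "0": 1, "1": 1, "00": 1, "01": 1}    # S = {λ, 0}: closed
```

T1 fails the closed test because the lower-division vector (1) never appears in its upper division, exactly the situation that triggers the S ∪ {s'} update.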
The latter question is answered with a closed and consistent O. This closed and consistent O is used to build a tentative deterministic FA (DFA), M', with each distinct vector (i.e. distinct row) of the upper division representing a state of M'. M' is then completed by adding transitions on all symbols of A from every state. The next-state is determined by looking up the row represented by the string s.a (i.e. the string resulting from taking transition a from row s) and reading off the corresponding vector. In Figure 3.14(b), the conjecture M' is built from the closed and consistent observation table O of Figure 3.14(a). The states of M' are the distinct vectors of the upper division, each shared among the strings representing the rows of O. M' is the minimum DFA that accepts all non-empty strings, and an equivalence query on M' returns a yes answer.

[Figure 3.14 diagram: (a) the final observation table O with S = {λ, 0}, E = {λ} and rows λ = 0, 0 = 1, 1 = 1, 00 = 1, 01 = 1; (b) the two-state conjecture M', with non-accepting initial state {λ} and an accepting state {0, 1, 00, 01} looping on 0,1.]

Figure 3.14: (a) The final observation table, O, for the unknown FA, c, recognising the set of all non-empty strings. The rows are elements of S ∪ S.A and the columns elements of E. (b) The conjecture, M', constructed using the closed and consistent O. The accepting state is the row having vector (1) (bold arrow); λ is always the initial state, being the first row of the table (a non-accepting state in this case). The next-state transitions are given by the strings {0, 1, 00, 01}.

Each conjecture, M', is then presented to the teacher in the form of an equivalence query. At this point, if the guess is correct, no counterexample is returned and M' is the minimum DFA equivalent to c, as in Figure 3.14(b), where an equivalence query on M' returns a yes. L1 then stops learning and outputs M' as its hypothesis.
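The construction of a conjecture from a closed and consistent table can be sketched as follows (a minimal Python sketch; helper names are our own). The distinct upper-division row vectors are the states, λ's row is the initial state, rows whose λ-entry is 1 are accepting, and the transition from row(s) on symbol a goes to row(s.a). The example table is the final one for the language of all non-empty strings.

```python
# Reading a conjecture M' off a closed and consistent observation
# table stored as {string: 0/1}.

def row(T, E, s):
    return tuple(T[s + e] for e in E)

def build_conjecture(T, S, E, A):
    states = {row(T, E, s) for s in S}
    start = row(T, E, "")                              # λ's row
    accepting = {q for q in states if q[E.index("")] == 1}
    delta = {(row(T, E, s), a): row(T, E, s + a) for s in S for a in A}
    return start, accepting, delta

def accepts(machine, x):
    start, accepting, delta = machine
    q = start
    for a in x:
        q = delta[(q, a)]
    return q in accepting

A = ("0", "1")
T = {"": 0, "0": 1, "1": 1, "00": 1, "01": 1}          # closed, consistent
M = build_conjecture(T, ["", "0"], [""], A)
```

The resulting two-state machine is the minimum DFA of Figure 3.14(b): it rejects λ and accepts everything else.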
The conjecture M’ is a minimum DFA representing the unknown FA. However, if a counterexample is returned, an update is carried out to the observation table (i.e. adding all prefixes to S) and another update if the updated table with additional prefixes is not closed and/or inconsistent. Next conjecture is built when both properties are satisfied. Membership queries are used to fill in new entries for the new rows obtained from the extended S where counterexample and its prefixes are the learner’s choice in presenting membership queries. This minimality on the number of states in the conjectured DFAs is maintained by the closed and consistent property of the observation table. Through the consistent and closed test on every updated table, two rows that have the same vector is considered as belonging to the same equivalent class by Myhill-Nerode Theorem (i.e. class x with the same behaviour for a set of distinguishing strings). Thus, building a conjecture only if a closed and consistent observation table is obtained after every update and taking only the distinct vectors as representing the distinct states in a building DFA always results in a minimum DFA. “How to start learning?”. This question brings us to the important role of the null string λ, which both S and E starts with as the first element. This string not only brings us to
discovering the initial state q0 (being the first row of the table) but also serves as the distinguishing string used to decide which of the distinct vectors are accepting or rejecting states. Being the first element of E allows every string in every row to be queried by the learner, via membership queries, as to whether it is accepted or rejected by c. Thus a row which has λ in its set of distinguishing strings, indicated by a '1' entry in the λ column, must represent an accepting state, as the string represented by that row is accepted. In Figure 3.14(a), the vector (1) for the row with string '0' represents an accepting state, as '0' is accepted at the column represented by λ, which is also in the set of distinguishing strings for row '0', indicated by the '1' entry.

The learning process thus starts with S and E having only one element each (i.e. the null string) and an initial table with only one column and three rows (one row for λ in the upper division and two next-state rows in the lower division). Another illustration is shown in Figure 3.15, with the learner trying to learn the FA that accepts all strings with an even number of 0's and 1's. The initial table is constructed as O0, which is not closed. L1 updates the table until an equivalence query initiates termination with a yes answer for the conjecture M1, after five updates (i.e. five observation tables) and two conjectures. The examples required by the learner are obtained through membership queries and counterexamples, both drawn from the training set t consisting of positive and negative examples (i.e. the '1' and '0' entries for accepted and rejected strings).
The run proceeds through five observation tables and two conjectures:

O0      λ
λ       1
0       0
1       0          S = {λ}, E = {λ}; not closed: make closed, S ∪ {0}

O1      λ
λ       1
0       0
1       0
00      1
01      0          closed and consistent; conjecture M0

M0: equivalence query → no, counterexample y = 010; add the prefixes of y to S.

O2      λ
λ       1
0       0
01      0
010     0
1       0
00      1
011     0
0100    0
0101    1          not consistent: make consistent, E ∪ {0.λ}

O3      λ  0
λ       1  0
0       0  1
01      0  0
010     0  0
1       0  0
00      1  0
011     0  1
0100    0  0
0101    1  0       not consistent: make consistent, E ∪ {1.λ}

O4      λ  0  1
λ       1  0  0
0       0  1  0
01      0  0  0
010     0  0  1
1       0  0  1
00      1  0  0
011     0  1  0
0100    0  0  0
0101    1  0  0    closed and consistent; S = {λ, 0, 01, 010}, E = {λ, 0, 1}

M1 (four states, with λ the initial and accepting state): equivalence query → yes.
Figure 3.15: Running example of learning the unknown FA that accepts the set of all strings with an even number of 0's and 1's.
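The closedness and consistency conditions that drive L1's table updates in the run above can be sketched in Python. This is a minimal sketch, not the report's implementation; `table` maps each queried string to its membership answer, and the names `row`, `is_closed` and `is_consistent` are illustrative:

```python
def row(table, s, E):
    """Row vector for string s: the membership answer for each e in E."""
    return tuple(table[s + e] for e in E)

def is_closed(table, S, A, E):
    """Closed: every one-step extension s.a has a row already present in S."""
    rows_S = {row(table, s, E) for s in S}
    return all(row(table, s + a, E) in rows_S for s in S for a in A)

def is_consistent(table, S, A, E):
    """Consistent: strings in S with equal rows keep equal rows after any symbol."""
    for s1 in S:
        for s2 in S:
            if row(table, s1, E) == row(table, s2, E):
                for a in A:
                    if row(table, s1 + a, E) != row(table, s2 + a, E):
                        return False
    return True
```

For the even-0's-and-1's target, the initial table (S = {λ}, E = {λ}) fails the closedness test, so S is extended with the string 0, after which the table is closed and consistent, as in the run above.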
3.1.2 L2: by Kearns and Vazirani [Kearns et al 94]

This algorithm uses the same principles as L1 (i.e. membership and equivalence queries, and positive and negative examples), but the data structure used to construct the tentative FA, M', is a classification tree, as shown in Figure 3.16. The leaves of the classification tree represent the states learnt (known) in c, and the nodes represent the distinguishing strings required to distinguish (discover) the states in c. All the nodes and leaves are each represented by a string, based on the information received from counterexamples and from membership queries on chosen strings.

[Figure: classification tree T1 with root d1 = λ, internal nodes d2 = 0 and d3 = 1, and leaves s1 = λ, s2 = 0, s3 = 1, s4 = 01; distinguishing strings = {λ, 0, 1}; training set t = {+λ, -0, -1, -01}.]

Figure 3.16: Classification tree, T1, with 3 nodes representing 3 distinguishing strings and 4 leaves each representing an equivalence class.

The Myhill-Nerode Theorem is also adopted by L2, that is, maintaining the set of distinguishing strings that distinguishes between the equivalent states represented as leaves in the tree. A leaf can be viewed as an equivalence class x containing a set of (representative) strings having the same behaviour (distinguishability) with respect to c and the set of distinguishing strings. Thus, each node is seen as the distinguishing string separating the children in its right and left subtrees (i.e. the leaves in those subtrees) into accepting and rejecting strings respectively. In Figure 3.16, the node d3, represented by string 1, distinguishes between the leaves 01 and 1 in its right and left subtrees respectively, with respect to the FA that accepts all strings with an even number of 0's and 1's. The next state which a string x reaches with transition symbol a is determined by traversing the tree with the string xa, starting from the root, until a leaf s is reached.
At each node d visited, the path to take next depends on whether the string xad (i.e. xa concatenated with the node's distinguishing string d) is accepted or rejected by c. The right path is taken if xad is accepted by the unknown FA, and the left path otherwise. The leaf s reached is the equivalence class to which x belongs; thus, xa is said to represent the state represented by s. Membership queries are used here to determine which path to take, with xad being the string of the learner's choice. As in Figure 3.16, the string 01, when traversed through the tree, ends up in leaf s4: the string 011 is rejected at node d3 in T1, so the left path is taken to leaf s4. However, in T2 of Figure 3.17, 01 is accepted, as the traversal reaches the right leaf s1 directly from the root d1. Thus, the string 01 is said to represent states s4 and s1 in T1 and T2 respectively, with respect to the FAs being learnt.

[Figure: classification tree T2 with root d1 = λ and leaves s1 = 01, s2 = λ; distinguishing strings = {λ}; training set t = {-λ, +01}.]

Figure 3.17: Classification tree, T2, with one node and two leaves.
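The traversal just described can be sketched as follows. This is an illustrative sketch, not the report's code: `Node` and `sift` are hypothetical names, and `member` stands for a membership query to the unknown FA c.

```python
class Node:
    """A classification-tree node: internal nodes hold a distinguishing
    string, leaves hold an access string (a known state)."""
    def __init__(self, string, left=None, right=None):
        self.string = string
        self.left = left      # subtree of strings rejected with this suffix
        self.right = right    # subtree of strings accepted with this suffix

    def is_leaf(self):
        return self.left is None and self.right is None

def sift(tree, x, member):
    """At each internal node d, take the right path iff x.d is accepted by
    the unknown FA; the leaf reached gives the state that x represents."""
    node = tree
    while not node.is_leaf():
        node = node.right if member(x + node.string) else node.left
    return node.string
```

For T2 of Figure 3.17 (target: all non-empty strings), sifting 01 reaches the right leaf 01, while λ reaches the left leaf λ.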
The classification tree, T, maintains two main elements to represent the properties learnt of c and the information received from the training set. The elements are specified as follows:

1. a set of access strings, S
The initial set contains only one string, the null string λ. The leaves in T are each represented by a string in S; the leaves thus represent the states of the unknown FA discovered so far. The leaves in the left subtree of the root are all s in S that are rejected by c, and the leaves in the right subtree of the root are the strings that are accepted by c. Thus, S is subdivided into two subsets of accepting and rejecting states (i.e. the leaves). From Figure 3.17, S is the set of strings representing the leaves, and the two subsets for T2 are {λ} and {01}, which are the negative and positive sets of examples from t respectively.

2. a suffix-closed set of distinguishing strings, D
The initial D starts with the null string, λ, and is used to distinguish each pair of access strings in S. The strings in D represent the nodes of T. Each node, d, has exactly two children, distinguishing a pair of strings in S such that the right subtree consists of the strings s for which s.d is accepted by the unknown FA, and vice versa. As in Figure 3.17, the root is the distinguishing string for the leaves in its right and left subtrees, where the strings λ and 01 representing the leaves also represent the rejecting and accepting states respectively.

The classification tree is used to build every conjecture M' except the initial conjecture (i.e. the learner's first guess) of the FA, M0. There are only two different initial conjectures to choose from as M0, shown in Figure 3.18(a) and Figure 3.18(b), each of which has a single start state with all transitions to itself. The initial guess depends on a membership query on the null string λ.
Thus, M0 either accepts or rejects the set of all strings, depending on whether λ is accepted by the unknown M; that is, the initial state is an accepting state if λ is accepted by M, and vice versa.

[Figure: (a) M0 with a single accepting start state and self-loops on 0,1, for t = {+λ}; (b) M0 with a single rejecting start state and self-loops on 0,1, for t = {-λ}; each with an incomplete tree T0 containing the unrepresented leaf y.]

Figure 3.18: (a) M0 accepting all strings, as t provides the positive example +λ. (b) M0 accepting the empty set, where t provides the negative example -λ. The corresponding tree T0 is an incomplete tree, to be completed with the counterexample returned after an equivalence query on M0.
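The choice of initial conjecture reduces to a single membership query on λ. A minimal sketch, with illustrative names (the state label `q0` and the function name are assumptions, not from the report):

```python
def initial_conjecture(alphabet, member):
    """Build M0: one state with a self-loop on every symbol. The state is
    accepting (so M0 accepts all strings) iff the null string is in the
    unknown language; otherwise M0 accepts the empty set."""
    q0 = 'q0'                                  # hypothetical state label
    delta = {(q0, a): q0 for a in alphabet}    # all transitions to itself
    accepting = {q0} if member('') else set()
    return [q0], delta, accepting, q0
```

With the target language of all non-empty strings, λ is rejected, so M0 accepts the empty set, as in Figure 3.19(a).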
As in L1, for every conjecture M' produced, an equivalence query on M' is presented by the learner. L2 terminates if no counterexample is returned. Thus, the first guess from the learner is whether all strings are accepted or rejected by the unknown M. The first counterexample provides the remaining unrepresented leaf y in the incomplete T0 for the initial conjectures, as in Figure 3.18, where y is one of the leaves of each tree. Each subsequent counterexample y returned is analysed using the divergence concept (see below) and the current classification tree, T. In Figure 3.19(a), the unknown FA accepts all non-empty strings, so the initial M0 is the DFA accepting the empty set. An equivalence query on M0 returns the counterexample string +01. As λ is rejected at the root, the first classification tree, T0, shown in Figure 3.19(b), has λ as its left child and the counterexample 01 as its right child.

[Figure: (a) the initial conjecture M0, with S = {λ}, D = {λ}; its equivalence query returns the counterexample y = 01, extending S to {λ, 01}; (b) the tree T0 with root λ, left leaf λ and right leaf 01; (c) the conjecture M', whose equivalence query returns 'yes'.]

Figure 3.19: An unknown FA that accepts the set of all non-empty strings is being learnt; (a) the initial conjecture, M0; (b) the classification tree, T0, with 2 leaves from S and a node (the root) from D; (c) the conjecture, M', constructed using the classification tree in (b), with the first leaf, λ, as the initial state and the second leaf, represented by string 01, as the other state in M'. As 01 is a leaf in the right subtree of the root, the final state is represented by the leaf 01.
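The construction of a conjecture from the tree leaves, as in Figure 3.19, can be sketched as follows. This is a self-contained illustrative sketch (all names are assumptions); `member` is the membership oracle for the unknown FA:

```python
class Node:
    """Internal nodes hold a distinguishing string; leaves hold an access string."""
    def __init__(self, string, left=None, right=None):
        self.string, self.left, self.right = string, left, right
    def is_leaf(self):
        return self.left is None and self.right is None

def sift(tree, x, member):
    # Take the right path iff x concatenated with the node's string is accepted.
    node = tree
    while not node.is_leaf():
        node = node.right if member(x + node.string) else node.left
    return node.string

def leaves_of(node):
    if node.is_leaf():
        return [node.string]
    return leaves_of(node.left) + leaves_of(node.right)

def build_conjecture(tree, alphabet, member):
    """States of M' are the leaves; the transition from state s on symbol a
    is the leaf reached by sifting s+a; a state is accepting iff its access
    string is accepted by the unknown FA."""
    states = leaves_of(tree)
    delta = {(s, a): sift(tree, s + a, member) for s in states for a in alphabet}
    accepting = {s for s in states if member(s)}
    return states, delta, accepting
```

For T0 of Figure 3.19(b) and the all-non-empty-strings target, this yields a two-state M' with λ as the (rejecting) initial state, 01 as the accepting state, and every transition leading into 01.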
  31. 31. Chapter 3: Non-Probabilistic Learning 24 In Figure 3.15(b), T0 is then used to build a tentative deterministic FA (DFA), M’ (Figure 3.19(c)), using the leaves λ and 01 to represent the states of M’ where all states in M’ are labelled with the leaves of T0. The next-state transitions for every states in M’ is done by traversing T0 with the string representing each state appended with a transition symbol a. The strings used for the next-state transition are {011,010, 0, 1}. The next equivalence query returns a ‘yes’ which terminates learning, as in Figure 3.15(c). From every counterexample y, each prefix of y is analysed to determine the prefix yi that leads to different states when it is tested on both the current T and the conjecture M’ which returns y in the equivalence query on M’. Both tests will result in a pair of states: a leaf and a state from T and M’ respectively. Since M’ is built by taking all the leaves in T to represent the states in conjecture, then the pair of states from the tests should point to the same state for a string if T and M’ are equivalent. Thus, there must be a node and transition symbol (path) that yi takes leading to first different pair of states. This is called the divergence point. Thus, a counterexample indicates that somewhere along the string, y = y1…yn for n input symbols in y, at one of the prefix, ym, M’ and T diverge into a different path leading to different state. The divergence point is ym-1 (i.e. the immediate prefix before ym where pair of different states sM and sT is obtained from M and T respectively in the test. where ym is the prefix where divergence occurs The current tree, T, is used to trace the common ancestor for both sM and sT, that is, the node d, that distinguishes the leaves represented by sM and sT . Both d and ym-1 are used to update the classification tree. Figure 3.16 shows how the divergence point is found from counterexample y. 
[Figure: the current tree T' and conjecture M' for the even-0's-and-1's FA; the equivalence query on M' returns the counterexample y = 11. For the prefixes of y: at y1 = 1, sM = 01 and sT = 01 agree; at y2 = 11, sM = 01 and sT = λ differ. The divergence point is therefore y1 = 1, and the common ancestor is the node d = 0.]

Figure 3.20: The unknown FA accepts all strings with an even number of 0's and 1's. The conjecture M' is returned with counterexample y = 11 in the equivalence query. Each prefix of y is traversed in T' and also in M' to find the divergence point, y1 = 1, as divergence occurs at y2. The common ancestor of 01 and λ, to which y2 diverged, is the node 0 in T'.
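The prefix-by-prefix divergence analysis can be sketched as follows. This is an illustrative, self-contained sketch (names assumed, not from the report); the conjecture is given as a transition map `delta` with start state `q0`, and the returned index m marks the prefix at which divergence occurs, so the divergence point in the text's terms is the prefix of length m-1:

```python
class Node:
    def __init__(self, string, left=None, right=None):
        self.string, self.left, self.right = string, left, right
    def is_leaf(self):
        return self.left is None and self.right is None

def sift(tree, x, member):
    # State of the tree T reached by x: right path iff x.d is accepted.
    node = tree
    while not node.is_leaf():
        node = node.right if member(x + node.string) else node.left
    return node.string

def run_dfa(delta, q0, x):
    # State of the conjecture M' reached by x.
    q = q0
    for a in x:
        q = delta[(q, a)]
    return q

def divergence_point(y, delta, q0, tree, member):
    """Scan the prefixes of counterexample y; return (m, sM, sT) for the
    first prefix y[:m] on which M' and T disagree about the state reached."""
    for m in range(1, len(y) + 1):
        sM = run_dfa(delta, q0, y[:m])
        sT = sift(tree, y[:m], member)
        if sM != sT:
            return m, sM, sT
    return None
```

With the one-state initial conjecture for the all-non-empty-strings target (its single state labelled λ here) and the tree T0 of Figure 3.19(b), the counterexample 01 already diverges at its first symbol.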
As the learner learns more information and properties of c, the new information and properties (regarding new states) are recorded in the tree by extending its nodes and leaves. These updates are carried out using only the information from the equivalence query, that is, the counterexample. The counterexample is analysed for its divergence point, and the results of the divergence analysis are the common ancestor d and the prefix ym-1 representing the divergence point. The tree is then updated using d and ym-1 as follows:

a) a new access string, ym-1 (i.e. a prefix of y), is added to S: a new state represented by ym-1 has been discovered, and a new leaf is added to T as S is extended. [S is extended to include the prefix of a counterexample representing a newly discovered state of c, shown in Figure 3.21 by the shaded leaf 1.]

b) a new distinguishing string, a.d, is added to D, where:
d: the common ancestor of sT and sM;
a: the input symbol that leads ym-1 to sT and sM (i.e. ym = ym-1.a);
ym = y1…ym: the prefix of y of length m, whose last (mth) symbol is a;
sT and sM: the states reached in T and M' respectively.
[D is extended when the counterexample is returned and a prefix of the counterexample (the one added to S) is identified as the point of divergence. The new distinguishing string is a.d, where d is the common ancestor node and a is the input symbol leading from the divergence point to the two different states reached, shown in Figure 3.21 by the shaded node.]

The extension involves sT being replaced by the new string a.d, forming a new internal node. The leaves sT and ym-1 are the children of the new node, and their positions depend on the acceptance of each of them concatenated with a.d, determined through membership queries. The suffix-closed property of D maintains reachability from the other states to the final state(s) each time a new state (leaf) is discovered (i.e. added to S).
Figure 3.21 shows how the tree T' in Figure 3.20 is updated and used to build a new conjecture M''.

[Figure: the updated tree T'' with the new internal node and new leaf (shaded) added, and the new conjecture M'' built from its leaves.]

Figure 3.21: The new updated tree from Figure 3.20, T'', has a new node and a new leaf (shaded) added once a divergence point is found in the previous counterexample. The new conjecture M'' is put to an equivalence query and no counterexample is returned.
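The tree update, replacing the leaf sT with a new internal node for a.d, can be sketched as follows. This is a self-contained illustrative sketch (names assumed, not from the report); it presumes a.d genuinely distinguishes the two strings, as guaranteed by the divergence analysis:

```python
class Node:
    def __init__(self, string, left=None, right=None):
        self.string, self.left, self.right = string, left, right
    def is_leaf(self):
        return self.left is None and self.right is None

def sift(tree, x, member):
    # Right path iff x concatenated with the node's string is accepted.
    node = tree
    while not node.is_leaf():
        node = node.right if member(x + node.string) else node.left
    return node.string

def split_leaf(node, s_T, new_access, new_dist, member):
    """Find the leaf holding s_T and turn it in place into an internal node
    for the new distinguishing string; its children are s_T and the new
    access string, placed right (accepted) or left (rejected) according to
    the membership of each concatenated with new_dist."""
    if node.is_leaf():
        if node.string == s_T:
            children = {}
            for s in (s_T, new_access):
                # new_dist distinguishes the pair, so one lands on each side
                children[bool(member(s + new_dist))] = Node(s)
            node.string = new_dist
            node.right, node.left = children[True], children[False]
        return
    split_leaf(node.left, s_T, new_access, new_dist, member)
    split_leaf(node.right, s_T, new_access, new_dist, member)
```

For the even-0's-and-1's target, splitting the leaf 01 of the initial tree with new access string 0 and new distinguishing string 0 reproduces the T0-to-T1 step of the running example: afterwards, sifting 0 and 01 reaches distinct leaves.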
We illustrate another example of L2 learning the FA that accepts the strings with an even number of 0's and 1's in Figure 3.22 below. The divergence point found for each counterexample is denoted by ym.

[Figure: table of conjectures, equivalence-query answers and classification trees. M0: answer 'no', counterexample y = 01, giving S = {λ, 01}, D = {λ} and tree T0. M1: answer 'no', y = 00, ym = 00, giving S = {λ, 01, 0}, D = {λ, 0} and tree T1. M2: answer 'no', y = 11, ym = 11, giving S = {λ, 01, 0, 1}, D = {λ, 0, 1} and tree T2. M3: answer 'yes'.]

Figure 3.22: Running example of learning the unknown FA that accepts the set of strings with an even number of 0's and 1's.

Therefore, the access strings in S are prefixes of counterexamples, and the size of S equals the number of counterexamples returned (i.e. the number of equivalence queries made, less the final one). L2 maintains S so that each string represents a distinct state of the minimum DFA for M; the size of S is at most the size of the minimum DFA for M at any point during the learning process. Hence, each counterexample produces a new access string, which immediately yields a new conjecture with a newly discovered state.
