A Survey of Learning Methods in Learning Finite Automata (FA)
Submitted By: Priscilla Chia
Supervised By: Dr. Suresh K. Manandhar
Final Year Project Report 1998
Computer Science Department
University Of York
Abstract
This report is a survey of learning methods used in learning Finite Automata (FA). The
learning issues in machine learning are highlighted, and the methods surveyed are
analysed according to how these issues are dealt with. The report also looks at how
additional information is learnt from the information given by the teacher. We
survey six algorithms with respect to the learning methods employed in the two stages
of the learning process: building a hypothesis and evaluating the hypothesis. The
methods are categorised into probabilistic and non-probabilistic. We conclude with a
discussion of the ability of a hypothesis to rectify errors from past experience
rather than only learn from new experience.
Acknowledgement
I am very grateful to my supervisor, Dr. Suresh Manandhar, for his invaluable help
and advice throughout this project. I would also like to thank my friends and family,
especially mum and dad at home, for their support.
CONTENTS
1. INTRODUCTION...........................................................................................1
2. LEARNING....................................................................................................2
2.1 Learning in General...........................................................................................................................2
2.2 The Issues in Learning.......................................................................................................................5
2.3 Learning Finite Automata (FA)........................................................................................................6
2.4 Learning Framework.........................................................................................................................7
2.4.1 IDENTIFICATION IN THE LIMIT....................................................................................................7
2.4.2 PAC VIEW............................................................................................................................8
2.4.3 COMPARISON.........................................................................................................................8
2.4.4 OTHER VARIATIONS OF LEARNING FRAMEWORK............................................................................9
2.5 Results on learning finite automata................................................................................................10
3. NON-PROBABILISTIC LEARNING FOR FA.............................................11
3.1 Learning with queries......................................................................................................................13
3.1.1 L1: BY DANA ANGLUIN [ANGLUIN 87]..................................................................................14
3.1.2 L2: BY KEARNS AND VAZIRANI [KEARNS ET AL 94]................................................................21
3.1.3 DISCUSSION.........................................................................................................................27
3.2 Learning without queries................................................................................................................29
3.2.1 L3: BY PORAT AND FELDMAN [PORAT AND FELDMAN 91]........................................................30
3.2.2 RUNNING L3 ON WORKED EXAMPLES.......................................................................................35
3.2.3 DISCUSSION.........................................................................................................................39
3.3 Homing Sequences in Learning FA................................................................................................42
3.3.1 HOMING SEQUENCE...............................................................................................................43
3.3.2 L4: NO-RESET LEARNING USING HOMING SEQUENCES..................................................................43
3.3.3 DISCUSSION.........................................................................................................................46
3.4 Summary (Motivation forward).....................................................................................................47
4. PROBABILISTIC LEARNING.....................................................................50
4.1 PAC learning using membership queries only..............................................................................50
4.1.1 L5: A VARIATION OF THE ALGORITHM L1 [ANGLUIN 87; NATARAJAN 90]...................................50
4.1.2 DISCUSSION.........................................................................................................................51
4.2 Learning through model merging..................................................................................................54
4.2.1 HIDDEN MARKOV MODEL (HMM).......................................................................................54
4.2.2 LEARNING FA: REVISITED....................................................................................................55
4.2.3 BAYESIAN MODEL MERGING.................................................................................................57
4.2.4 L6: BY STOLCKE AND OMOHUNDRO [STOLCKE ET AL 94]..........................................................60
4.2.5 RUNNING OF L6: ON WORKED EXAMPLES.................................................................................64
4.2.6 DISCUSSION.........................................................................................................................67
4.3 Summary...........................................................................................................................68
4.4 Chapter Appendix............................................................................................................................70
5. CONCLUSION AND RELATED WORK.....................................................71
REFERENCES...............................................................................................74
Introduction 1
1. Introduction
The class of finite state automata (FA) is studied from a machine learning perspective, which
involves both the learning issues and the properties of that particular class. This report is a survey
of the learning methods studied and employed in learning FA. We give an overview of learning
in general in Section 2.1 and of the issues in learning in Section 2.2, with their application to
learning FA in Section 2.3. The two important frameworks employed extensively in machine
learning, learning in the limit and PAC learning, are explained in Section 2.4. The
complexity of learning FA itself has been studied, and the results are given in Section 2.5.
This report, which concerns the learning methods employed, is divided into two main
chapters in which various learning algorithms are studied and compared. The usual non-probabilistic
learning is discussed in Chapter 3, with the motivation for probabilistic
learning given in Section 3.4, before probabilistic learning is discussed in Chapter 4.
The results of the survey are given in each chapter, and the conclusion, with related work
in machine learning, is in Chapter 5. Six algorithms are discussed, each referred to as
L1-L6 throughout this report, as follows:
• L1: [Angluin 87]
• L2: [Kearns et al 94]
• L3: [Porat and Feldman 91]
• L4: [Rivest et al 87]
• L5: [Angluin 87; Natarajan 91]
• L6: [Stolcke et al 94]
We follow the standard definition of FA as studied in automata theory [Hopcroft et al 79;
Trakhtenbrot 73] and give the following terminology and notation, used for any FA M:
set of states, Q : the finite set of states q in the FA
final state : the state reached after M has read an input string
initial state, q0 : the start state for all input strings
accepting state : a final state that accepts a string, i.e. the string is recognised by M
rejecting state : a final state that rejects a string, i.e. the string is not recognised by M
transition, δ(q,a) : the path from a state q on a symbol a from the alphabet set
alphabet set, A : the finite set of symbols; here the binary set {0,1} is used
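The terminology above can be made concrete with a small sketch. The following Python class is illustrative only (the names DFA, run and the even-ones example are not from the report); it assumes a deterministic FA over the binary alphabet A = {0,1}:

```python
class DFA:
    """A minimal deterministic finite automaton over a finite alphabet."""

    def __init__(self, states, delta, q0, accepting):
        self.states = states        # Q: the finite set of states
        self.delta = delta          # transition function delta(q, a) -> q'
        self.q0 = q0                # initial state for every input string
        self.accepting = accepting  # the accepting subset of the final states

    def run(self, string):
        """Return True if the FA accepts the string, False otherwise."""
        q = self.q0
        for a in string:
            q = self.delta[(q, a)]  # follow the transition delta(q, a)
        return q in self.accepting  # the final state classifies the string


# Example: the DFA accepting binary strings with an even number of 1s.
even_ones = DFA(
    states={"even", "odd"},
    delta={("even", "0"): "even", ("even", "1"): "odd",
           ("odd", "0"): "odd", ("odd", "1"): "even"},
    q0="even",
    accepting={"even"},
)
```

Here `even_ones.run("11")` accepts while `even_ones.run("1101")` rejects, since the latter contains three 1s.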
Chapter 2: Learning 2
2. Learning
2.1 Learning in General
Learning in general means the ability to carry out a task with improvement from previous
experience. It involves a teacher and a learner. The learning process usually takes place in an
environment which constrains both the communication between the learner and the teacher
(how the teacher is to teach or train the learner, and how the learner is to receive input from
the teacher) and the elements or tokens of information that are communicated between them:
a class of objects (i.e. a concept) and a description of a subclass (i.e. an object).
Example 1(a): Environment for learning a class of vehicles
The environment in which the learning process takes place involves a teacher
giving descriptions of ships and a learner drawing a conclusion, from the
descriptions received, about what a ship looks like. The teacher describes
ships (i.e. the subclass of vehicles) by providing pictures (i.e. using
pictorial means) of ships, and the learner responds (i.e. communicates with
the teacher) through some form of visual mechanism (i.e. by detecting the
shapes or colours of objects in the pictures) to analyse the pictures received
from the teacher. This environment only allows the teacher and learner to
communicate through pictures, whereas in another environment other forms of
encoding of descriptions may be used (e.g. tables of attributes: width,
length, windows, engine capacity etc.)
[Figure 2.1 appears here in the original.]
Figure 2.1: Finite representation of a possibly infinite class A of m elements, cs for 1 ≤ s ≤ m,
where m may be finite or infinite, by another finite class B with p elements, ri for 1 ≤ i ≤ p,
where p is some finite number.
The learner is to learn an unknown subclass from the class with help (i.e. some
form of training) from the teacher, who provides descriptions of the unknown subclass. Since
the subclass to be learnt may be infinite in size, a finite representation is needed to represent
the probable subclasses hypothesised during learning. The task of the learner is to hypothesise
a (finite) representation of the unknown subclass, as shown in Figure 2.1, where a class A of a
possibly infinite set is represented by a finite class B. Thus learning the class A is to learn its
class B representation.
In Example 1(a) above, the learner is to produce a hypothesis of the ship subclass. A
finite representation for the hypothesis of the unknown subclass is necessary because not
every subclass can be finitely presented or described by the teacher (i.e. by presenting all
elements of the unknown subclass to the learner), as shown in the class of vehicles above,
where the subclass (i.e. ships) is infinite. Note that there are finitely many different 'types'
of ships, just as there are finitely many different 'types' of vehicles: in both cases the ships
and vehicles are infinite in number but the 'types' are finite.
Instances from the class used to describe a particular subclass are called examples.
These examples are usually classified by a teacher (i.e. a human or some operation or
program available in the environment) with respect to the particular subclass being learnt as
positive (a member of the subclass) or negative (non-member of the subclass) examples. A set
of classified or labelled examples is called the example space.
Only a subset of the example space is used by the learner each time an unknown subclass
is to be learnt. This subset, used in training the learner, is known as the training set. Each
example space contains information (implicit properties or rules that may be infinite) relevant
to distinguishing one subclass from another in the given class. The constraints in the
environment also determine which types of examples (i.e. positive only, negative only or both)
can be provided by the teacher to form the example space. For instance, it may not be possible
to collect negative examples, so the teacher is restricted to positive examples only, which may
be only a partial set (i.e. not all members of the unknown subclass are known, even to the
teacher).
The learner, or learning algorithm, is therefore required to learn the implicit properties
or rules from the information given (built into what is called experience) about a particular
subclass. The properties learnt are stored in the learner's hypothesis (i.e. the conclusion or
explanation) drawn about the subclass.
An infinite number of hypotheses, in any form of representation (i.e. decision trees,
propositional logic expressions, finite automata etc.), could be produced that hold the properties
obtained from the information received. This results in searching a large hypothesis space. It
should be noted that the hypothesis space could be expressed in the same descriptive
language used to describe the unknown subclass: in Example 1(a), if the class of vehicles is
represented in the form of propositional logic expressions, then the hypothesis may be the
exact propositional logic expression that represents the unknown subclass chosen (i.e. ships),
or some other form of representation that is equivalent to the propositional logic
expressions used.
A set of criteria is necessary to limit (reduce) the size of the search space. Given a
reduced hypothesis space that satisfies the set of criteria, a learning goal is then needed for
selecting and justifying a hypothesis from the hypothesis space as the finite representation of
the unknown subclass. Together with other knowledge about the rules for manipulating the
descriptive language, the set of criteria and the learning goal form what is called the
background knowledge that guides the learner in the learning process.
Example 1(b): Learning process of Example 1(a)
Suppose that the hypothesis for ships takes the form of a collection of a
finite number of attributes of a ship (i.e. size, engine capacity, shape,
weight, anchor and other properties). The criteria for the hypothesis space
could admit hypotheses that fulfil, say, five out of the six attributes used,
and the learning goal could be to select a hypothesis that satisfies the
criteria with the simplest data structure of some form and that correctly
identifies, say, the next ten examples. There could be an infinite number of
attributes, but the criteria in the background knowledge reduce the
hypothesis space.
Thus, the learning scenario (Figure 2.2) consists of a given class, C, of subclasses
and an example space, T, from which the training set, t, is drawn. Examples in T are used to
describe an unknown subclass, c, in C. The aim of the learning algorithm, L, is to produce a
hypothesis, h, from a hypothesis space, H, using the information from t and satisfying the
conditions set out in the background knowledge. L is to build an h that is equivalent to c.
Ideally, h is exactly the same as c, i.e. h is the exact representation of c. Due to the
incompleteness of the t received (the teacher usually does not have complete information
regarding c), h is usually taken to be equivalent to c only to some extent expressed in the
background knowledge. In both cases, learning relies on the information contained in t and
given by the teacher.
L : T → H    where T : the set of training sets t for a subclass, c, in C; also called the example space
L(t) = h (≡R c)    where t ∈ T, h ∈ H, c ∈ C
≡R : the equivalence relation specified by the learning goal, used in selecting the
hypothesis, h, using t. The selected h contains the learnt properties or rules of c
obtained through the information from t.
[Figure 2.2 appears here in the original: the environment contains the class C (with the
teacher and the unknown subclass c), the example space T with training set t, the learner L,
and the hypothesis space H with hypothesis h, guided by the background knowledge: the set
of criteria, the learning goal and the type of representation (the descriptive language for H).]
Figure 2.2: The learning scenario of a learner or learning algorithm, L, within a given
environment.
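The mapping L : T → H above can be caricatured in code. The sketch below is illustrative, not an algorithm from the report: the names `learn`, `criteria` and `goal` are assumptions, and the hypothesis space is given as an explicit finite list, which real learners cannot assume.

```python
def learn(training_set, hypothesis_space, criteria, goal):
    """Return the first hypothesis that passes the criteria and the goal.

    training_set:     labelled examples (x, y) describing the unknown subclass c.
    hypothesis_space: candidate hypotheses h (here, predicates on examples).
    criteria:         predicate reducing the search space (background knowledge).
    goal:             the learning goal judging h equivalent to c on the data seen.
    """
    candidates = (h for h in hypothesis_space if criteria(h))
    for h in candidates:
        if goal(h, training_set):
            return h
    return None  # no hypothesis in the reduced space fits t


# Toy usage: the goal is consistency with every label in t.
t = [("00", True), ("01", False)]
H = [lambda s: True,                 # "accept everything"
     lambda s: s.count("1") == 0]    # "accept strings with no 1s"
consistent = lambda h, data: all(h(x) == y for x, y in data)
h = learn(t, H, criteria=lambda h: True, goal=consistent)
```

With this training set the first candidate misclassifies "01", so `learn` returns the second hypothesis.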
2.2 The Issues in Learning
The algorithms used in learning are 'ways' of achieving the learning goal under the set of
criteria in the background knowledge. 'Ways' here are methods of constructing a hypothesis
from the information in the set of examples. As shown in Figure 2.2, the learning algorithm, L,
has two distinct phases in the learning process:
Phase 1: forming a hypothesis, h, from the set of examples, t.
[shown as the arrow from T to H in Figure 2.2]
Phase 2: selecting and justifying h as a finite representation of the unknown subclass, c.
[shown as the arrow from H to C in Figure 2.2]
The nature (or design) of L, and the feasibility of the learning problem itself is determined by
the following factors:
1. Example space, T
T is usually considered arbitrary, in that various kinds of information (training sets)
can be used to describe c.
2. Classification of the training set, t, usually by a teacher or an operation carried out
in the environment with respect to a particular subclass, c
• Noisy examples are considered where the teacher may classify instances
wrongly
• The type of examples to be presented (i.e. positive only, negative only or both)
3. Presentation of t to L
Whether elements from t are fed into L one by one, in small groups or as a whole
batch, and whether the elements are presented in any particular order (i.e. in
lexicographic order or shortest length first)
4. The size of t
Intuitively, an efficient and 'intelligent' learner or learning algorithm needs only a
small t. In machine learning, the size of t contributes to the computational
complexity of a learning algorithm: the larger t is, the longer (or more complicated)
the computation.
5. The choice of representation for the hypothesis space, H
This involves the issues of how much information should be, and can be, captured by
a particular choice of representation. A rich descriptive language, though ideally
required, means more complex computation and larger resource requirements (i.e.
memory storage), whereas a simple form of representation may not capture
sufficient information for learning.
6. Selection criteria of a hypothesis, h, and justification as an equivalence of c.
All of the above factors except the last constitute a major part of the design of an
algorithm in machine learning, exhibited in Phase 1 of L. The last factor, together with the
choice of representation for H, is usually vital in Phase 2 of the learning process, where
evaluation is carried out by human experts or by some known mechanism such as statistical
confirmation or analysis.
The learner, L, is said to be able to learn a class in the given environment if it can
learn (i.e. by producing a hypothesis that satisfies both the criteria and learning goal in the
background knowledge provided a priori) any subclass chosen from the class.
2.3 Learning Finite Automata (FA)
This report investigates the learning process in a particular environment setting (Figure 2.3):
- Teacher: the source of the example space, T, where the description of the unknown
subclass, c, takes the form of labelled strings.
- Learner: learns by receiving information in the form of labelled strings drawn from T,
following the rules set out in the environment constraints.
- c: the unknown regular language or FA
Two similar environments for learning are shown in Figure 2.3, differing in the class
contents. The first environment (Figure 2.3(a)) consists of:
- C1: the class of all languages,
- H1: the hypothesis space of finite automata (FAs), the finite representation for
regular languages (i.e. a subclass of C1),
- T: the examples, which are labelled strings from the languages,
- Criteria: the FA must be consistent with all examples (i.e. strings, which may or may
not be only positive strings) received from the training set, t,
- Goal: to produce an FA (i.e. the selected hypothesis) that is equivalent to (i.e. that
accepts) c.
The other learning environment (Figure 2.3(b)) is obtained by refining C1 to the class
of regular languages only, with minimum deterministic finite automata (DFAs) as the
hypothesis space, H2. The environment in Figure 2.3(b) has more constraints, as the teacher
must provide descriptions using only regular languages, whereas with C1 the teacher is able
to provide descriptions using other languages as well (i.e. context-free languages).
This report concerns the learnability of finite automata (FA) using minimum DFAs as
the hypothesis space. Both environments, with C1 and C2 as classes, use the same set of
examples, T, which is a set of strings; the training set, t, is a set of classified strings with
respect to a particular subclass of languages, c. For consistency, throughout the report, the
alphabet, A, for FAs will be set to the binary set {0,1}.
[Figure 2.3 appears here in the original: (a) the class C1 of all languages with subclass c' and
hypothesis space H1, the class of FAs; (b) the class C2 of regular languages (equivalent to
minimum DFAs) with subclass c and hypothesis space H2, the class of minimum DFAs.]
Figure 2.3: (a) c' is the subclass of regular languages, c, and H1 is the class of FAs, with the
criteria for H1 being deterministic and minimal in size (number of states). (b) c is a particular
subset of the regular languages and H2 is the class of minimum DFAs itself, where no criteria
are needed.
2.4 Learning Framework
Given an environment with a class of objects which describes 'what is to be learnt', the two
phases, Phase 1 and Phase 2, of the learning process raise two fundamental questions:
- 'how do we learn?'
- 'when do we know we have learnt?'
The former is dealt with in Phase 1, and the latter, in Phase 2, was studied by Gold [Gold
67] and Valiant [Valiant 84], resulting in two major learning frameworks: identification in
the limit, by Gold, and probably approximately correct (PAC) learning, by Valiant.
2.4.1 Identification in the limit
[Gold 67] states that learning should be a continuous process, with the learner (or learning algorithm), L, having the possibility of changing or refining its guess (i.e. hypothesis) each time new information from the training set, t, is presented. The learner, L, is only required to have all its guesses after some finite time be the same and correct with respect to the information seen so far. Hence, the hypothesis, h, obtained after a finite time will remain the same and correct with subsequent information. The hypothesis, h, is then said to represent the unknown sub-class, c, described by t in the limit, completing Phase 2 of the learning process. This learning framework, identification in the limit, consists of three items as formulated by Gold:
1. A class of objects
A class, C, is specified (or given) to learner in the environment where the form of
communication between the teacher and learner is also specified. An object, c, from
C will be chosen for the learner to identify.
[In the context of this report, the unknown object (or sub-class), c, is an FA and the
class C consists of FAs].
2. A method of information presentation
Information about the unknown chosen object is presented to the learner. The training set, t, consists of positive-only, positive and negative, or noisy examples as information describing c.
[t is just a set of labelled strings drawn from the example space, T, provided by the teacher, and the type of t depends on T – all positive strings, all negative strings or a combination of both.]
3. A naming relation
This enables the learner to identify the unknown object, c, by specifying its name¹, h. There is a function, f, for L to map the names to the objects in C. Here, an object, c, can have several names (hypotheses), and guesses (or hypotheses) are made under f.
[L is to build an FA as the hypothesis, h, for an unknown regular language, and h could be any of the several DFAs (or TMs) that accept the unknown regular language.]
¹ In [Gold 67], a name is defined as a Turing Machine (TM). Since the language identified by an FA is also identifiable by a TM, it is sufficient to say that every FA has a corresponding TM.
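As a toy illustration of identification in the limit (not from the report): over a hypothetical finite class of candidate languages, a learner that always keeps the first candidate consistent with the examples seen so far changes its guess only finitely often and then stabilises on a correct name for the unknown language. The candidate class and the example stream below are assumptions chosen purely for illustration.

```python
# Hypothetical finite class of languages over {0,1}, each given as a
# membership predicate (an assumption for this sketch, not from the report).
HYPOTHESES = [
    ("empty", lambda s: False),
    ("all strings", lambda s: True),
    ("even number of 0s", lambda s: s.count("0") % 2 == 0),
    ("non-empty", lambda s: len(s) > 0),
]

def learner(stream):
    """Yield the learner's guess (a name in HYPOTHESES) after each example:
    the first hypothesis consistent with everything seen so far."""
    seen = []
    for string, label in stream:
        seen.append((string, label))
        for name, predicate in HYPOTHESES:
            if all(predicate(x) == y for x, y in seen):
                yield name
                break

# Training stream for the unknown language "even number of 0s".
stream = [("", True), ("0", False), ("1", True), ("00", True), ("010", True)]
guesses = list(learner(stream))
# The guesses stabilise after the second example and never change again.
```

After the first example the learner guesses "all strings"; the second example refutes that guess, and from then on every guess is "even number of 0s", illustrating convergence in the limit.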
2.4.2 PAC view
Another learning framework is Probably Approximately Correct (PAC) learning, first proposed by Valiant [Valiant 84], which uses a stochastic setting in the learning process. The learner is required to build an (approximately correct) hypothesis that has a minimal error probability after being trained using the training set, t, constituting Phase 1 of the learning process. Phase 2 under this framework requires the learner to have a high level of confidence that the hypothesis, h, is approximately correct as a representation of the sub-class, c. The training set, t, is considered 'good enough' with a high confidence level. This is appropriate because t generally does not consist of all the positive examples needed to learn c.
The PAC framework relies on two parameters: accuracy (ε) and confidence limit (δ). A fixed but unknown distribution is applied over the example space, T, from which training sets, t, are drawn at random. Intuitively, PAC learning seems like a passive type of learning, with the learner learning only through observation of given data or information. However, [Angluin 88] and [Natarajan 91] showed that a PAC learner can be made active using queries – equivalence, membership, subset, superset, exhaustiveness and disjointness queries [Angluin 87].
Given a real number, δ, between 0 and 1 and a real number, ε, also between 0 and 1, there is a minimum sample size (i.e. the size of the training set, t) such that for any unknown sub-class, c, with a fixed but unknown distribution on the example space, T:
with probability at least (1 − δ), the hypothesis h will misclassify at most a fraction ε of the test set, where the test set is another subset of T, different from t, used to test the validity of h.
PAC learning settles for a good approximation to c because, in most cases, it is computationally difficult to build an accurate (exact) hypothesis; moreover, [Angluin 88] and [Natarajan 91] have shown that PAC learning can easily be applied to other non-stochastic learning frameworks.
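The roles of ε and δ can be illustrated with the standard sufficient sample size for a finite hypothesis class (a textbook bound, not one stated in the report; the numbers below are hypothetical):

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """Sufficient number of examples m for a finite hypothesis class H:
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) guarantees that, with
    probability at least 1 - delta, any h in H consistent with m random
    examples has error at most epsilon (standard Occam-style bound)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: a (hypothetical) class of 2**10 hypotheses, 10% accuracy,
# 95% confidence.
m = pac_sample_size(h_size=2**10, epsilon=0.1, delta=0.05)
# m == 100
```

Tightening either parameter (smaller ε or smaller δ) increases the required sample size, which matches the intuition that t must be 'good enough' with high confidence.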
2.4.3 Comparison
Both frameworks have distinct criteria and goals for learning which deal with Phase 2 of the learning process (Table 2.1). However, they both suggest learning by building tentative hypotheses from pieces of information in the form of strings from the training set, t (Figure 2.4). Each tentative hypothesis is a new 'experience' (i.e. a modified hypothesis with slight changes, or a totally new hypothesis) as new information is received from t. The final hypothesis, h', taken to represent the sub-class, c, may be totally different from previous hypotheses.
Identification in the limit
- Goal: the same hypothesis (or guess) after a finite time for all subsequent information received.
- Criteria: the hypothesis (guess) made must be consistent (correct) with the information seen so far.
PAC learning
- Goal: P(error(h, c) ≤ ε) ≥ 1 − δ on a sufficiently large sample, t, where P is the probability function, h the hypothesis, c the unknown sub-class, and δ and ε the parameters needed.
- Criteria: the hypothesis, h, has error at most ε with respect to T.
Table 2.1: Comparison between the identification in the limit and PAC learning frameworks.
[Figure 2.4 diagram: the teacher draws c from C and presents labelled examples t = {t1, t2, t3, ...} from T; the learner L produces tentative hypotheses h1, h2, h3, ... in H, in an environment with oracles.]
Figure 2.4: A learning scenario with the learning algorithm, L, making several tentative hypotheses (i.e. h1, h2, h3) in H from a sequence of labelled examples (i.e. t1, t2, t3).
Recent studies [Kearns et al 94; Rivest et al 88; Porat 91] are done under Gold's proposed learning framework, as it is more natural to human learning. We can always change our perception (hypothesis) each time new information is received while still remaining consistent with the previous information. We never know (or predict) when we finish learning (which is a perpetual process in humans).
2.4.4 Other variations of learning framework
There are two other learning frameworks mentioned by Gold in [Gold 67]:
1. Finite Identification
The learner stops the presentation of information after a finite number of examples and identifies the sub-class, c. The learner is to know when it has acquired a sufficient number of examples and is therefore able to identify c.
2. Fixed-time Identification
A fixed finite time² is specified a priori (i.e. usually as background knowledge) and independently of the unknown object presented, whereupon the learner stops learning and identifies the unknown object.
These two frameworks seem to ask too much of the learner, who is 'forced' to identify the sub-class, c, by outputting a hypothesis, h, once some predicted factor or condition is achieved. In finite identification, the learner must be able to predict the number of examples needed to learn and stop learning once that number of examples has been presented. On the other hand, fixed-time identification requires the learner to know in some way 'when' it is able to stop learning.
Learning, as mentioned earlier, is to identify or distinguish the ambiguous lines separating each sub-class in a learning environment. Being able to tell exactly when to stop learning (i.e. being able to predict those lines) means that there was no need for learning to start in the first place.
² Time is taken, throughout the report, to correspond to the computational complexity and the termination of a successful learning algorithm.
2.5 Results on learning finite automata
The complexity and learnability of finite automaton identification have received extensive research [Gold 67; Angluin 87; Vazirani et al 88]. The computational complexity is considered here with respect to the size of the hypothesis space (minimum DFAs) searched and the size of the training set (examples) required.
Other complexity results that deal with computational efficiency are as follows:
1. Identification in the limit and the learnability model [Gold 67] – Gold classifies the classes of languages that are learnable in the limit under three categories of information presentation (Table 2.2). Learning from positive-only examples is shown to be insufficient for any superfinite class.
2. Inferring a consistent DFA or NFA whose size is within a factor (1+1/8) of the minimum consistent DFA is NP-complete, given positive and negative examples [Li et al 87].
3. There is an efficient learning algorithm to find the minimum DFA consistent with given positive and negative data and access to membership and equivalence queries [Angluin 87], using an observation table as the representation of the FA.
4. Learning FA by experimentation³ [Kearns et al 94], using a classification tree as the representation of the FA, in polynomial time.
5. State characterisation and data matrix agreement are introduced for the problem of automaton identification [Gold 78].
6. Inferring minimum DFAs and regular sets from positive and negative examples only is NP-complete [Gold 67, 78; Angluin 78].
Learnability model and class of languages learnable:
- Anomalous text: recursively enumerable; recursive.
- Informant (using positive and negative examples/instances): primitive recursive; context sensitive; context free; regular; superfinite.
- Text (using positive-only examples/instances): finite cardinality.
Table 2.2: Learnability and non-learnability of languages [Gold 67], where a superfinite class contains all the finite languages and at least one infinite regular language.
These results show that inferring a DFA directly from examples alone is NP-hard, and so other learning methods are employed to learn FA successfully. The methods used in successfully learning FA are surveyed in the following chapters.
³ Experimentation – a form of learning where the learner is able to experiment with chosen strings (i.e. selected by the learner and not taken from the training set provided) during training.
19. Chapter 3: Non-Probabilistic Learning 11
3. Non-Probabilistic Learning for FA
In building a hypothesis, h, for an unknown FA, c, the learning algorithm, L, usually receives information (i.e. labelled strings) describing c from a training set, t. L is to build an h that is equivalent to c given the information it has received so far. Ideally, h is exactly the same as c. In practice, however, as c is unknown, the teacher usually may not have the complete information required to build the exact FA, and h is then taken to be an approximation to c to an extent specified in the background knowledge (i.e. approximately equivalent to c, or a probably approximately correct h, rather than the usual exact h).
Learning relies on L to make several guesses, based on the information provided by the teacher, in the following 'ways' discussed in this chapter:
a) learning with queries, section 3.1
b) learning without queries, section 3.2
c) learning with homing sequences, section 3.3
L makes guesses about c through a number of tentative hypotheses (i.e. tentative FAs), M', built from the information received. Each guess is a refinement or modification of the previous guess (hypothesis) in which new properties of the FA (i.e. the characteristics and elements of the FA) are discovered. A guess made by L is also called a conjecture. The learner will produce several conjectures until the learning goal is achieved, that is, until a final conjecture is accepted as the FA equivalent to c.
All information received and properties learnt through the modifications are kept in a data structure. A modification to the data structure is called an update, and a new hypothesis is built based on the updated data structure. Hence, the data structure has several roles:
a) a representation of properties (to be learnt) of an FA :
• the finite number of states
• transitions (representing the transition function)
• the set of distinguishing strings
• the accepting and rejecting states
b) a record of modifications made (i.e. updates)
• incorporating more information received: strings in t
• updating more properties learnt
c) a reference to build next tentative FA, M’, after each update
The data structures used by the learner in this chapter are briefly explained below; detailed explanations of the updates are given in the sections indicated in brackets:
1. observation table (see section 3.1.1)
A two-dimensional table, shown in Figure 3.5, where the rows correspond to the states and the columns correspond to the set of distinguishing strings of the FA. The entries in the table are values of '0' and '1', corresponding to the transition function of the FA leading to a rejecting or an accepting state respectively.
[Figure 3.5 table:]
        e1  e2  ...
  s1     1   0
  s2     0   0
  s3     :   :
Distinguishing strings = {e1, e2, ...}; States = {s1, s2, ...}; transition function δ(q,x): entry 0 means qx is a rejecting state, entry 1 means qx is an accepting state, for some string x from state q.
Figure 3.5: Observation table representing the elements of an FA: states (rows), distinguishing strings (columns) and transition function (table entries).
2. classification tree (see section 3.1.2)
A binary classification tree, shown in Figure 3.6, where the leaves correspond to the states of the FA and the distinguishing strings are represented by the internal nodes (and root) of the tree. The left and right paths from an internal node correspond to the transition function of the FA leading to a rejecting and an accepting state respectively.
[Figure 3.6 tree: root d1 with internal nodes d2, d3 and leaves s1, s2, ...; States = {s1, s2, s3, ...}; Distinguishing strings = {d1, d2, ...}; transition function δ(q,x): left path means qx is a rejecting state, right path means qx is an accepting state, for some string x from state q.]
Figure 3.6: Classification tree representing the elements of an FA: states (leaves), distinguishing strings (internal nodes including the root) and transition function (the left and right paths).
3. minword(q) (see section 3.2.1)
A string used to reach a state q of an FA from the initial state q0. Thus, the set of minword(q) corresponds to the states of an FA, as shown in Figure 3.7.
[Figure 3.7 diagrams: (a) a four-state FA q0-q3 with minword(q0) = λ, minword(q1) = 0, minword(q2) = 1, minword(q3) = 01; (b) a two-state FA with minword(q0) = λ, minword(q1) = 0.]
Figure 3.7: The sets of minword(q) for two FAs. (a) Four minword(q) representing the states of the FA that accepts all strings with an even number of 0's and 1's. (b) Two minword(q) representing the states of the FA that accepts all non-empty strings.
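The minword idea above can be sketched as a breadth-first search over the transition function: the first string reaching a state in breadth-first order is a shortest access string for it. This is an illustrative sketch, not code from the report; the DFA encoding (a dict mapping (state, symbol) pairs to states) is an assumption of mine.

```python
from collections import deque

def minwords(delta, start):
    """Return {state: minword(state)} for every state reachable from start,
    exploring transitions in breadth-first order over the alphabet {0,1}."""
    words = {start: ""}                 # minword(start) = the null string λ
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for a in "01":
            nxt = delta[(q, a)]
            if nxt not in words:        # first (hence shortest) visit wins
                words[nxt] = words[q] + a
                queue.append(nxt)
    return words

# Figure 3.7(b): the two-state FA accepting all non-empty strings.
delta_b = {("q0", "0"): "q1", ("q0", "1"): "q1",
           ("q1", "0"): "q1", ("q1", "1"): "q1"}

# Figure 3.7(a): the four-state FA accepting strings with an even number
# of 0's and 1's (q0 = even/even, q1 = odd 0's, q2 = odd 1's, q3 = both odd).
delta_a = {("q0", "0"): "q1", ("q0", "1"): "q2",
           ("q1", "0"): "q0", ("q1", "1"): "q3",
           ("q2", "0"): "q3", ("q2", "1"): "q0",
           ("q3", "0"): "q2", ("q3", "1"): "q1"}
# minwords(delta_b, "q0") -> {"q0": "", "q1": "0"}
# minwords(delta_a, "q0") -> {"q0": "", "q1": "0", "q2": "1", "q3": "01"}
```

The computed sets match the minword(q) values listed in Figure 3.7.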
3.1 Learning with queries
Additional information regarding the unknown c can be requested by L by asking queries [Angluin 88]. The queries can be equivalence, membership, subset, superset, disjointness and exhaustiveness queries. Two of the six queries are used in the following two algorithms, L1 and L2 (see sections 3.1.1 and 3.1.2), in learning c:
1. Membership queries
The teacher returns a yes/no answer when the learner presents an input string, x,
of its choice in the query, depending upon whether x is accepted by the unknown
FA, c.
2. Equivalence queries
The teacher returns a yes answer if the conjecture, M', is equivalent to c; otherwise it returns a counterexample, y, which is a string in the symmetric difference of M' and c.
Hence, L has access to some oracle (which could be the teacher or some operation available in the environment), creating an active interaction between the learner and teacher in the learning process (Figure 3.8). The two queries act as a pair of oracles, each used in a separate stage of learning:
a) Phase 1 of learning: updating the data structure used to construct the conjecture, M'
b) Phase 2 of learning: confirming M' as a finite representation of c (i.e. deciding when to stop learning)
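The two oracles can be sketched as follows. This is an illustrative sketch with an assumed DFA encoding (a (delta, start, accepting) triple); in particular, the equivalence oracle here naively checks all strings up to a fixed length, whereas a real teacher could compare the automata exactly.

```python
from itertools import product

def run(dfa, string):
    """Run a DFA, encoded as (delta, start, accepting), on a string."""
    delta, start, accepting = dfa
    q = start
    for a in string:
        q = delta[(q, a)]
    return q in accepting

def membership_query(target, string):
    """Yes/no answer: is the string accepted by the unknown FA, c?"""
    return run(target, string)

def equivalence_query(target, conjecture, max_len=6):
    """Return None (a yes answer) if no string up to max_len separates the
    conjecture M' from c; otherwise return a counterexample y in their
    symmetric difference. (Bounded search: a simplifying assumption.)"""
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            if run(target, s) != run(conjecture, s):
                return s
    return None

# Target: the FA accepting all non-empty strings (as in Figure 3.7(b)).
target_delta = {}
for a in "01":
    target_delta[("q0", a)] = "q1"
    target_delta[("q1", a)] = "q1"
target = (target_delta, "q0", {"q1"})

# A (wrong) conjecture accepting nothing.
empty = ({("s", a): "s" for a in "01"}, "s", set())
# equivalence_query(target, empty) -> "0" (a shortest counterexample)
```

The counterexample "0" is exactly the kind of string L1 would then fold back into its observation table.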
[Figure 3.8 diagram: the teacher presents t describing c from C; the learner L builds hypotheses h in H, with access to oracle(s) in the environment.]
Figure 3.8: Learning with additional information obtained through access to an oracle in the environment.
3.1.1 L1: by Dana Angluin [Angluin 87]
The observation table (e.g. Figure 3.5) is the data structure used to store the information and learnt properties about the unknown FA, c. The rows and columns are represented by strings based on the information from the training set t and on the set of distinguishing strings learnt, respectively.
Each row is viewed as a vector with attribute values of '0' and '1' (i.e. the '0' and '1' table entries in each row corresponding to each column) representing a state in c. Thus, the string representing each row also represents a state in c. A string is said to represent a state q when it can be used to reach q from the initial state q0. The vectors are used to distinguish the rows and, thus, the states in c.
Alternatively, each row can be viewed as a set of distinguishing strings e, where each e represents a column in the table. A table entry of '1' or '0' in a row depends on whether e (for the corresponding column) is or is not a distinguishing string for the row (state represented).
There may be rows with the same vector (i.e. with the same set of distinguishing strings) and, by the Myhill-Nerode theorem of equivalence classes, these rows are said to be equivalent to each other, that is, to represent the same equivalence class x. Thus, we use the alternative view of a row above in referring to the distinct states represented by these rows. The distinct state, that is, the equivalence class x, is represented by the distinct row vector.
In Figure 3.9, there are only two distinct rows, s1 and s2, with vectors '0' and '1' and strings λ and 0 respectively. The rest of the rows have the same vector '1' as row s2. Thus, there are only two distinct states, represented by the sets of strings {λ} and {0,1,00,01} respectively. The sets of distinguishing strings are φ and {λ} for the two distinct states respectively.
[Figure 3.9 table:]
             e1 = λ
  s1 = λ      0
  s2 = 0      1
  s3 = 1      1
  s4 = 00     1
  s5 = 01     1
training set t = {-λ, +0, +1, +00, +01}; distinguishing strings = {λ}; states: s1, s2.
Figure 3.9: Observation table with five rows, representing two distinct states, with strings from t.
We now specify the three main elements of the observation table O, as shown in Figure 3.10, used by the learner L1 during learning to represent the properties and information of c:
1. A non-empty prefix-closed* set of strings, S.
This set starts with the null string, λ. The rows in the observation table are each represented by a string in S∪S.A.
There are two distinct divisions of rows in O: the upper division of the table (shown shaded in Figure 3.10) is represented by the strings in S, and the lower division is represented by the strings in S.A. Each row in the upper division is the particular state reachable through some s∈S from the initial state q0. The rows in the lower division of O therefore represent the next-states reached through transitions a∈A from the rows in the upper division.
Thus, S represents the states discovered (learnt) by the learner in the course of learning.
* A prefix-closed set is one in which every prefix of each member is also an element of the set.
2. A non-empty suffix-closed** set of strings, E.
This set also starts with the initial null string, λ. The columns in the observation table are represented by the strings in this set.
The vector for each row is a collection of strings from E. Thus the distinct subsets of E, represented by the distinct row vectors, are used to identify the distinct states represented by the strings in S ∪ S.A. In Figure 3.10, the subsets φ, {λ} ⊆ E (represented by the vectors '0' and '1' respectively) identify the two distinct states represented by {λ} and {0,1,00,01} in S ∪ S.A.
Thus, E represents the characteristics of the states learnt, through subsets of strings in E.
3. A mapping function, T: (S ∪ S.A).E → {0,1}, where T(x.e) = '1' if the string x.e ∈ c and '0' otherwise, with x ∈ S ∪ S.A.
Thus, this mapping function represents the transition function of the FA, δ(q0, x.e).
[Figure 3.10 table:]
             E = {λ}
  s1 = λ      0     } upper division (S)
  s2 = 0      1     }
  s3 = 1      1     } lower division (S.A)
  s4 = 00     1     }
  s5 = 01     1     }
S = {λ, 0}; E = {λ}; S∪S.A = {λ, 0, 1, 00, 01}; table entries: T(x.e) where x∈S∪S.A, e∈E.
Figure 3.10: Observation table O with the upper division (rows in S) and the lower division (rows in S.A) of rows from the set S∪S.A.
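The three elements S, E and T can be sketched as a small class, with the entries of T filled in by membership queries. The encoding (Python strings for rows and columns, a dict for T) is my own assumption for illustration, not the report's.

```python
ALPHABET = "01"

class ObservationTable:
    """Sketch of the triple (S, E, T). `member` is any membership oracle,
    e.g. a function that runs the unknown DFA on a string."""

    def __init__(self, member):
        self.member = member
        self.S = [""]                 # prefix-closed, starts with λ
        self.E = [""]                 # suffix-closed, starts with λ
        self.T = {}                   # (x, e) -> 0 or 1
        self.fill()

    def rows(self):
        """All row labels: the strings in S ∪ S.A."""
        return self.S + [s + a for s in self.S for a in ALPHABET]

    def fill(self):
        """Complete missing entries T(x.e) with membership queries."""
        for x in self.rows():
            for e in self.E:
                if (x, e) not in self.T:
                    self.T[(x, e)] = 1 if self.member(x + e) else 0

    def row(self, x):
        """The vector for row x, one attribute per column e in E."""
        return tuple(self.T[(x, e)] for e in self.E)

# For the FA accepting all non-empty strings (as in Figure 3.10):
table = ObservationTable(lambda s: len(s) > 0)
# table.row("") -> (0,)   table.row("0") -> (1,)   table.row("1") -> (1,)
```

The initial table has one column (E = {λ}) and three rows (λ in the upper division, 0 and 1 in the lower), matching the starting configuration described later in this section.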
Each of the following two properties of O, closed and consistent, is used by L1 as a guide to carry out updates (i.e. extensions of the rows and columns) during learning:
a) closed
As the rows in the lower division of O are the next-states of states in the upper division on taking transitions on symbols in A, the row vectors in the lower division must also exist in the upper division – the closed property of O.
Thus, for every string, s', in S.A, there is an s in S where both strings, s' and s, have the same vector. As shown in Figure 3.10, the vectors in the lower division of O exist in the upper division, where all next-states are existing states.
b) consistent
Each pair of rows with the same vector (i.e. the same subset of distinguishing strings) should represent the same state. The next-state vectors from this pair of rows should then be the same vector, representing the same next-state reached – the consistent property of O.
Thus, for any pair of strings, s1, s2 in S, with row(s1) = row(s2), we must have row(s1.a) = row(s2.a) for all a in A. As shown in Figure 3.11, the rows represented by the strings λ and 11 are consistent: both strings, representing the same distinct state with vector (1 0), move into the same next-states, represented by the same vectors.
** A suffix-closed set is one in which every suffix of each member is also an element of the set.
[Figure 3.11 table (columns λ and 0):]
         λ  0
  λ      1  0     (previous state)
  0      0  1
  1      0  0     (next-states of λ)
  11     1  0     (previous state)
  00     1  0
  01     0  0
  10     0  0
  110    0  1
  111    0  0     (next-states of 11)
Figure 3.11: An observation table which is consistent: the two rows λ and 11 in the upper division, represented by the same row vector (1 0), have next-state rows with the same vectors, representing the same next-states reached from both rows.
The observation table O is updated by extending the rows and columns (discovering more states and the characteristics of each state) using membership queries and equivalence queries, as shown in Figures 3.12 and 3.13. An update is carried out in two circumstances:
[Figure 3.12 tables:]
  T1:     λ                       T2:     λ
  λ       0                       λ       0
  0       1   (vector not in      0       1   (newly added row in the
  1       1    upper division)    1       1    upper division)
                                  00      1
  make closed: S ∪ {0}            01      1
T1: S = {λ}, S∪S.A = {λ, 0, 1}; T(λ.λ) = 0, T(0.λ) = 1, T(1.λ) = 1. T2: S = {λ, 0}, E = {λ}, S∪S.A = {λ, 0, 1, 00, 01}.
Figure 3.12: (a) Observation table T1, which is not closed: the vector (1) of the rows in the lower division does not appear in the upper division. (b) A closed T2, an extension of T1 with a new row added to the table representing the newly discovered state.
a) when either of the closed and consistent properties of O does not hold:
• O is not closed when a vector in the lower division is not represented in the upper division. A new state is said to be discovered (learnt), as it is a next-state that is not an existing state. In Figure 3.12(a), the rows with vector (1) in the lower division are not represented in the upper division, indicating that the next-state is not an existing state.
Then O is updated by
S ∪ {s'} where s' ∈ S.A
Thus, Figure 3.12(b) shows the updated O with a new string (row) in S and a new row in the upper division representing the new state learnt.
[Adding s' to S still maintains the prefix-closed property of the set, as s' is an element of S appended with an input letter from the alphabet.]
Note: Membership queries are used to complete the table entries whenever E or S is extended. The queries are made on strings in (S ∪ S.A).E, where a yes answer from the teacher means a '1' entry in O and vice versa.
• O is inconsistent when two rows with the same vector have a pair of different next-state vectors. This indicates that one of the pair of strings s, s' in S actually represents a different (newly discovered) state not among the existing states (rows). As in Figure 3.13(a), a pair of rows with the same vector lead to different next-states on transition '1' in O1.
Then O is updated by
E ∪ {a.e} where a is the transition symbol which brought the two states to different next-states and e is the element of E at which the next-state vectors differ (i.e. at one of the attributes).
Thus, Figure 3.13(b) shows the updated O2 with an extra column represented by the string '1', the transition symbol which brought the pair of rows to different rows. The element e of the previous E at which the difference is seen is λ. All the table entries resulting from this additional column are filled in using membership queries on the new (S ∪ S.A).E.
[The suffix-closed property of E is also maintained with 'a.e' added to E, since e, the suffix of 'a.e', was added to the set before 'a.e'.]
[Figure 3.13 tables:]
  O1:      λ  0                     O2:      λ  0  1
  λ        1  0                     λ        1  0  0
  0        0  1                     0        0  1  0
  01       0  0   (same:            01       0  0  0
  010      0  0    current states)  010      0  0  1   (new vector)
  1        0  0                     1        0  0  1
  00       1  0                     00       1  0  0
  011      0  1   (different:       011      0  1  0
  0100     0  0    next-            0100     0  0  0
  0101     1  0    states)          0101     1  0  0
The next-state rows 011 and 0101 differ at the column e = λ; make consistent: E ∪ {1.λ}.
Figure 3.13: (a) O1 is inconsistent, with different next-state vectors for a pair of rows (01 and 010) with the same vector representing the same state. (b) The updated O2, obtained with the newly learnt state represented by the new vector in the upper division of the new table.
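The two update rules can be sketched as functions that return the string to be added to S or E (refilling T with membership queries afterwards is omitted here). The encoding is assumed for illustration; the second example reproduces the E ∪ {0.λ} step applied to table O2 of the running example.

```python
ALPHABET = "01"

def row(T, E, x):
    return tuple(T[(x, e)] for e in E)

def closing_string(S, E, T):
    """A string s' in S.A whose vector is missing from the upper division;
    adding it to S keeps S prefix-closed. None if O is already closed."""
    upper = {row(T, E, s) for s in S}
    for s in S:
        for a in ALPHABET:
            if row(T, E, s + a) not in upper:
                return s + a
    return None

def consistency_suffix(S, E, T):
    """A suffix a.e on which two equal-vector rows differ; adding it to E
    keeps E suffix-closed. None if O is already consistent."""
    for s1 in S:
        for s2 in S:
            if s1 != s2 and row(T, E, s1) == row(T, E, s2):
                for a in ALPHABET:
                    for e in E:
                        if T[(s1 + a, e)] != T[(s2 + a, e)]:
                            return a + e
    return None

# O0 of the running example (not closed): the string "0" must join S.
T0 = {("", ""): 1, ("0", ""): 0, ("1", ""): 0}

# O2 of the running example (not consistent): rows "0" and "01" share the
# vector (0,) but T("00") = 1 while T("010") = 0, so "0" + λ joins E.
T2 = {("", ""): 1, ("0", ""): 0, ("01", ""): 0, ("010", ""): 0,
      ("1", ""): 0, ("00", ""): 1, ("011", ""): 0,
      ("0100", ""): 0, ("0101", ""): 1}
```

Each returned string is exactly the extension the text above prescribes: a next-state string for a missing vector, or the distinguishing suffix that separates two rows previously thought equivalent.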
b) when a counterexample y is returned from an equivalence query
S is extended during learning to include all the prefixes of y. Thus, the upper division of the table is extended with new strings, and membership queries are used to fill in all the new entries.
We now face the questions of 'when to build a tentative M' using the data structure?' and 'how is a tentative M' built from the data structure?'. The tentative M' in Figure 3.14(b) is built only when the observation table O (Figure 3.14(a)) has both the closed and consistent properties, that is, when all pairs of upper rows with the same vector lead to pairs of next-state rows with the same vector, and all vectors in the lower division are represented in the upper division.
The latter question is answered with a closed and consistent O. This closed and consistent O is used to build a tentative deterministic FA (DFA), M', with each distinct vector (i.e. distinct row) in the upper division representing a state in M'. Then M' is completed by adding transitions on all symbols in A from every state. The next-state is determined by looking up the row represented by the string s.a (i.e. the string resulting from taking a transition a from row s) in the table and reading off the corresponding vector.
In Figure 3.14(b), the conjecture M' is built from the closed and consistent observation table O in Figure 3.14(a). The states of M' are the distinct vectors in the upper division, each of which is shared among the strings representing rows of O. M' is the minimum DFA that accepts all non-empty strings, and an equivalence query on M' returns a yes answer.
[Figure 3.14: (a) observation table O; (b) conjecture M':]
  O:          Col1 (= λ)
  s1 = λ       0
  s2 = 0       1
  s3 = 1       1
  s4 = 00      1
  s5 = 01      1
S = {λ, 0}; E = {λ}. M' has two states: s1 = {λ} (the initial, non-accepting state) and s2 = {0, 1, 00, 01} (the final state), with transitions on 0,1 from s1 to s2 and a self-loop on 0,1 at s2.
Figure 3.14: (a) The observation table, O, for the unknown FA, c, recognising the set of all non-empty strings. The rows are elements of S ∪ S.A and the columns are elements of E. (b) The conjecture, M', constructed using the closed and consistent O. The final state is the one whose row has vector (1). λ, being the first row in the table, is always the initial state – a non-accepting state in this case. The next-state transitions are given by the strings {0, 1, 00, 01}.
Each conjecture, M', is then presented to the teacher in the form of an equivalence query. At this point, if the guess is correct, no counterexample is returned and M' is the minimum DFA equivalent to c, as in Figure 3.14(b), where an equivalence query on M' returns a yes. Thus L1 stops learning and outputs M' as its hypothesis, a minimum DFA representing the unknown FA.
However, if a counterexample is returned, an update is carried out on the observation table (i.e. all the counterexample's prefixes are added to S), and a further update if the updated table is not closed and/or not consistent. The next conjecture is built when both properties are satisfied. Membership queries are used to fill in the new entries for the new rows obtained from the extended S, where the counterexample and its prefixes are the learner's choice of strings to present in membership queries.
The minimality of the number of states in the conjectured DFAs is maintained by the closed and consistent properties of the observation table. Through the closed and consistent tests on every updated table, two rows that have the same vector are considered as belonging to the same equivalence class by the Myhill-Nerode theorem (i.e. a class x with the same behaviour on the set of distinguishing strings). Thus, building a conjecture only once a closed and consistent observation table is obtained after every update, and taking only the distinct vectors as representing the distinct states when building a DFA, always results in a minimum DFA.
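Constructing M' from a closed and consistent table can be sketched as follows, using the distinct upper-division vectors as states. The encoding is an assumption of mine; the example table is the one of Figure 3.14 for the FA accepting all non-empty strings.

```python
ALPHABET = "01"

def row(T, E, x):
    return tuple(T[(x, e)] for e in E)

def build_conjecture(S, E, T):
    """Build M' from a closed and consistent observation table: states are
    the distinct upper-division vectors; the next-state of (state of s, a)
    is the vector of row s.a; a state accepts iff its λ-column entry is 1."""
    states = {row(T, E, s) for s in S}
    delta = {(row(T, E, s), a): row(T, E, s + a)
             for s in S for a in ALPHABET}
    start = row(T, E, "")               # λ is always the initial state
    accepting = {q for q in states if q[E.index("")] == 1}
    return delta, start, accepting

def accepts(dfa, string):
    delta, q, accepting = dfa
    for a in string:
        q = delta[(q, a)]
    return q in accepting

# Figure 3.14's closed and consistent table.
S, E = ["", "0"], [""]
T = {("", ""): 0, ("0", ""): 1, ("1", ""): 1,
     ("00", ""): 1, ("01", ""): 1}
M = build_conjecture(S, E, T)
# M is the two-state minimum DFA accepting all non-empty strings.
```

Because only distinct vectors become states, the conjecture has exactly one state per Myhill-Nerode class discovered so far, which is why it is always a minimum DFA for the behaviour recorded in the table.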
'How do we start learning?'. This question brings us to the important role of the null string, λ, with which both S and E start as their first element. This string not only brings us to discovering the initial state q0 (being the first row in the table) but also serves as the distinguishing string used to decide which of the distinct vectors are accepting or rejecting states. Being the first element of E allows the string of every row to be queried by the learner, through a membership query, as to whether it is accepted or rejected by c. Thus, a row which has λ in its set of distinguishing strings – indicated by a '1' entry in the λ column of its vector – must represent an accepting state, as the string represented by that row is accepted.
In Figure 3.14(a), the vector (1) for the row with string '0' represents an accepting state, as '0' is accepted at the column represented by λ, which is also in the set of distinguishing strings for row '0', indicated by the '1' entry. The learning process thus starts with S and E each having only one element (i.e. the null string) and with the initial table having only one column and three rows (one row for λ in the upper division and two next-state rows in the lower division).
Another illustration is shown in Figure 3.15, with the learner trying to learn the FA that accepts all strings with an even number of 0's and 1's. The initial table is constructed as O0, which is not closed. L1 updates the table until an equivalence query initiates termination with a yes answer for conjecture M1, after five observation tables and two conjectures.
The examples required by the learner are obtained through membership queries and counterexamples, both drawn from a training set t consisting of positive and negative examples (i.e. the '1' and '0' entries for accepted and rejected strings respectively).
[Figure 3.15 run:]
  O0:   λ          O1:   λ
  λ     1          λ     1
  0     0          0     0
  1     0          1     0
                   00    1
                   01    0
O0: S = {λ}, E = {λ}; not closed, make closed: S ∪ {0}. O1: S = {λ, 0}, E = {λ}; closed and consistent. Conjecture M0: the equivalence query answers no, with counterexample y = 010.
  O2:     λ        O3:     λ  0        O4:     λ  0  1
  λ       1        λ       1  0        λ       1  0  0
  0       0        0       0  1        0       0  1  0
  01      0        01      0  0        01      0  0  0
  010     0        010     0  0        010     0  0  1
  1       0        1       0  0        1       0  0  1
  00      1        00      1  0        00      1  0  0
  011     0        011     0  1        011     0  1  0
  0100    0        0100    0  0        0100    0  0  0
  0101    1        0101    1  0        0101    1  0  0
O2: the prefixes of y are added to S; not consistent, make consistent: E ∪ {0.λ}. O3: not consistent, make consistent: E ∪ {1.λ}. O4 is closed and consistent, with S = {λ, 0, 01, 010} and E = {λ, 0, 1}. The conjecture M1 built from O4 has four states (λ, 0, 01, 010), and the equivalence query answers yes.
Figure 3.15: A running example of learning the unknown FA that accepts the set of all strings with an even number of 0's and 1's.
3.1.2 L2: by Kearns and Vazirani [Kearns et al 94]
This algorithm uses the same principles as L1 (i.e. membership and equivalence queries with positive and negative examples), but the data structure used to construct the tentative FA, M', is a classification tree, as shown in Figure 3.16. The leaves of the classification tree represent the states learnt (known) in c, and the internal nodes represent the distinguishing strings required to distinguish (discover) the states in c. Each node and leaf is represented by a string, based on the information received from counterexamples and from membership queries on chosen strings.
[Figure 3.16 appears here: the classification tree T1, with root d1 = λ, internal nodes
d2 = 0 and d3 = 1, and leaves s1 = λ, s2 = 0, s3 = 1 and s4 = 01; distinguishing strings
= {λ, 0, 1}; training set t = {+λ, -0, -1, -01}.]
Figure 3.16: Classification tree, T1, with 3 nodes representing 3 distinguishing strings and 4
leaves each representing an equivalence class.
The Myhill-Nerode Theorem is also adopted by L2, that is, the algorithm maintains the set of
distinguishing strings that distinguishes between the states represented as leaves in
the tree. Each leaf can be viewed as an equivalence class x containing a set of strings
(representative strings) having the same behaviour (distinguishability) with respect to c and
the set of distinguishing strings. Thus, each node is seen as the distinguishing string between
the leaves in its right and left subtrees, which correspond to accepted and rejected strings
respectively. In Figure 3.16, the node d3, represented by string 1, distinguishes between the
leaves, 1 and 01, in its right and left subtrees respectively, with respect to the FA that
accepts all strings with an even number of 0’s and 1’s.
The next state which a string x reaches on transition symbol a is determined by
traversing the tree with the string xa, starting from the root, until a leaf s is reached. At each
node d visited, the next path to take depends on whether the string xad (i.e. xa concatenated
with the node’s distinguishing string d) is accepted or rejected by c: the right path is taken if
xad is accepted by the unknown FA and the left path otherwise. The leaf s reached is the
equivalence class to which x belongs. Thus, xa is said to represent the state represented by s.
Membership queries are used here to decide which path to take, with xad being the string of
the learner’s choice.
As in Figure 3.16, the string 01, when traversed through the tree, ends up in leaf s4:
the string 011 is rejected at node d3 in T1, so the left path is taken to leaf s4. However, in
T2 from Figure 3.17, it is accepted, as the traversal reaches the right leaf s1 from the root d1.
Thus, the string 01 is said to represent states s4 and s1 in T1 and T2 respectively, with respect
to the FAs being learnt.
[Figure 3.17 appears here: the classification tree T2, with root d1 = λ and leaves s2 = λ and
s1 = 01; distinguishing strings = {λ}; training set t = {-λ, +01}.]
Figure 3.17: Classification tree, T2, with one node and two leaves represented by the strings in
D and S respectively.
The classification tree, T, maintains two main elements to represent the properties
learnt of c and the information received from the training set. These elements are specified
as follows:
1. a set of access strings, S
The initial set contains only one string, the null string λ. The leaves in T are each
represented by a string in S; all the leaves thus represent the known states of the
unknown FA discovered so far. The leaves in the left subtree of the root are all
s in S that are rejected by c, and the leaves in the right subtree of the root are the
strings that are accepted by c. Thus, S is subdivided into two subsets of accepting
and rejecting states (i.e. the leaves).
From Figure 3.17, S is the set of strings representing the leaves, and the two subsets
for T2 are {λ} and {01}, which are the negative and positive examples from t
respectively.
2. a suffix-closed set of distinguishing strings, D
The initial D starts with the null string, λ, and is used to distinguish each pair of
access strings in S. The strings in D represent the nodes of T. Each node, d,
has exactly two children, distinguishing a pair of strings in S such that the right
subtree consists of the strings s for which s.d is accepted by the unknown FA,
and the left subtree of those for which s.d is rejected.
As in Figure 3.17, the root is the distinguishing string for the leaves in its right
and left subtrees, where the strings λ and 01 representing the leaves also
represent the rejecting and accepting states respectively.
The classification tree is used to build every conjecture M’ except the initial
conjecture (i.e. the learner’s first guess) of the FA, M0. There are only two possible initial
conjectures for M0, as shown in Figure 3.18(a) and Figure 3.18(b), each having a single start
state with all transitions looping to itself. This initial guess depends on a membership query
on the null string λ. Thus, M0 either accepts or rejects the set of all strings depending on
whether λ is accepted by the unknown M, that is, the initial state is an accepting state if λ is
accepted by M and a rejecting state otherwise.
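The choice of initial conjecture can be sketched as a few lines of Python. This is an illustrative stand-in, not the report's own notation: `member` plays the teacher's membership oracle, and the conjecture is returned as a (start state, transition table, accepting set) triple.

```python
def initial_conjecture(member, alphabet):
    # M0 has a single start state (labelled by λ) with every transition
    # looping back to itself; whether it accepts everything or nothing is
    # decided by a single membership query on λ.
    q0 = ""
    delta = {(q0, a): q0 for a in alphabet}
    accepting = {q0} if member("") else set()
    return q0, delta, accepting

# For an unknown FA that rejects λ, M0 accepts the empty set:
q0, delta, accepting = initial_conjecture(lambda w: w != "", ("0", "1"))
print(accepting)   # prints set(): the single state is rejecting
```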
[Figure 3.18 appears here: (a) with t = {+λ}, M0 is the single-state DFA (self-loops on 0,1)
whose state is accepting; (b) with t = {-λ}, M0 is the single-state DFA whose state is
rejecting. In each case the corresponding tree T0 has root λ with one known leaf and one
unrepresented leaf y.]
Figure 3.18: (a) M0 accepting all strings, as t provides the positive example +λ. (b) M0
accepts the empty set, as t provides the negative example -λ. The corresponding tree T0 is an
incomplete tree, to be completed with the counterexample returned after an equivalence query
on M0.
As in L1, for every conjecture M’ produced, the learner presents an equivalence query
on M’. L2 terminates if no counterexample is returned. Thus, the first guess from the
learner is whether all strings are accepted or rejected by the unknown M. The first
counterexample represents the remaining unrepresented leaf y in the incomplete T0 for the
initial conjecture, as in Figure 3.18, where y is one of the leaves in each tree. Each
subsequent counterexample y returned is analysed using the divergence concept (see below)
and the current classification tree, T.
From Figure 3.19(a), the unknown FA accepting all non-empty strings results in the
initial M0 being the DFA accepting the empty set. An equivalence query on M0 returns the
counterexample string +01. As λ is rejected at the root, the first classification tree, T0, as
shown in Figure 3.19(b), has λ as its left child and the counterexample 01 as its right child.
[Figure 3.19 appears here: (a) the initial conjecture M0 (a single rejecting state looping on
0,1) with S = {λ}, D = {λ}; the equivalence query returns the counterexample y = 01; (b) the
tree T0 with root λ, left leaf λ and right leaf 01, giving S = {λ, 01}, D = {λ}; (c) the
conjecture M’ built from T0, on which the equivalence query answers yes.]
Figure 3.19: An unknown FA that accepts the set of all non-empty strings is being learnt; (a)
The initial conjecture, M0; (b) The classification tree, T0, with 2 leaves from S and a node
(root) from D; (c) The conjecture, M’, constructed using the classification tree in (b), with
the first leaf, λ, as the initial state and the second leaf, represented by the string 01, as the
other state in M’. As 01 is a leaf in the right subtree of the root, the final state is represented
by the leaf 01.
In Figure 3.19(b), T0 is then used to build a tentative deterministic FA (DFA), M’
(Figure 3.19(c)), using the leaves λ and 01 to represent the states of M’, where all states in
M’ are labelled with the leaves of T0. The next-state transition for every state in M’ is
determined by traversing T0 with the string representing that state appended with a transition
symbol a. The strings used for the next-state transitions are {011, 010, 0, 1}. The next
equivalence query returns a ‘yes’, which terminates learning, as in Figure 3.19(c).
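The construction of a conjecture from the tree can be sketched as follows; this is an illustrative Python rendering under our own names (`Node`, `Leaf`, `sift`, `build_conjecture`), with `member` standing in for the teacher of Figure 3.19's target language of all non-empty strings.

```python
def member(w):
    return w != ""          # target FA: accept every non-empty string

class Node:
    def __init__(self, d, left, right):
        self.d, self.left, self.right = d, left, right

class Leaf:
    def __init__(self, s):
        self.s = s

def sift(tree, x):
    # At a node labelled d, go right iff x.d is accepted by the unknown FA.
    while isinstance(tree, Node):
        tree = tree.right if member(x + tree.d) else tree.left
    return tree.s

def build_conjecture(tree, leaves, alphabet):
    # States are the tree's leaves; the next state for (state s, symbol a)
    # is the leaf reached by sifting s + a. Accepting states are the leaves
    # whose access strings are themselves accepted.
    delta = {(s, a): sift(tree, s + a) for s in leaves for a in alphabet}
    accepting = {s for s in leaves if member(s)}
    return delta, accepting

# T0 of Figure 3.19(b): root λ, left leaf λ (rejected), right leaf 01.
T0 = Node("", left=Leaf(""), right=Leaf("01"))
delta, accepting = build_conjecture(T0, ["", "01"], ["0", "1"])
print(delta, accepting)   # every transition out of λ leads into state 01
```

As in the text, the sifted strings are 0, 1, 010 and 011, and the resulting two-state DFA accepts exactly the non-empty strings.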
From every counterexample y, each prefix of y is analysed to determine the prefix yi
that leads to different states when it is tested on both the current T and the conjecture M’
that returned y in the equivalence query. Each test results in a pair of states: a leaf from T
and a state from M’. Since M’ is built by taking all the leaves in T to represent the states of
the conjecture, the pair of states from the tests should be the same state for a given string if
T and M’ are equivalent. Thus, there must be a node and transition symbol (path) that yi
takes leading to the first differing pair of states. This is called the divergence point.
Thus, a counterexample indicates that somewhere along the string y = y1…yn, for n
input symbols in y, at one of the prefixes, ym, M’ and T diverge into different paths leading
to different states. The divergence point is ym-1, the immediate prefix before ym, where ym
is the prefix at which the pair of different states sM and sT is obtained from M’ and T
respectively in the test. The current tree, T, is used to trace the common ancestor of both sM
and sT, that is, the node d that distinguishes the leaves represented by sM and sT. Both d and
ym-1 are used to update the classification tree. Figure 3.20 shows how the divergence point is
found from a counterexample y.
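The prefix-by-prefix search for the divergence point can be sketched as follows. This is an illustrative Python stand-in, not the report's construction: `delta` is a toy conjecture's transition table, and `classify` plays the role of sifting a string through the current tree (here answered directly from the true equivalence classes of the even-0's/even-1's language).

```python
def run(delta, q0, w):
    # Run the conjecture's transition table on the string w.
    q = q0
    for a in w:
        q = delta[(q, a)]
    return q

def divergence_point(y, delta, q0, classify):
    # Test every prefix y[:m] in both the conjecture and the tree; return
    # (divergence point y_{m-1}, s_M, s_T) at the first disagreement.
    for m in range(1, len(y) + 1):
        s_M = run(delta, q0, y[:m])    # state reached in the conjecture M'
        s_T = classify(y[:m])          # leaf reached by sifting in T
        if s_M != s_T:
            return y[:m - 1], s_M, s_T
    return None

def classify(x):
    # Toy stand-in for the sift: the true class of x for the language of
    # strings with an even number of 0's and 1's, named by a representative.
    reps = {(0, 0): "", (1, 0): "0", (0, 1): "1", (1, 1): "01"}
    return reps[(x.count("0") % 2, x.count("1") % 2)]

# A two-state conjecture that only tracks the parity of 0's:
delta = {("", "0"): "0", ("", "1"): "",
         ("0", "0"): "", ("0", "1"): "0"}
print(divergence_point("11", delta, "", classify))
```

On the counterexample 11, the conjecture and the tree already disagree at the first symbol, so the divergence point returned is the empty prefix λ.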
[Figure 3.20 appears here: the tree T’ and conjecture M’; the equivalence query on M’
returns the counterexample y = 11. For the prefixes yi, i = 1, 2: y1 = 1 gives sM = 01 and
sT = 01; y2 = 11 gives sM = 01 and sT = λ. The divergence point is y1, with common
ancestor d = 0.]
Figure 3.20: The unknown FA accepts all strings with an even number of 0’s and 1’s. The
conjectured M’ is returned with counterexample y = 11 from the equivalence query. Each
prefix of y is traversed in T’ and also in M’ to find the divergence point, y1 = 1, as divergence
occurs at y2. The common ancestor of 01 and λ, to which y2 diverged, is 0 in T’.
As the learner learns more information and properties of c, the new information and
properties (regarding new states) are added to the tree by extending the nodes and leaves.
These updates are carried out using only the information from the equivalence query, that is,
the counterexample. The counterexample is analysed for the divergence point, and the results
of the divergence analysis are the common ancestor d and the prefix ym-1 representing the
divergence point. The tree is then updated using d and ym-1 as follows:
a) a new access string, ym-1 (i.e. a prefix of y), is added to S
A new state represented by ym-1 has been discovered, and a new leaf is added to
T as S is extended.
[S is extended to include the prefix of a counterexample representing a newly
discovered state of c, as shown in Figure 3.21 by the shaded leaf 1.]
b) a new distinguishing string, a.d, is added to D
where d: the common ancestor of sT and sM
a: the input symbol that leads ym-1 to sT and sM (i.e. ym = ym-1.a)
ym = y1…ym (the mth prefix of y)
sT and sM : the states reached in T and M’
[D is extended when the counterexample is returned and a prefix of the
counterexample (to be included in the extension of S) is identified as the point of
divergence. D is extended with the new distinguishing string, a.d, where d is the
common ancestor node and ‘a’ is the input symbol leading from the divergence
point to the two different states reached. The new node is shaded in Figure 3.21.]
The extension involves replacing sT by the new string a.d, forming a new
internal node. The leaves, sT and ym-1, are the children of the new node, and their positions
depend on whether each, concatenated with a.d, is accepted, as determined through
membership queries. The suffix-closed property of D maintains reachability from other states
to the final state(s) each time a new state (leaf) is discovered (i.e. added to S). Figure 3.21
shows how the tree T’ in Figure 3.20 is updated and used to build a new conjecture M”.
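The leaf-splitting update can be sketched as follows; this is an illustrative Python stand-in under our own names (`member`, `Node`, `Leaf`, `split_leaf`), with `member` answering for the running example language, and the worked step taken from the T0-to-T1 transition of Figure 3.22 (counterexample 00, new access string 0, new distinguishing string 0).

```python
def member(w):   # running example: even number of 0's and of 1's
    return w.count("0") % 2 == 0 and w.count("1") % 2 == 0

class Node:
    def __init__(self, d, left, right):
        self.d, self.left, self.right = d, left, right

class Leaf:
    def __init__(self, s):
        self.s = s

def split_leaf(tree, s_T, new_access, new_d):
    # Find the leaf holding s_T and replace it with an internal node
    # labelled new_d whose children are s_T and the new access string.
    if isinstance(tree, Leaf):
        if tree.s != s_T:
            return tree
        old, new = Leaf(s_T), Leaf(new_access)
        # new_d distinguishes the two: the accepted concatenation goes right.
        if member(new_access + new_d):
            return Node(new_d, left=old, right=new)
        return Node(new_d, left=new, right=old)
    tree.left = split_leaf(tree.left, s_T, new_access, new_d)
    tree.right = split_leaf(tree.right, s_T, new_access, new_d)
    return tree

# T0 of Figure 3.22 (leaf 01 on the left, leaf λ on the right); splitting
# the leaf for 01 with new access string 0 and distinguishing string 0
# yields T1 with S = {λ, 01, 0} and D = {λ, 0}.
T0 = Node("", left=Leaf("01"), right=Leaf(""))
T1 = split_leaf(T0, "01", "0", "0")
```

Since 00 is accepted but 010 is rejected, the membership query places the new leaf 0 in the right subtree of the new node and the old leaf 01 in its left subtree.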
[Figure 3.21 appears here: the updated tree T”, with the new internal node (labelled by the
new distinguishing string) and the new leaf for access string 1 both shaded, and the new
conjecture M” built from it.]
Figure 3.21: The updated tree from Figure 3.20, T”, has a new node and a new leaf (in
shades) added after a divergence point was found in the previous counterexample. The new
conjecture M” is presented in an equivalence query and no counterexample is returned.
We illustrate another example of L2 learning the FA that accepts the strings with an even
number of 0’s and 1’s in Figure 3.22 below. The divergence analysis is shown, with the
prefix at which divergence occurs represented by ym.
Conjecture | Equivalence query    | Classification tree
M0         | no, y = 01           | T0: S = {λ, 01}, D = {λ}
M1         | no, y = 00, ym = 00  | T1: S = {λ, 01, 0}, D = {λ, 0}
M2         | no, y = 11, ym = 11  | T2: S = {λ, 01, 0, 1}, D = {λ, 0, 1}
M3         | yes                  | (learning terminates)
Figure 3.22: Running example of learning the unknown FA that accepts the set of all strings
with an even number of 0’s and 1’s.
Therefore, the access strings in S are prefixes of counterexamples, and the size of S is
also the number of counterexamples returned (or the number of equivalence queries made).
L2 maintains S such that each string represents a distinct state of the minimum DFA for M;
hence the size of S is at most the size of the minimum DFA for M at any point during the
learning process. Each counterexample thus produces a new access string, which immediately
yields a new conjecture with a newly discovered state.