A Survey of Learning Methods In
 Learning Finite Automata (FA)




      Submitted By: Priscilla Chia
Supervised By: Dr. Suresh K. Manandhar




    Final Year Project Report 1998




     Computer Science Department
         University Of York
Abstract

This report is a survey of learning methods used in learning Finite Automata (FA). The
learning issues in machine learning are highlighted, and the methods surveyed are
analysed according to how these issues are dealt with. The report also looks at how
additional information is learnt from the information given by the teacher. We
survey six algorithms with respect to the learning methods employed in the two stages of the
learning process: building a hypothesis and evaluating the hypothesis. The methods
are categorised into probabilistic and non-probabilistic. We conclude with a
discussion on the ability of a hypothesis to rectify errors from past experience
rather than only learn from new experience.
Acknowledgement


I am very grateful to my supervisor, Dr. Suresh Manandhar, for his invaluable help
and advice throughout this project. I would also like to thank my friends and family
for their support, especially Mum and Dad at home.
CONTENTS


1. Introduction

2. Learning
   2.1 Learning in General
   2.2 The Issues in Learning
   2.3 Learning Finite Automata (FA)
   2.4 Learning Framework
       2.4.1 Identification in the Limit
       2.4.2 PAC View
       2.4.3 Comparison
       2.4.4 Other Variations of Learning Framework
   2.5 Results on Learning Finite Automata

3. Non-Probabilistic Learning for FA
   3.1 Learning with Queries
       3.1.1 L1: by Dana Angluin [Angluin 87]
       3.1.2 L2: by Kearns and Vazirani [Kearns et al 94]
       3.1.3 Discussion
   3.2 Learning without Queries
       3.2.1 L3: by Porat and Feldman [Porat and Feldman 91]
       3.2.2 Running L3 on Worked Examples
       3.2.3 Discussion
   3.3 Homing Sequences in Learning FA
       3.3.1 Homing Sequence
       3.3.2 L4: No-Reset Learning Using Homing Sequences
       3.3.3 Discussion
   3.4 Summary (Motivation Forward)

4. Probabilistic Learning
   4.1 PAC Learning Using Membership Queries Only
       4.1.1 L5: A Variation of the Algorithm L1 [Angluin 87; Natarajan 90]
       4.1.2 Discussion
   4.2 Learning through Model Merging
       4.2.1 Hidden Markov Model (HMM)
       4.2.2 Learning FA: Revisited
       4.2.3 Bayesian Model Merging
       4.2.4 L6: by Stolcke and Omohundro [Stolcke et al 94]
       4.2.5 Running of L6 on Worked Examples
       4.2.6 Discussion
   4.3 Summary
   4.4 Chapter Appendix

5. Conclusion and Related Work

References




1. Introduction

The class of finite state automata (FA) is studied here from a machine learning perspective, which
involves both the learning issues and the properties of that particular class. This report is a survey of
the learning methods studied and employed in learning FA. We give an overview of learning
in general in Section 2.1 and of the issues in learning in Section 2.2, with their application to
learning FA in Section 2.3. The two frameworks employed extensively in machine
learning, learning in the limit and PAC learning, are explained in Section 2.4. The
complexity of learning FA itself has been studied and the results are given in Section 2.5.

        The learning methods surveyed are divided into two main chapters, in which the various
learning algorithms are studied and compared. Non-probabilistic learning is discussed in Chapter 3,
with the motivation towards probabilistic learning given in Section 3.4 before probabilistic
learning is discussed in Chapter 4.

       A summary of the survey is given at the end of each chapter, and the conclusion, with related
work in machine learning, is in Chapter 5. Six algorithms are discussed, and each is referred to as
L1-L6 throughout this report, corresponding to the following algorithms:

        •   L1: [Angluin 87]
        •   L2: [Kearns et al 94]
        •   L3: [Porat et al 91]
        •   L4: [Rivest et al 87]
        •   L5: [Angluin 87; Natarajan 91]
        •   L6: [Stolcke et al 94]

We follow the standard definition of FA as studied in automata theory [Hopcroft et al 79;
Trakhtenbrot 73] and use the following terminology and notation for any FA M:

set of states, Q  : the finite set of states q in the FA
final state       : the state reached once M has read an input string
initial state, q0 : the start state for all input strings
accepting state   : a final state which accepts a string (the string is recognised by M)
rejecting state   : a final state which rejects a string (the string is not recognised by M)
transition, δ(q,a): the move from a state q on a symbol a from the alphabet set
alphabet set, A   : the finite set of symbols; the binary set {0,1} is used throughout this report
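
In the compact tuple notation used by the cited textbooks, this terminology can be summarised as follows (a standard restatement rather than a quotation from those sources):

```latex
M = (Q, A, \delta, q_0, F), \qquad \delta : Q \times A \to Q, \qquad F \subseteq Q
```

where F is the set of accepting states and M accepts a string x ∈ A* exactly when the extended transition function δ*(q0, x) lands in F.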




2. Learning


2.1 Learning in General
Learning in general means the ability to carry out a task with improvement from previous
experience. It involves a teacher and a learner. The learning process usually takes place in an
environment which constrains both the communication between the learner and teacher (how
the teacher is to teach or train the learner, and how the learner is to receive input from the
teacher) and the elements or tokens of information that are communicated between them:
a class of objects (i.e. a concept) and a description of a subclass (i.e. an object).

        Example 1(a): Environment for learning a class of vehicles
              The environment in which the learning process takes place involves a teacher
              giving descriptions of ships and a learner drawing a conclusion about what a ship
              looks like from the descriptions received. The teacher describes ships (i.e. the
              subclass of vehicles) by providing pictures (i.e. using pictorial means) of
              ships, and the learner responds (i.e. communicates with the teacher) through
              some form of visual mechanism (i.e. by detecting shapes or colours of objects
              in pictures) to analyse the pictures received from the teacher. This environment
              only allows the teacher and learner to communicate through pictures, whereas
              in another environment other forms of encoding of descriptions may be used
              (e.g. tables of attributes - width, length, windows, engine capacity etc.)


[Figure 2.1: diagram omitted - elements c1, ..., cm of class A are mapped onto representations r1, ..., rp in the finite class B]

Figure 2.1: Finite representation of a possibly infinite class A of m elements cs, 1 ≤ s ≤ m,
where m may be finite or infinite, by another finite class B with p elements ri, 1 ≤ i ≤ p,
where p is some finite number.


         The learner is to learn an unknown subclass from the class with help (i.e. some
form of training) from the teacher, who provides descriptions of the unknown subclass. Since
the subclass to be learnt may be infinite in size, a finite representation is needed to represent
the probable subclasses hypothesised during learning. The task of the learner is to hypothesise
a (finite) representation of the unknown subclass, as shown in Figure 2.1, where the possibly
infinite class A is represented by a finite class B. Thus learning the class A is to learn its
class B representation.

         In Example 1(a) above, the learner is to produce a hypothesis of the ship subclass. A
finite representation for the hypothesis is necessary for the unknown subclass chosen to be
learnt, as not every subclass can be finitely presented or described (i.e. by presenting all
elements of the unknown subclass to the learner) by the teacher, as shown in the class of
vehicles above where the subclass (i.e. ships) is infinite. Note that there are finitely many
different ‘types’ of ships as there are finitely many different ‘types’ of vehicles where in both
cases the ‘ships’ and ‘vehicles’ are infinite but the ‘types’ are finite.

         Instances from the class used to describe a particular subclass are called examples.
These examples are usually classified by a teacher (i.e. a human or some operation or
program available in the environment) with respect to the particular subclass being learnt as
positive (a member of the subclass) or negative (non-member of the subclass) examples. A set
of classified or labelled examples is called the example space.

         Only a subset of the example space is used by the learner each time an unknown subclass
is to be learnt. This subset, used in training the learner, is known as the training set. Each
example space contains information (implicit properties or rules that may be infinite) relevant
to distinguishing one subclass from another in the given class. The constraints in the
environment also determine which type of examples (i.e. positive only, negative only or both)
can be provided by the teacher to form the example space. For instance, it may not be possible
to collect negative examples, so the teacher is restricted to positive examples only, which may
only be a partial set (i.e. not all members of the unknown subclass are known even to the
teacher).

        The learner or learning algorithm is therefore required to learn the implicit properties
or rules of a particular subclass from the information given (built into what is called
experience). The properties learnt are stored in the learner's hypothesis (i.e. the conclusion or
explanation) drawn about the sub-class.

        An infinite number of hypotheses, in any form of representation (i.e. decision tree,
propositional logic expression, finite automaton etc.), could be produced that hold the properties
obtained from the information received. This results in searching a large hypothesis space. It
should be noted that the hypothesis space could be expressed in the same descriptive
language used to describe the unknown subclass: in Example 1(a), if the class of vehicles is
represented in the form of propositional logic expressions, then the hypothesis may be the
exact propositional logic expression that represents the unknown subclass chosen (i.e. ships),
or some other form of representation that is equivalent to the propositional logic
expressions used.

        A set of criteria is necessary to limit (reduce) the size of the search space. Given a
reduced hypothesis space that satisfies the set of criteria, a learning goal is then needed for
selecting and justifying a hypothesis from the hypothesis space as the finite representation of
the unknown subclass. Together with other knowledge about the rules for manipulating the
descriptive language, the set of criteria and the learning goal form what is called the background
knowledge that guides the learner in the learning process.

        Example 1(b): Learning process of Example 1(a)
              Suppose that the hypothesis for ships takes the form of a collection of a finite
              number of attributes for a ship (i.e. size, engine capacity, shape, weight, anchor
              and other properties of a ship). The criteria for the hypothesis space could
              include only hypotheses that fulfil, say, five out of six attributes used, and the
              learning goal could be to select a hypothesis that satisfies the criteria with the
              simplest data structure of some form and that correctly identifies, say, the next
              ten examples. There could be an infinite number of attributes to choose from,
              but the criteria in the background knowledge reduce the hypothesis space.


         Thus, the learning scenario (Figure 2.2) consists of a given class, C, of subclasses
and an example space, T, from which the training set, t, is drawn. Examples in T are used to
describe an unknown subclass, c, in C. The aim of the learning algorithm, L, is to produce a
hypothesis, h, from a hypothesis space, H, using information from t and satisfying the
conditions set out in the background knowledge. L is to build an h that is equivalent to c.
Ideally, h is to be exactly the same as c, or h is the exact representation of c. Due to the
incompleteness of t received (i.e. the teacher usually does not have complete information
regarding c), h is usually taken to be equivalent to c to some extent expressed in the background
knowledge. In both cases, learning relies on the information contained in t and given by the
teacher.

        L:TH           where T : sets of t for a sub-class, c, in C. Also called, example
        space.
        L(t) = h (≡R c)       t∈T
                              h∈H
                              c∈C
                              ≡R: the equivalence relation specified by the learning goal
                              used in selecting hypothesis, h, using t. The selected h
                              contains learnt properties or rules of c that are obtained
                              through information from t.




[Figure 2.2: diagram omitted - the environment contains the class C (with the unknown sub-class c), the teacher, the example space T (with training set t), the learner L and the hypothesis space H (with hypothesis h); the background knowledge comprises the set of criteria, the learning goal and the type of representation (the descriptive language for H)]

Figure 2.2: The learning scenario of a learner or learning algorithm, L, within a given
environment.




2.2 The Issues in Learning

The algorithms used in learning are 'ways' of achieving the learning goal under the set of
criteria in the background knowledge. 'Ways' here are methods of constructing a hypothesis
from the information in the set of examples. As shown in Figure 2.2, the learning algorithm, L,
has two distinct phases in the learning process:
         Phase 1: forming a hypothesis, h, from the set of examples, t.
                 [shown as the arrow from T to H in Figure 2.2]
         Phase 2: selecting and justifying h as a finite representation of the unknown subclass,
                  c.
                 [shown as the arrow from H to C in Figure 2.2]

The nature (or design) of L, and the feasibility of the learning problem itself, is determined by
the following factors:
         1. Example space, T
            Usually considered arbitrary, where various kinds of information (training sets) can
            be used to describe c.
         2. Classification of the training set, t, usually by a teacher or an operation carried out in
            the environment with respect to a particular sub-class, c.
            • Noisy examples are considered where the teacher may classify instances
                wrongly
            • Type of examples to be presented (i.e. positive only, negative only or both)
         3. Presentation of t to L
            Whether elements from t are fed into L one by one, in small groups or as a whole
            batch, and whether the elements are presented in any particular order (i.e. in
            lexicographic order or shortest length first)
         4. The size of t
            Intuitively, a small t should suffice for an efficient and 'intelligent' learner
            or learning algorithm. In machine learning, the size of t contributes to the
            computational complexity of a learning algorithm: the larger t is, the longer (or
            more complicated) the computation.
         5. The choice of representation for the hypothesis space, H
            This involves the issues of how much information needs to be captured, and can be
            captured, by a particular choice of representation. A rich descriptive language,
            though ideally required, means more complex computation and larger resource (i.e.
            memory storage) requirements, whereas a simple form of representation may not
            capture sufficient information to learn.
         6. Selection criteria for a hypothesis, h, and its justification as an equivalent of c.

        All of the above except the last factor constitute a major part of the design of an
algorithm in machine learning, to be exhibited in Phase 1 of L. The last factor, and also the
choice of representation for H, are usually vital in Phase 2 of the learning process, where
evaluation is carried out by human experts or some known mechanism such as statistical
confirmation or analysis.

         The learner, L, is said to be able to learn a class in the given environment if it can
learn (i.e. by producing a hypothesis that satisfies both the criteria and learning goal in the
background knowledge provided a priori) any subclass chosen from the class.




2.3 Learning Finite Automata (FA)

This report investigates the learning process in a particular environment setting (Figure 2.3):
- Teacher: the source of the example space, T, where the description of the unknown subclass, c,
    takes the form of labelled strings.
- Learner: learns by receiving information in the form of labelled strings drawn from T,
    following the rules set out in the environment constraints.
- c: the unknown regular language or FA

Two similar learning environments are shown in Figure 2.3, differing in the class contents. The
first environment (Figure 2.3(a)) consists of:
- C1: the class of all languages,
- H1: the hypothesis space, consisting of finite automata (FAs) as the finite representations for
    regular languages (i.e. a subclass of C1).
- T: the examples, which are labelled strings drawn from the languages.
- Criteria: the FA accepts all examples (i.e. strings, which may or may not be only positive
    strings) received from the training set, t.
- Goal: to produce an FA (i.e. the selected hypothesis) that is equivalent to (i.e. that
    accepts) c.

The other learning environment (Figure 2.3(b)) is obtained by refining C1 to the class
of regular languages only, with the hypothesis space, H2, being the minimum deterministic finite
automata (DFAs). The environment shown in Figure 2.3(b) has more constraints added,
as the teacher must provide descriptions using only regular languages, whereas with C1 the
teacher is able to provide descriptions using other languages as well (i.e. context-free languages).

This report is concerned with the learnability of finite automata (FA) using minimum DFAs as
the hypothesis space. Both environments, with C1 and C2 as classes, use the same set of
examples, T, which is a set of strings, and the training set, t, is a set of strings classified with
respect to a particular sub-class of languages, c. For consistency throughout the report, the
alphabet, A, for FAs will be set to the binary set {0,1}.
[Figure 2.3: diagram omitted - (a) the class C1 of all languages, containing the sub-class c' with c inside it, mapped to the hypothesis space H1 (the class of FAs); (b) the class C2 of regular languages (equivalent to minimum DFAs), containing the sub-class c, mapped to the hypothesis space H2 (the class of minimum DFAs)]

Figure 2.3: (a) c' is the sub-class of regular languages, c, and H1 is the class of FAs, with the
criteria for H1 being deterministic and minimal in size (number of states). (b) c is a particular
subset of the regular languages and H2 is the class of minimum DFAs itself, where no criteria
are needed.
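
As a concrete illustration of the kind of hypothesis object these environments expect, the sketch below shows a minimal DFA representation over the binary alphabet {0,1} together with an acceptance test. It is an illustrative assumption for this survey, not code from any of the surveyed papers; the example automaton is the two-state FA accepting all non-empty strings that reappears later in Figure 3.7(b).

```python
# A minimal sketch (an assumption, not from the report) of a hypothesis DFA
# over the binary alphabet {0, 1}.

class DFA:
    def __init__(self, states, alphabet, delta, start, accepting):
        self.states = states          # finite set of states Q
        self.alphabet = alphabet      # alphabet A, here {'0', '1'}
        self.delta = delta            # transition function as a dict: (state, symbol) -> state
        self.start = start            # initial state q0
        self.accepting = accepting    # set of accepting (final) states

    def accepts(self, string):
        """Run the DFA on a string and report whether it ends in an accepting state."""
        q = self.start
        for symbol in string:
            q = self.delta[(q, symbol)]
        return q in self.accepting


# Example: the two-state FA of Figure 3.7(b), which accepts every non-empty string.
m = DFA(
    states={'q0', 'q1'},
    alphabet={'0', '1'},
    delta={('q0', '0'): 'q1', ('q0', '1'): 'q1',
           ('q1', '0'): 'q1', ('q1', '1'): 'q1'},
    start='q0',
    accepting={'q1'},
)
assert m.accepts('01') and not m.accepts('')
```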




2.4 Learning Framework
Given an environment with a class of objects which describes 'what is to be learnt', the two
phases, Phase 1 and Phase 2, of the learning process raise two fundamental questions:
- 'how do we learn?'
- 'when do we know we have learnt?'
The former is dealt with in Phase 1; the latter, dealt with in Phase 2, was studied by Gold [Gold
67] and Valiant [Valiant 84], resulting in two major learning frameworks being proposed:
identification in the limit by Gold and probably approximately correct (PAC) learning by
Valiant.



2.4.1 Identification in the limit
[Gold 67] states that learning should be a continuous process, with the learner (or learning
algorithm), L, having the possibility of changing or refining his guess (i.e. hypothesis) each
time new information from the training set, t, is presented. The learner, L, is only required,
after a finite time, to make guesses that are all the same and correct with respect to the
information seen so far. Hence, the hypothesis, h, obtained after a finite time will remain the
same and correct on subsequent information. The hypothesis, h, is then said to represent the
unknown sub-class, c, described by t in the limit, completing Phase 2 of the learning process.
This learning framework, identification in the limit, consists of three items as formulated by Gold:

        1.   A class of objects
        A class, C, is specified (or given) to the learner in the environment, where the form of
        communication between the teacher and learner is also specified. An object, c, from
        C will be chosen for the learner to identify.
        [In the context of this report, the unknown object (or sub-class), c, is an FA and the
        class C consists of FAs.]

        2.   A method of information presentation
        Information about the unknown chosen object is presented to the learner. The training
        set, t, consists of positive-only examples, positive and negative examples, or noisy
        examples as information describing c.
        [t is just a set of labelled strings drawn from the example space, T, provided by the
        teacher, and the type of t depends on T - all positive strings, all negative strings or a
        combination of both.]

        3.  A naming relation
        This enables the learner to identify the unknown object, c, by specifying its
        name1, h. There is a function, f, for L that maps names to the objects in C. Here, an
        object, c, can have several names (hypotheses), and guesses (or hypotheses) are
        made under f.
        [L is to build an FA as the hypothesis, h, for an unknown regular language, and h
        could be any of the several DFAs (or TMs) that accept the unknown regular
        language.]
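
Gold's convergence requirement can be restated compactly in the report's notation (a paraphrase rather than a formula taken from [Gold 67]): if h_n is the guess produced after the n-th piece of information from t, then

```latex
\exists N \;\; \forall n \ge N : \quad h_n = h_N \quad \text{and} \quad h_N \equiv_R c
```

i.e. after some finite point the guesses no longer change, and the settled guess is a correct name for c.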




1
  In [Gold 67], name is defined as a Turing Machine (TM). Since the language identified by FA is also
identifiable by a TM, it is sufficient to say that every FA has a TM.


2.4.2 PAC view
Another learning framework is Probably Approximately Correct (PAC) learning. It was
first proposed by Valiant [Valiant 84] and uses a stochastic setting in the learning process.
The learner is required to build an (approximately correct) hypothesis that has a small error
probability after being trained using the training set, t, constituting Phase 1 of the learning
process. Phase 2 under this framework requires the learner to have a high level of confidence
that the hypothesis, h, is approximately correct as a representation of the sub-class, c. The
training set, t, is considered 'good enough' with a high confidence level. This is appropriate
because t generally does not consist of all the positive examples needed to learn c.

         The PAC framework relies on two parameters, accuracy (ε) and confidence limit (δ).
A fixed but unknown distribution is assumed over the class of examples, T, from which training
sets, t, are drawn at random. Intuitively, PAC learning seems like a passive type of learning, with
the learner learning only through observation of the given data or information. However,
[Angluin 88] and [Natarajan 91] showed that PAC learning can also be carried out by an active
learner using queries - equivalence, membership, subset, superset, exhaustiveness and disjointness
queries [Angluin 87].

         Given a real number, δ, between 0 and 1 and a real number, ε, also between 0 and 1, there
is a minimum sample size (i.e. the size of the training set, t) such that for any unknown subclass,
c, and any fixed but unknown distribution on the example space, T:

        with probability at least (1 - δ), the hypothesis h misclassifies at most a fraction ε of a
        test set, where the test set is another subset of T, different from t, drawn from the same
        distribution and used to test the validity of h.
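
In the usual modern notation this requirement reads as follows (a standard restatement, not a formula quoted from [Valiant 84]):

```latex
\Pr\!\big[\,\mathrm{error}_D(h) \le \varepsilon\,\big] \;\ge\; 1-\delta,
\qquad
\mathrm{error}_D(h) \;=\; \Pr_{x \sim D}\big[\, h(x) \ne c(x) \,\big],
```

where D is the fixed but unknown distribution on T and the outer probability is taken over the random draw of the training set t.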

        PAC learning is desirable as a good approximation to c, since in most cases it is
computationally difficult to build an accurate (exact) hypothesis, and [Angluin 88] and
[Natarajan 91] have shown that PAC learning can easily be applied to other, non-
stochastic learning frameworks.


2.4.3 Comparison
Both frameworks have distinct criteria and goals for learning, which deal with Phase 2 of the
learning process (Table 2.1). However, they both suggest learning by building tentative
hypotheses from pieces of information in the form of strings from the training set, t (Figure
2.4). Each tentative hypothesis is a new 'experience' (i.e. a modified hypothesis with
slight changes, or a totally new hypothesis) as new information is received from t. The final
hypothesis, h', taken to represent the sub-class, c, may be totally different from the previous
hypotheses.

 Learning framework    Identification in the limit           PAC learning
 Goal                  The same hypothesis (or guess)        P(error of h with respect to c ≤ ε) ≥ 1 - δ
                       after a finite time, for all          on a sufficiently large sample, t, where
                       subsequent information received.      P is the probability function, h the
                                                             hypothesis, c the unknown sub-class, and
                                                             δ and ε the required parameters.
 Criteria              Each hypothesis (guess) made must     The hypothesis, h, has error at most ε
                       be consistent (correct) with the      with respect to T.
                       information seen so far.
Table 2.1: Comparison between the identification in the limit and PAC learning frameworks.




[Figure 2.4: diagram omitted - the teacher draws a sequence t = {t1, t2, t3, ...} from the example space T to describe c, and the learner L produces a sequence of tentative hypotheses h1, h2, h3, ... in H]

Figure 2.4: A learning scenario with the learning algorithm, L, making several tentative
hypotheses (i.e. h1, h2, h3) in H from a sequence of labelled examples (i.e. t1, t2, t3).


        Recent studies [Kearns et al 94; Rivest et al 88; Porat 91] are carried out under Gold's
proposed learning framework, as it is more natural to human learning. We can always change
our perception (hypothesis) each time new information is received while remaining consistent
with the previous information. We never know (or can predict) when we have finished learning
(which is a perpetual process in humans).



2.4.4 Other variations of learning framework
There are two other learning frameworks mentioned by Gold in [Gold 67]:
        1. Finite Identification
        The learner stops the presentation of information after a finite number of examples
        and identifies the sub-class, c. The learner is to know when he has acquired a sufficient
        number of examples and is therefore able to identify c.

        2.  Fixed-time Identification
        A fixed finite time2 is specified a priori (i.e. usually as background knowledge) and
        independently of the unknown object presented, at which point the learner stops learning
        and identifies the unknown object.

These two frameworks seem to ask too much of the learner, who is 'forced' to
identify the sub-class, c, by outputting a hypothesis, h, after some predicted factor or
condition is reached. In finite identification, the learner must be able to predict the number of
examples needed to learn and to stop learning once the predicted number of examples has been
presented. On the other hand, fixed-time identification requires the learner to know in
some way 'when' he is able to stop learning.

Learning, as mentioned earlier, is to identify or distinguish the ambiguous lines separating
the sub-classes in a learning environment. Being able to tell exactly when to stop learning
(i.e. being able to predict those lines) suggests that there was no need for learning in the
first place.




2
  Time is taken throughout the report, to correspond with the computational complexity and the
termination of a successful learning algorithm.


2.5 Results on learning finite automata
The complexity and learnability of finite automaton identification have received extensive
study [Gold 67; Angluin 87; Vazirani et al 88]. The computational complexity is considered
here with respect to the size of the hypothesis space (minimum DFAs) searched and the size of
the training set (examples) required.

Other complexity results that have dealt with the computational efficiency are as follows:

         1.    Identification in the limit and the learnability model [Gold 67] - Gold classifies the
               classes of languages that are learnable in the limit into three categories of
               information presentation (Table 2.2). Learning from positive-only examples is
               shown to be possible only for classes of languages of finite cardinality.
         2.    Inferring a consistent DFA or NFA within a factor of (1 + 1/8) of the size of the
               minimum consistent DFA is NP-complete, given positive and negative examples
               [Li et al 87].
         3.    There is an efficient learning algorithm that finds the minimum DFA consistent with
               given positive and negative data when it also has access to membership and
               equivalence queries [Angluin 87], using an observation table as a representation of
               the FA.
         4.    Learning FA by experimentation3 (as in 3 above) [Kearns et al 94], using a
               classification tree as a representation of the FA, runs in polynomial time.
         5.    State characterisation and Data Matrix Agreement are introduced for the problem
               of automaton identification [Gold 78].
         6.    Inferring minimum DFAs and regular sets from positive and negative examples
               only is NP-complete [Gold 67, 78; Angluin 78].

              Learnability model                          Class of languages
              Anomalous Text                              - Recursively enumerable
                                                          - Recursive
              Informant                                   - Primitive recursive
              (using positive and negative                - Context sensitive
              examples/instances)                         - Context free
                                                          - Regular
                                                          - Superfinite
              Text                                        - Finite cardinality
              (using positive-only
              examples/instances)
         Table 2.2: Learnability and non-learnability of languages [Gold 67], where a
         superfinite class is the class of all finite languages together with (at least) one
         infinite regular language.

        These results show that inferring a DFA directly from examples alone is NP-hard, and so
other learning methods are employed to learn FA successfully. The methods used in
successfully learning FA are surveyed in the following chapters.




3
 Experimentation – a form of learning where learner is able to experiment with chosen strings (i.e.
selected by learner and not from training set provided) during training.




3. Non-Probabilistic Learning for FA

In building a hypothesis, h, for an unknown FA, c, the learning algorithm, L, usually
receives information (i.e. labelled strings) describing c from a training set, t. L is to build an h
that is equivalent to c using the information it has received so far. Ideally, h is to be exactly the
same as c. In practice, however, as c is unknown, the teacher may not have the complete
information required to build the exact FA, and h is then taken to be an approximation of c to
an extent specified in the background knowledge (i.e. an approximately equivalent, or probably
approximately correct, h rather than the usual exact h).

        Learning relies on L to make several guesses based on information provided by the
teacher in the following ‘ways’ to be discussed in this chapter:
        a) learning with queries, section 3.1
        b) learning without queries, section 3.2
        c) learning with homing sequences, section 3.3

        L is to make guesses about c through a number of tentative hypotheses (i.e. tentative
FAs), M', built from the information received. Each guess is a refinement or modification of the
previous guess (hypothesis) in which new properties of the FA (i.e. the characteristics and
elements of the FA) are discovered. A guess made by L is also called a conjecture. The learner
will produce several conjectures until the learning goal is achieved, that is, until a final
conjecture is accepted as the FA equivalent to c.

        All information received and properties learnt through the modifications are kept in a
data structure. The modification to the data structure is called an update and a new
hypothesis is built based on the updated data structure. Hence, the data structure has several
roles:

        a) a representation of properties (to be learnt) of an FA :
               • the finite number of states
               • transitions (representing the transition function)
               • the set of distinguishing strings
               • the accepting and rejecting states

        b) a record of modifications made (i.e. updates)
               • incorporating more information received: strings in t
               • updating more properties learnt

        c)   a reference to build next tentative FA, M’, after each update

The data structures used by the learner in this chapter are briefly explained below; a detailed
explanation of the updates is given in the relevant section indicated in brackets:

        1.   observation table (see section 3.1.1)
             A two-dimensional table, shown in Figure 3.5, where the rows correspond to the
             states and the columns correspond to the set of distinguishing strings for the FA.
             The entries in the table are values of '0' and '1', indicating whether the transition
             function of the FA leads to a rejecting or an accepting state respectively.




                   e1    e2    …        Distinguishing strings = {e1, e2, …}
          s1        1     0             States = {s1, s2, …}
          s2        0     0             Transition function, δ(q,x)
          s3        :     :                 = 0 (qx is a rejecting state)
          :                                 = 1 (qx is an accepting state)
          :                                 for some string x from state q

Figure 3.5: Observation table representing the elements of an FA: states (rows), distinguishing
strings (columns) and the transition function (table entries in the shaded section).


        2.   classification tree (see section 3.1.2)
             A binary classification tree where the leaves correspond to the states of the FA and
             the distinguishing strings are represented by the internal nodes (and root) of the
             tree, as shown in Figure 3.6. The left and right paths from an internal node
             correspond to transitions of the FA to a rejecting and an accepting state
             respectively.

[Figure 3.6: diagram omitted - a binary tree with root d1, internal nodes d2, d3, … and leaves s1, s2, …; States = {s1, s2, s3, …}, Distinguishing strings = {d1, d2, …}, transition function δ(q,x) = left path (qx is a rejecting state) or right path (qx is an accepting state) for some string x from state q]

Figure 3.6: Classification tree representing the elements of an FA: states (leaves), distinguishing
strings (internal nodes including the root) and the transition function (the right and left paths).
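
The structure just described can be sketched in a few lines. The sketch below is an illustrative reading of the tree, not code from [Kearns et al 94]; the class and function names, and the toy target language used at the end, are assumptions.

```python
# A sketch of the classification tree of Figure 3.6: leaves hold access strings
# (one per discovered state), internal nodes hold distinguishing strings, and
# the left/right child plays the rejecting/accepting role described above.
class Leaf:
    def __init__(self, access_string):
        self.access_string = access_string    # string reaching this state from q0

class Node:
    def __init__(self, distinguishing, left, right):
        self.distinguishing = distinguishing  # distinguishing string d
        self.left = left                      # subtree for "x.d rejected"
        self.right = right                    # subtree for "x.d accepted"

def sift(tree, x, membership):
    """Locate the leaf (state) for string x by asking membership(x + d) at each
    internal node, branching right on 'accepted' and left on 'rejected'."""
    node = tree
    while isinstance(node, Node):
        node = node.right if membership(x + node.distinguishing) else node.left
    return node

# Tiny example (target language assumed for illustration: strings containing a '1'):
tree = Node('', Leaf(''), Leaf('1'))          # root distinguishes on the empty string
member = lambda s: '1' in s                   # stands in for a membership oracle
assert sift(tree, '00', member).access_string == ''
assert sift(tree, '10', member).access_string == '1'
```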

        3.   minword(q) (see section 3.2.1)
             A string used to reach a state q in an FA from the initial state q0. Thus, the set of
             minword(q) strings corresponds to the states of an FA, as shown in Figure 3.7.

[Figure 3.7: diagrams omitted - (a) a four-state FA with states q0, q1, q2, q3 and minword(q0) = λ, minword(q1) = 0, minword(q2) = 1, minword(q3) = 01; (b) a two-state FA with states q0, q1 and minword(q0) = λ, minword(q1) = 0]

Figure 3.7: The sets of minword(q) for two FAs. (a) Four minword(q) strings representing the
states of the FA that accepts all strings with an even number of 0's and 1's. (b) Two minword(q)
strings representing the states of the FA that accepts all non-empty strings.
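
A small sketch may help make the minword idea concrete. Assuming the FA is already known and represented as a transition dictionary (an assumption made purely for illustration; how L3 actually maintains these strings during learning is described in section 3.2.1), the minword of every state can be found by a breadth-first search from q0 that tries '0' before '1':

```python
# Compute minword(q) for every state of a known DFA by breadth-first search;
# trying '0' before '1' makes the first string reaching each state the
# length-lexicographically smallest one, matching Figure 3.7.
from collections import deque

def minwords(delta, q0, alphabet=('0', '1')):
    found = {q0: ''}                 # minword(q0) = the empty string (lambda)
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        for a in alphabet:           # '0' before '1' keeps the order canonical
            nxt = delta[(q, a)]
            if nxt not in found:
                found[nxt] = found[q] + a
                queue.append(nxt)
    return found

# The two-state FA of Figure 3.7(b):
delta = {('q0', '0'): 'q1', ('q0', '1'): 'q1',
         ('q1', '0'): 'q1', ('q1', '1'): 'q1'}
assert minwords(delta, 'q0') == {'q0': '', 'q1': '0'}
```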




3.1 Learning with queries

Additional information regarding the unknown c can be requested by L by asking queries
[Angluin 88]. The queries can be equivalence, membership, subset, superset, disjointness and
exhaustiveness queries. Two of the six queries are used by the following two algorithms, L1 and
L2 (see sections 3.1.1 and 3.1.2), in learning c; a sketch of the teacher-side interface they assume
is given after this list:

        1.   Membership queries
             The teacher returns a yes/no answer when the learner presents an input string, x,
             of its choice in the query, depending upon whether x is accepted by the unknown
             FA, c.

        2.   Equivalence queries
             The teacher returns a yes answer if the conjecture, M', is equivalent to c; otherwise
             it returns a counterexample, y, which is a string in the symmetric difference of M'
             and c.
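
The sketch below illustrates this pair of queries from the teacher's side. It is an illustrative assumption for this survey (the class and method names are invented, and the target is any object exposing an accepts() method, such as the DFA sketch in Section 2.3); in particular, a real equivalence oracle answers exactly, whereas this sketch only searches strings up to a fixed length for a counterexample.

```python
# A sketch of the teacher-side oracle interface assumed by L1 and L2; the names
# and the bounded counterexample search are assumptions, not from [Angluin 87]
# or [Kearns et al 94]. `target_dfa` is any object with accepts(string) and an
# `alphabet` attribute, e.g. the DFA sketch from Section 2.3.
from itertools import product

class Teacher:
    def __init__(self, target_dfa, max_len=8):
        self.target = target_dfa   # the unknown FA c, known only to the teacher
        self.max_len = max_len     # search depth for counterexamples (a simplification)

    def membership(self, x):
        """Membership query: is the string x accepted by the unknown FA c?"""
        return self.target.accepts(x)

    def equivalence(self, conjecture):
        """Equivalence query: (True, None) if the conjecture M' agrees with c on
        every string tried; otherwise (False, y) for a string y in the symmetric
        difference of M' and c."""
        for n in range(self.max_len + 1):
            for symbols in product(sorted(self.target.alphabet), repeat=n):
                y = ''.join(symbols)
                if conjecture.accepts(y) != self.target.accepts(y):
                    return False, y
        return True, None
```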

         Hence, L has access to some oracle (which could be the teacher or some operation
available in the environment), creating an active interaction between the learner and teacher
in the learning process (Figure 3.8). The two queries form a pair of oracles, where each oracle
is used in a separate stage of learning:
        a) Phase 1 of learning: updating the data structure used to construct the conjecture,
           M'
        b) Phase 2 of learning: confirming M' as a finite representation of c (i.e. deciding when
           to stop learning)


[Figure 3.8: diagram omitted - the environment of Figure 2.2 extended with one or more oracles which the learner L can consult while building h from t]

Figure 3.8: Learning with additional information obtained through access to oracle(s) in the
environment.




3.1.1 L1: by Dana Angluin [Angluin 87]
The observation table (e.g. Figure 3.9) is the data structure used to store the information and
learnt properties about the unknown FA, c. The rows and columns are represented by strings
based on information from the training set t and on the set of distinguishing strings learnt,
respectively.

         Each row is viewed as a vector with attribute values ‘0’ and ‘1’ (i.e. the ‘0’ and
‘1’ table entries in that row, one per column) representing a state in c. Thus, the string
representing each row also represents a state in c. A string is said to represent a state q when it
can be used to reach q from the initial state q0. The vectors are used to distinguish the rows,
thus distinguishing the states in c.

        Alternatively, each row can be viewed as a set of distinguishing strings e, where each
e represents a column in the table. A table entry in a row is ‘1’ if the e of the corresponding
column is a distinguishing string for that row (the state represented) and ‘0’ otherwise.

         There may be rows with the same vector (i.e. with the same set of distinguishing
strings) and by the Myhill-Nerode Theorem of equivalence classes, these rows are said to be
equivalent to each other, that is, representing the same equivalence class x. Thus, we use the
alternative view of a row above in referring to the distinct states represented by these rows.
The distinct state, that is, the equivalence class x, is represented by the distinct row vector.

         From Figure 3.9, there are only two distinct rows, s1 and s2, with vectors ‘0’ and ‘1’
and strings λ and 0 respectively. The rest of the rows have the same vector ‘1’ as row s2. Thus,
there are only two distinct states, represented by the sets of strings {λ} and {0,1,00,01}. The
sets of distinguishing strings are φ and {λ} for the two distinct states respectively.

                 e1 = λ
   s1 = λ          0          Rows: s1…s5;  Columns: e1
   s2 = 0          1          training set t = {-λ, +0, +1, +00, +01}
   s3 = 1          1          distinguishing strings = {λ}
   s4 = 00         1          States: s1, s2
   s5 = 01         1

Figure 3.9: Observation table with five rows representing two distinct states, with strings from t.

         We now specify the three main elements of the observation table O, as shown in
Figure 3.10, used by the learner L1 during learning to represent properties and information of
c:

           1.   A non-empty prefix-closed* set of strings, S.
                This set starts with the null string, λ. The rows in the observation table are
                each represented by a string in S∪S.A.
                There are two distinct divisions of rows in O: the upper division (shown by the
                shaded rows in Figure 3.10) is represented by the strings in S and the lower
                division by the strings in S.A. Each row in the upper division represents the
                state reachable from the initial state q0 through some s∈S. The rows in the
                lower division of O therefore represent the next-states reached through
                transitions a∈A from rows in the upper division.
                Thus, S represents the states discovered (learnt) by the learner in the course of
                learning.

*
    A prefix-closed set is one where every prefix of each member is also an element of the set.

           2.   A non-empty suffix-closed** set of strings, E.
                This set also starts with the null string, λ. The columns in the observation table
                are represented by the strings in this set.
                The vector of each row picks out a subset of E (the columns with entry ‘1’). The
                distinct subsets of E represented by the distinct row vectors are thus used to
                identify the distinct states represented by the strings in S ∪ S.A. From Figure
                3.10, the subsets φ and {λ} of E (represented by the vectors ‘0’ and ‘1’
                respectively) identify the two distinct states represented by {λ} and {0,1,00,01}
                in S ∪ S.A.
                Thus, E represents the characteristics of the states, learnt through subsets of
                strings in E.

           3.   A mapping function, T: (S ∪ S.A).E → {0,1}, where
                T(x.e) = ‘1’ if the string x.e ∈ c and ‘0’ otherwise, with x ∈ S ∪ S.A.
                Thus, this mapping function records the behaviour of the FA’s transition
                function from the initial state, i.e. whether the state reached by δ(q0, x.e) is
                accepting.
                            E = {λ}
     upper (S):    s1 = λ       0        S = {λ, 0}
                   s2 = 0       1        E = {λ}
     lower (S.A):  s3 = 1       1        S∪S.A = {λ, 0, 1, 00, 01}
                   s4 = 00      1        table entries: T(x.e), where x∈ S ∪ S.A, e∈E
                   s5 = 01      1

Figure 3.10: Observation table O with upper division (shaded section) and lower division of
rows from the set S∪S.A.
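
         As an aside, not taken from [Angluin 87], the three elements can be rendered directly
as a small data structure. The following Python sketch assumes a membership oracle
member(word) -> bool supplied by the teacher; the class and method names are invented for
illustration only.

class ObservationTable:
    """Minimal sketch of the observation table (S, E, T)."""

    def __init__(self, member, alphabet=("0", "1")):
        self.member = member
        self.alphabet = alphabet
        self.S = {""}        # prefix-closed set of access strings (upper-division rows)
        self.E = {""}        # suffix-closed set of distinguishing strings (columns)
        self.T = {}          # mapping T: (S ∪ S.A).E -> {0, 1}
        self.fill()

    def rows(self):
        """All row labels: the upper division S plus the lower division S.A."""
        return self.S | {s + a for s in self.S for a in self.alphabet}

    def fill(self):
        """Complete any missing table entries with membership queries."""
        for x in self.rows():
            for e in self.E:
                if (x, e) not in self.T:
                    self.T[(x, e)] = 1 if self.member(x + e) else 0

    def row(self, x):
        """The row vector of string x, one attribute per column e in E."""
        return tuple(self.T[(x, e)] for e in sorted(self.E))

With the toy teacher sketched in section 3.1 for the FA accepting all non-empty strings,
ObservationTable(teacher.membership_query) gives row("") = (0,) and row("0") = (1,),
matching the two distinct vectors of Figure 3.10.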

         Each of the following two properties of O, closed and consistent, is used by L1 as a
guide to carry out updates (i.e. the extension of rows and columns) during learning:

           a) closed
              As the rows in the lower division of O are next-states of states in the upper
              division, reached on taking transitions on symbols in A, the row vectors in the
              lower division must also exist in the upper division; this is the closed property of O.
              Thus, for every string s’ in S.A there is an s in S such that both strings, s’ and s,
              have the same vector. As shown in Figure 3.10, every vector in the lower division
              of O already exists in the upper division, so all next-states are existing states.

           b) consistent
              Each pair of rows with the same vector (i.e. with the same subset of distinguishing
              strings) should represent the same state. The next-state vectors reached from such
              a pair should therefore also be the same vector, representing the same next-state;
              this is the consistent property of O.
              Thus, for any pair of strings s1, s2 in S with row(s1) = row(s2), we require
              row(s1.a) = row(s2.a) for all a in A. As shown in Figure 3.11, the rows represented
              by the strings λ and 11 are consistent: both represent the same distinct state,
              represented by the vector (1 0), and move into next-states represented by rows
              with the same vectors.

**
    A suffix-closed set is one where every suffix of each member is also an element of the set.

                                       λ        0
                          λ            1        0       previous state
                          0            0        1
                          1            0        0       next-state
                          11           1        0       previous state
                          00           1        0
                          01           0        0
                          10           0        0
                          110          0        1
                          111          0        0       next-state
Figure 3.11: A consistent observation table: the two rows λ and 11, represented by the same
row vector (1 0), have next-state rows with the same vector (0 0), representing the same
next-state reached from both rows in the upper division (shaded region).
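
         The two tests can be written down directly against the ObservationTable sketch given
earlier (again an illustration only; the function names and the returned witnesses are
assumptions made for the sketch).

def is_closed(table):
    """Closed: every lower-division row vector already appears in the upper division."""
    upper = {table.row(s) for s in table.S}
    for s in table.S:
        for a in table.alphabet:
            if table.row(s + a) not in upper:
                return False, s + a                     # witness: an S.A string with a new vector
    return True, None

def is_consistent(table):
    """Consistent: rows of S with equal vectors agree on every one-symbol extension."""
    for s1 in table.S:
        for s2 in table.S:
            if s1 != s2 and table.row(s1) == table.row(s2):
                for a in table.alphabet:
                    if table.row(s1 + a) != table.row(s2 + a):
                        for e in sorted(table.E):       # column where the extensions differ
                            if table.T[(s1 + a, e)] != table.T[(s2 + a, e)]:
                                return False, a + e     # witness: the new column a.e to add
    return True, None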

         The observation table O is updated by extending the rows and columns (discovering
more states and the characteristics of each state) using membership queries and equivalence
queries, as shown in Figure 3.12 and Figure 3.13. An update is carried out in two
circumstances:

   (a)  T1     λ
        λ      0          S ∪ S.A = {λ, 0, 1}
        0      1          T(λ.λ) = 0,  T(0.λ) = 1,  T(1.λ) = 1
        1      1          the vector (1) of the lower-division rows is not in the upper division

        make closed: S ∪ {0}

   (b)  T2     λ
        λ      0          S = {λ, 0}
        0      1          E = {λ}
        1      1          S ∪ S.A = {λ, 0, 1, 00, 01}
        00     1          newly added row 0, with vector (1), now in the upper division
        01     1

Figure 3.12: (a) Observation table T1, not closed because of the vector (1). (b) A closed T2,
obtained by extending T1 with a new upper-division row representing the newly discovered
state.

        a) when either one of the closed and consistent properties of O does not hold:

                •      O is not closed when a vector in the lower division is not represented in
                       the upper division. A new state is said to be discovered (learnt), as the
                       corresponding next-state is not an existing state. In Figure 3.12(a), the
                       lower-division rows have the vector (1), which is not represented in the
                       upper division, indicating that the next-state is not an existing state.
                       Then O is updated by
                            S ∪ {s’} where s’ ∈ S.A
                       Thus, Figure 3.12(b) shows the updated O with a new string (row) in S and
                       a new row in the upper division representing the newly learnt state.


                       [Adding s’ to S still maintains the prefix-closed property of the set, as s’
                       is an element of S appended with an input letter from the alphabet.]

             Note: Membership queries are used to complete the table entries whenever E or S
             is extended. The queries are made on strings in the (S ∪ S.A).E where a yes
             answer from the teacher means a ‘1’ entry in O and vice versa.

                  •    O is inconsistent when two rows with the same vector have a pair of
                       different next-state vectors. This indicates that one of the pair of strings s,
                       s’ in S actually represents a different (newly discovered) state not among
                       the existing states (rows). In Figure 3.13(a), a pair of rows with the same
                       vector lead to different next-states on the transition ‘1’ in O1.
                       Then O is updated by
                            E ∪ {a.e} where a is the transition symbol which brought the two
                                 states to different next-states and e is the element of E at which
                                 the next-state vectors differ (i.e. at one of the attributes).
                       Thus, Figure 3.13(b) shows the updated O2 with an extra column
                       represented by the string ‘1’, the transition symbol which brought the pair
                       of rows to different rows; the element e of the previous E at which the
                       difference is seen is λ. All the table entries of this additional column are
                       filled in using membership queries on the new (S ∪ S.A).E.
                       [The suffix-closed property of E is also maintained when ‘a.e’ is added,
                       since its suffix e was already an element of the set.]

                O1           λ       0                        O2        λ      0      1
                λ            1       0                        λ         1      0      0
                0            0       1    same:               0         0      1      0
                01           0       0     current state      01        0      0      0
                010          0       0     current state      010       0      0      1      new row
                1            0       0                        1         0      0      1      (new vector)
                00           1       0    different:          00        1      0      0
                011          0       1     next-state         011       0      1      0
                0100         0       0                        0100      0      0      0
                0101         1       0     next-state         0101      1      0      0

   The e column at which the rows differ           Make consistent:
   (shaded entries ‘0’ and ‘1’) is λ               E ∪ {1.λ}

Figure 3.13: (a) O1 is inconsistent: a pair of rows with the same vector, representing the same
state, have different next-state vectors. (b) The updated O2, with the newly learnt state
represented by a new row with a new vector in the upper division of the new table.

        b)   when a counterexample y is returned from an equivalence query:
        S is extended to include all the prefixes of y. Thus, the upper division of the table is
        extended with new strings, and membership queries are used to fill in all new entries.
        (A sketch of these update operations is given below.)
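
         Putting the update rules together, the inner loop of L1 can be sketched as follows,
continuing the earlier ObservationTable, is_closed and is_consistent sketches; the helper
names are assumptions, not Angluin’s notation.

def prefixes(w):
    return [w[:i] for i in range(len(w) + 1)]       # includes λ and w itself

def make_closed_and_consistent(table):
    """Repeat the two repairs until the table is both closed and consistent."""
    while True:
        ok, s_prime = is_closed(table)
        if not ok:
            table.S.add(s_prime)                    # new state discovered: promote the row
            table.fill()
            continue
        ok, new_e = is_consistent(table)
        if not ok:
            table.E.add(new_e)                      # new distinguishing string a.e
            table.fill()
            continue
        return

def add_counterexample(table, y):
    """Counterexample from an equivalence query: add all prefixes of y to S."""
    table.S.update(prefixes(y))
    table.fill()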

        We now have the questions of “when to build a tentative M’ using the data
structure?” and “how is a tentative M’ built from the data structure?”. The tentative M’ in
Figure 3.14(a) is built only when the observation table O has both the properties of closed and
consistent, that is, when all pairs of upper rows with the same vector lead to rows with the
same vector and all vectors in the lower division are represented in the upper division.




         The latter question is answered with a closed and consistent O. Such an O is used to
build a tentative deterministic FA (DFA), M’, with each distinct vector (i.e. each distinct row)
in the upper division representing a state in M’. M’ is then completed by adding transitions on
all symbols in A from every state. The next-state is determined by looking up the row
represented by the string s.a (i.e. the string resulting from taking a transition a from row s) and
reading off the corresponding vector.
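
         A sketch of this construction, again written against the earlier ObservationTable
sketch (illustrative only; the function names are assumptions):

def build_conjecture(table):
    """Build M' from a closed and consistent table: one state per distinct
    upper-division row vector, transitions read off the rows labelled s.a."""
    states = {table.row(s) for s in table.S}
    start = table.row("")                                               # λ is always the first row
    accept = {table.row(s) for s in table.S if table.T[(s, "")] == 1}   # λ-column entry is 1
    delta = {(table.row(s), a): table.row(s + a)
             for s in table.S for a in table.alphabet}
    return states, start, accept, delta

def conjecture_accepts(table, word):
    """Run M' on a word; useful when posing an equivalence query."""
    _, q, accept, delta = build_conjecture(table)
    for a in word:
        q = delta[(q, a)]
    return q in accept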

        From Figure 3.14(b), the conjecture M’ is built from the closed and consistent
observation table O in Figure 3.14(a). The states in M’ are the distinct vectors in the upper
division, each of which is shared among the strings representing rows of O. M’ is the
minimum DFA that accepts all non-empty strings, and an equivalence query on M’ returns a
yes answer.



        O         Col1 (= λ)
   s1 = λ         T(λ) = 0
   s2 = 0         1
   s3 = 1         1               E = {λ};  S = {λ, 0}
   s4 = 00        1
   s5 = 01        1
                     (a)

   [M’: two states, the distinct vectors (0) and (1); state (0) is represented by {λ} and state (1)
    by {0, 1, 00, 01}; the initial state (0) moves to the accepting state (1) on 0 and 1, which
    loops on 0 and 1.]
                     (b)

Figure 3.14: (a) The closed and consistent observation table, the final O, for the unknown FA,
c, recognising the set of all non-empty strings. The rows are elements of S ∪ S.A and the
columns are elements of E. (b) The conjecture, M’, constructed using the closed and
consistent O. The final state is the row having vector (1) (bold arrow); λ is always the initial
state, being the first row in the table, and is a non-accepting state in this case. The next-state
transitions are found from the strings {0, 1, 00, 01}.


        Each conjecture, M’, is then presented to the teacher in the form of an equivalence
query. At this point, if the guess is correct, no counterexample is returned and M’ is the
minimum DFA equivalent to c, as in Figure 3.14(b) where an equivalence query on M’
returns a yes. Thus L1 stops learning and outputs M’ as its hypothesis. The conjecture M’ is a
minimum DFA representing the unknown FA.

         However, if a counterexample is returned, an update is carried out on the observation
table (i.e. all prefixes of the counterexample are added to S), followed by further updates if the
extended table is not closed and/or not consistent. The next conjecture is built once both
properties are satisfied. Membership queries are used to fill in the new entries for the rows
obtained from the extended S; the counterexample and its prefixes are the learner’s choice of
strings when presenting these membership queries.

        This minimality of the number of states in the conjectured DFAs is maintained by the
closed and consistent properties of the observation table. Through the closed and consistent
tests on every updated table, two rows that have the same vector are considered as belonging
to the same equivalence class by the Myhill-Nerode Theorem (i.e. a class x with the same
behaviour for a set of distinguishing strings). Thus, building a conjecture only when a closed
and consistent observation table is obtained after every update, and taking only the distinct
vectors as representing the distinct states of the DFA being built, always results in a minimum
DFA.

         “How to start learning?” This question brings us to the important role of the null
string λ, with which both S and E start as their first element. This string not only leads to
discovering the initial state q0 (being the first row in the table) but also serves as the
distinguishing string used to decide which of the distinct vectors are accepting or rejecting
states. Being the first element of E, it allows every string labelling a row to be queried by the
learner, through membership queries, as to whether it is accepted or rejected by c. Thus, a row
which has λ in its set of distinguishing strings, indicated by a ‘1’ entry in the λ column, must
represent an accepting state, as the string represented by that row is accepted.

         From Figure 3.14(a), the vector (1) for the row with string ‘0’ represents an accepting
state, as ‘0’ is accepted at the column represented by λ, which is therefore in the set of
distinguishing strings for row ‘0’, indicated by the ‘1’ entry. The learning process thus starts
with S and E each having only one element (i.e. the null string) and with the initial table
having only one column and three rows (one row for λ in the upper division and two
next-state rows in the lower division).

        Another illustration is shown in Figure 3.15, with the learner trying to learn the FA
that accepts all strings with an even number of 0’s and 1’s. The initial table is constructed as
O0, which is not closed. L1 updates the table until an equivalence query terminates learning
with a yes answer for the conjecture M1, after five observation tables and two conjectures.

         The examples required by the learner are obtained through membership queries and
counterexamples, both drawn from the training set t of positive and negative examples (i.e.
the accepted and rejected strings recorded as ‘1’ and ‘0’ entries).




   O0       λ                           O1       λ
   λ        1                           λ        1        S = {λ, 0}
   0        0        make closed:       0        0        E = {λ}
   1        0        S ∪ {0}            1        0
                                        00       1        M0 : Equivalence query → no (y = 010)
   S = {λ}                              01       0
   E = {λ}
                                        [M0: two states λ (accepting, initial) and 0; λ moves to 0
                                         on 0 and 1, 0 returns to λ on 0 and loops on 1]

   O2       λ                           O3       λ    0                    O4       λ    0    1
   λ        1                           λ        1    0                    λ        1    0    0
   0        0        make               0        0    1     make           0        0    1    0
   01       0        consistent:        01       0    0     consistent:    01       0    0    0
   010      0        E = {λ}∪{0.λ}      010      0    0     E ∪ {1.λ}      010      0    0    1
   1        0                           1        0    0                    1        0    0    1
   00       1                           00       1    0                    00       1    0    0
   011      0                           011      0    1                    011      0    1    0
   0100     0                           0100     0    0                    0100     0    0    0
   0101     1                           0101     1    0                    0101     1    0    0

   O4 is closed and consistent;  S = {λ, 0, 01, 010};  E = {λ, 0, 1}

   [M1: four states λ (initial, accepting), 0, 010 and 01; 0-transitions swap λ↔0 and 010↔01,
    1-transitions swap λ↔010 and 0↔01.  Equivalence query → yes]

Figure 3.15: Running example of learning the unknown FA that accepts the set of all strings
with an even number of 0’s and 1’s.




3.1.2 L2: by Kearns and Vazirani [Kearns et al 94]
This algorithm uses the same principles as L1 (i.e. membership and equivalence queries and
positive and negative examples), but the data structure used to construct the tentative FA,
M’, is a classification tree, as shown in Figure 3.16. The leaves of the classification tree
represent the states learnt (known) of c and the nodes represent the distinguishing strings
required to distinguish (discover) the states in c. Each node and leaf is represented by a
string, based on the information received from counterexamples and from membership
queries on chosen strings.

   [T1: the root d1 = λ has the leaf s1 = λ as its right child and the node d2 = 0 as its left
    child; d2 has the leaf s2 = 0 as its right child and the node d3 = 1 as its left child; d3 has
    the leaf s3 = 1 on the right and the leaf s4 = 01 on the left.]

   Nodes: d1, d2, d3;  Leaves (states): s1, s2, s3, s4;  Distinguishing strings = {λ, 0, 1};
   training set t = {+λ, -0, -1, -01}

Figure 3.16: Classification tree, T1, with 3 nodes representing 3 distinguishing strings and 4
leaves each representing an equivalence class.

         The Myhill-Nerode Theorem is also adopted by L2, that is, L2 maintains the set of
distinguishing strings that distinguishes between the equivalence classes of states, represented
as leaves in the tree. Each leaf can be viewed as an equivalence class x containing a set of
(representative) strings having the same behaviour (distinguishability) with respect to c and
the set of distinguishing strings. Each node is then seen as the distinguishing string separating
the children in its right and left subtrees (i.e. the leaves in those subtrees) into accepting and
rejecting strings respectively. In Figure 3.16, the node d3, represented by the string 1,
distinguishes between the leaves 1 and 01, in its right and left subtrees respectively, with
respect to the FA that accepts all strings with even 0’s and 1’s.

         The next-state that a string x reaches on a transition symbol a is determined by
traversing the tree with the string xa, starting from the root, until a leaf s is reached. At each
node d visited, the next path to take depends on whether the string xad is accepted or rejected
by c: the right path is taken if xad is accepted by the unknown FA and the left path otherwise.
The leaf s reached is the equivalence class to which xa belongs; thus, xa is said to represent
the state represented by s. Membership queries are used here to decide which path to take,
with xad being the string of the learner’s choice.
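
         As an illustration only (not the authors’ presentation), the traversal can be sketched as
follows; the nested-node layout and the function name sift are assumptions made for the
sketch, and member(word) -> bool is the membership oracle.

class Node:
    """Internal node: a distinguishing string d with a left (rejected) and a right
    (accepted) subtree.  A leaf is represented simply by its access string."""
    def __init__(self, d, left, right):
        self.d, self.left, self.right = d, left, right

def sift(tree, member, x):
    """Return the leaf (access string) of the state that string x reaches in c."""
    node = tree
    while isinstance(node, Node):
        node = node.right if member(x + node.d) else node.left
    return node

# Tree T2 of Figure 3.17: root λ, left leaf λ (rejecting), right leaf 01 (accepting).
T2 = Node("", left="", right="01")
# For the FA accepting all non-empty strings, sift(T2, member, "01") returns "01".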

        As in Figure 3.16, the string 01, when traversed through T1, ends up in leaf s4: the
string 011 is rejected at node d3, so the left path is taken to leaf s4. However, in T2 of
Figure 3.17 it is accepted, as the traversal reaches the right leaf s1 from the root d1. Thus, the
string 01 is said to represent state s4 in T1 and state s1 in T2, with respect to the FAs being
learnt.

                                     Root        Nodes : d1
                            T2       (d1 = λ)    Leaves: s1, s2
                                                 States : s1, s2
                                                 Distinguishing strings = {λ}
                                                 training set t = {-λ, +01}
                        s2 = λ      s1 = 01

Figure 3.17: Classification tree, T2, with one node and two leaves represented by the strings
in t.




         The classification tree, T, maintains two main elements to represent the properties
learnt of c and the information received from the training set. The elements are specified as
follows:

        1.   a set of access strings, S
             The initial set contains only one string, the null string λ. The leaves of T are each
             represented by a string in S; the leaves thus represent the states of the unknown
             FA discovered so far. The leaves in the left subtree of the root are the strings in S
             that are rejected by c, and the leaves in the right subtree of the root are the strings
             that are accepted by c. Thus, S is subdivided into two subsets of rejecting and
             accepting states (i.e. leaves).

             From Figure 3.17, S is the set of strings representing the leaves, and the two
             subsets for T2 are {λ} and {01}, which are the negative and positive examples
             from t respectively.

        2.   a suffix-closed set of distinguishing strings, D
             The initial D starts with the null string, λ, and is used to distinguish each pair of
             access strings in S. The strings in D represent the nodes of T. Each node, d, has
             exactly two children, distinguishing pairs of strings in S such that the right
             subtree contains the strings s for which s.d is accepted by the unknown FA, and
             the left subtree those for which s.d is rejected.

             As in Figure 3.17, the root is the distinguishing string for the leaves in its right
             and left subtrees, where the strings λ and 01 representing the leaves also represent
             the rejecting and accepting states respectively.

          The classification tree is used to build every conjecture M’ except the initial
conjecture (i.e. the learner’s first guess), M0. There are only two possible initial conjectures to
choose from as M0, shown in Figure 3.18(a) and Figure 3.18(b), each having a single start
state with all transitions to itself. The choice depends on a membership query on the null
string λ: M0 either accepts or rejects the set of all strings depending on whether λ is accepted
by the unknown M, that is, the initial state is an accepting state if λ is accepted by M and a
rejecting state otherwise.
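
         A fragment sketching this very first step (illustrative only; the tuple encoding of the
DFA and the function name are assumptions):

def initial_conjecture(member, alphabet=("0", "1")):
    """M0 is a single-state DFA looping on every symbol; whether it accepts
    everything or nothing is decided by one membership query on λ."""
    accepts_lambda = member("")
    states = {""}                               # the lone state, labelled by λ
    accept = {""} if accepts_lambda else set()
    delta = {("", a): "" for a in alphabet}     # every transition returns to the start state
    return states, "", accept, delta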

   (a)  t = {+λ}:  [M0 accepts all strings: a single accepting start state, +λ, looping on 0 and 1.
                    The corresponding incomplete tree T0 has +λ as the right leaf of the root λ
                    and an as yet unknown leaf y on the left.]

   (b)  t = {-λ}:  [M0 accepts the empty set: a single rejecting start state, -λ, looping on 0 and 1.
                    The corresponding incomplete tree T0 has -λ as the left leaf of the root λ and
                    an as yet unknown leaf y on the right.]

Figure 3.18: (a) M0 accepting all strings, as t provides the positive example +λ. (b) M0
accepting the empty set, as t provides the negative example -λ. The corresponding tree T0 is
an incomplete tree, to be completed with the counterexample returned after an equivalence
query on M0.




         As in L1, for every conjecture M’ produced, an equivalence query on M’ is presented
by the learner, and L2 terminates if no counterexample is returned. The first guess from the
learner is thus whether all strings are accepted or rejected by the unknown M. The first
counterexample supplies the remaining unrepresented leaf y of the incomplete T0 for the
initial conjecture, as in Figure 3.18, where y is one of the leaves of each tree. Each subsequent
counterexample y returned is analysed using the divergence concept (see below) and the
current classification tree, T.

         From Figure 3.19(a), the unknown FA accepting all non-empty strings results in the
initial M0 being the DFA accepting the empty set. An equivalence query on M0 returns the
counterexample string +01. As λ is rejected at the root, the first classification tree, T0, shown
in Figure 3.19(b), has λ as its left child and the counterexample 01 as its right child.

   (a)  [M0: a single rejecting state λ looping on 0 and 1]
        S = {λ};  D = {λ};  Equivalence query → no (y = 01)

   (b)  [T0: root λ with the left leaf λ and the right leaf 01]
        S = {λ, 01};  D = {λ};  counterexample y = 01

   (c)  [M’: two states, λ (initial, rejecting) and 01 (accepting); λ moves to 01 on 0 and 1,
         and 01 loops on 0 and 1]
        S = {λ, 01};  D = {λ};  Equivalence query → yes

Figure 3.19: An unknown FA that accepts the set of all non-empty strings is being learnt; (a)
the initial conjecture, M0; (b) the classification tree, T0, with two leaves from S and a node
(root) from D; (c) the conjecture, M’, constructed using the classification tree in (b), with the
first leaf, λ, as the initial state and the second leaf, represented by the string 01, as the other
state in M’. As 01 is a leaf in the right subtree of the root, the final state is represented by the
leaf 01.




         In Figure 3.19(b), T0 is then used to build a tentative deterministic FA (DFA), M’
(Figure 3.19(c)), using the leaves λ and 01 to represent the states of M’; all states in M’ are
labelled with the leaves of T0. The next-state transitions for every state in M’ are found by
traversing T0 with the string representing each state appended with a transition symbol a; the
strings used for the next-state transitions are {0, 1, 010, 011}. The next equivalence query
returns a yes, which terminates learning, as in Figure 3.19(c).
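
         A sketch of this construction, using the sift function given earlier (illustrative only;
the function name and the tuple encoding of M’ are assumptions):

def build_from_tree(tree, member, leaves, alphabet=("0", "1")):
    """Build a conjecture M' whose states are the leaves (access strings) of the
    classification tree; each transition is found by sifting the state's access
    string extended with one input symbol."""
    accept = {s for s in leaves if member(s)}              # accepted access strings are final states
    delta = {(s, a): sift(tree, member, s + a)             # next state = leaf reached by s.a
             for s in leaves for a in alphabet}
    return set(leaves), "", accept, delta                  # λ is always the initial state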

         From every counterexample y, each prefix of y is analysed to determine the prefix yi
that leads to different states when it is tested on both the current T and the conjecture M’
which produced y in the equivalence query. The two tests result in a pair of states: a leaf of T
and a state of M’ respectively. Since M’ is built by taking all the leaves of T to represent the
states of the conjecture, the pair of states should point to the same state for any string if T and
M’ are equivalent. Thus, there must be a node and transition symbol (a path) that some yi
takes leading to the first differing pair of states. This is called the divergence point.

        Thus, a counterexample indicates that somewhere along the string y = y1…yn (n input
symbols), at one of its prefixes, ym, M’ and T diverge onto different paths leading to different
states, sM and sT, obtained from M’ and T respectively in the test. The divergence point is
ym-1, the immediate prefix of y before ym at which the two still agree.

The current tree, T, is used to trace the common ancestor of sM and sT, that is, the node d
that distinguishes the leaves represented by sM and sT. Both d and ym-1 are used to update
the classification tree. A sketch of this prefix scan is given below; Figure 3.20 then shows how
the divergence point is found from the counterexample y.
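
         The scan can be sketched as follows (illustrative only; it reuses the sift function above,
and conjecture_delta is assumed to be the transition mapping of M’ keyed by (access string,
symbol)):

def find_divergence(tree, member, conjecture_delta, y):
    """Scan the prefixes of counterexample y for the first point at which the
    conjecture M' and the tree T classify the prefix into different states."""
    state = ""                                            # both start at the initial state λ
    for m in range(1, len(y) + 1):
        state = conjecture_delta[(state, y[m - 1])]       # state of M' after y1...ym
        leaf = sift(tree, member, y[:m])                  # leaf of T reached by y1...ym
        if state != leaf:
            # divergence at ym: the divergence point is the prefix y1...ym-1,
            # a = ym is the symbol taken, and (sM, sT) = (state, leaf) feed the update
            return y[:m - 1], y[m - 1], state, leaf
    return None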


   [T’: root λ with the leaf λ as its right child and the node 0 as its left child; node 0 has the
    leaf 0 on the right and the leaf 01 on the left.]
   [M’: the conjecture built from T’, with states λ (initial, accepting), 0 and 01.]

   y = 11;  prefixes yi: {1, 11}, i = 1, 2
   y1 = 1:    sM = 01,  sT = 01
   y2 = 11:   sM = 01,  sT = λ
   Divergence point = y1;  common ancestor d = λ (the root)

Figure 3.20: The unknown FA accepts all strings with an even number of 0’s and 1’s. The
conjecture M’ is returned with the counterexample y = 11 from an equivalence query. Each
prefix of y is traversed in T’ and also in M’ to find the divergence point, y1 = 1, with the
divergence occurring at y2. The common ancestor of the leaves 01 and λ, to which y2
diverged, is the root λ of T’.



        As the learner learns more information and properties of c, the new information and
properties (regarding new states) are recorded in the tree by extending its nodes and leaves.
These updates are carried out using only the information from the equivalence query, that is,
the counterexample. The counterexample is analysed for its divergence point, and the results
of the divergence analysis are the common ancestor d and the prefix ym-1 representing the
divergence point. The tree is then updated using d and ym-1 as follows:

        a)    a new access string, ym-1 (i.e. a prefix of y), to add to S
              A new state, represented by ym-1, is discovered, and a new leaf is added to T as S
              is extended.
              [S is extended to include the prefix of the counterexample representing a newly
              discovered state of c, as shown in Figure 3.21 by the shaded leaf 1.]

        b)    a new distinguishing string, a.d, to add to D
              where     d : the common ancestor in T of the leaves sT and sM
                        a : the input letter that leads ym-1 to sT and sM (i.e. ym = ym-1.a)
                        sT and sM : the states reached in T and M’ on the prefix ym

              [D is extended when the counterexample is returned and a prefix of the
              counterexample (to be included in the extension of S) is identified as the point of
              divergence. D gains the new distinguishing string a.d, where d is the common
              ancestor node and ‘a’ is the input letter leading from the divergence point to the
              two different states reached. The new node is shaded in Figure 3.21.]

         The new extension involves sT, the leaf reached by ym-1 (at which T and M’ still
agree), being replaced by a new internal node labelled with the new string a.d. The leaves sT
and ym-1 become the children of the new node, and their positions depend on whether each,
concatenated with a.d, is accepted, which is determined through membership queries. The
suffix-closed property of D maintains reachability from other states to the final state(s) each
time a new state (leaf) is discovered (i.e. added to S). A sketch of this update is given below;
Figure 3.21 then shows how the tree T’ of Figure 3.20 is updated and used to build a new
conjecture M”.
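
         The update can be sketched against the Node and sift definitions given earlier (again
an illustration only; all helper names are assumptions):

def path_to_leaf(node, target, path=()):
    """Distinguishing strings on the path from the root down to a given leaf."""
    if not isinstance(node, Node):
        return path if node == target else None
    return (path_to_leaf(node.left, target, path + (node.d,))
            or path_to_leaf(node.right, target, path + (node.d,)))

def common_ancestor(tree, s_m, s_t):
    """Deepest node shared by the paths to the two leaves (their common ancestor d)."""
    d = None
    for a, b in zip(path_to_leaf(tree, s_m), path_to_leaf(tree, s_t)):
        if a != b:
            break
        d = a
    return d

def split_leaf(node, old_leaf, new_leaf, new_d, member):
    """Replace old_leaf by an internal node labelled new_d whose children are
    old_leaf and new_leaf, placed by a membership query on old_leaf + new_d
    (new_d distinguishes the two leaves, so the other leaf takes the other side)."""
    if not isinstance(node, Node):
        if node != old_leaf:
            return node
        if member(old_leaf + new_d):
            return Node(new_d, left=new_leaf, right=old_leaf)
        return Node(new_d, left=old_leaf, right=new_leaf)
    return Node(node.d,
                split_leaf(node.left, old_leaf, new_leaf, new_d, member),
                split_leaf(node.right, old_leaf, new_leaf, new_d, member))

# Putting it together for a counterexample y (helper names as above):
#   prefix, a, s_m, s_t = find_divergence(tree, member, delta, y)
#   d = common_ancestor(tree, s_m, s_t)
#   old_leaf = sift(tree, member, prefix)      # the leaf at which T and M' still agree
#   tree = split_leaf(tree, old_leaf, prefix, a + d, member)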

   [T”: root λ with the leaf λ as its right child and the node 0 as its left child; node 0 has the
    leaf 0 on the right and the new node 1 on the left; node 1 has the new leaf 1 on the right
    and the leaf 01 on the left (the new node and leaf are shaded).]

   [New conjecture M”: four states λ (initial, accepting), 0, 1 and 01; the 0-transitions swap
    λ↔0 and 1↔01, and the 1-transitions swap λ↔1 and 0↔01.]

Figure 3.21: The updated tree T”, obtained from the tree of Figure 3.20 by adding a new node
and a new leaf (shaded) once a divergence point was found in the previous counterexample.
The new conjecture M” is presented in an equivalence query and no counterexample is
returned.


         We illustrate another example of L2, learning the FA that accepts the set of strings
with an even number of 0’s and 1’s, in Figure 3.22 below. The prefix at which divergence
occurs at each step is denoted ym.

   Conjecture                              Equivalence query        Classification tree
   M0  [one state λ, accepting,            no,  y = 01              T0:  S = {λ, 01},  D = {λ}
        looping on 0 and 1]                                              [root λ; left leaf 01,
                                                                          right leaf λ]
   M1  [states λ and 01; λ moves to 01     no,  y = 00,  ym = 00    T1:  S = {λ, 01, 0},  D = {λ, 0}
        on 0 and 1, 01 loops on 0 and 1]                                 [root λ; right leaf λ; left
                                                                          node 0 with right leaf 0
                                                                          and left leaf 01]
   M2  [states λ, 0 and 01, built          no,  y = 11,  ym = 11    T2:  S = {λ, 01, 0, 1},
        from T1]                                                         D = {λ, 0, 1}
                                                                         [root λ; right leaf λ; left
                                                                          node 0 with right leaf 0
                                                                          and left node 1, whose
                                                                          right leaf is 1 and left
                                                                          leaf is 01]
   M3  [the four-state DFA for an even     yes
        number of 0’s and 1’s]

Figure 3.22: Running example of learning the unknown FA that accepts the set of strings with
an even number of 0’s and 1’s.

         Therefore, the access strings in S are prefixes of counterexamples, and the size of S
grows with the number of counterexamples returned (i.e. with the number of failed
equivalence queries). L2 maintains an S in which each string represents a distinct state of the
minimum DFA equivalent to M, so the size of S is at most the number of states of that
minimum DFA at any point during the learning process. Hence, each counterexample
produces a new access string, which immediately leads to a new conjecture containing a
newly discovered state.
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc
Figure 2.2.doc.doc

More Related Content

What's hot

LinkingCorpStrategy _ 31_12_2014_Final
LinkingCorpStrategy _ 31_12_2014_FinalLinkingCorpStrategy _ 31_12_2014_Final
LinkingCorpStrategy _ 31_12_2014_Finalluluqa
 
BBA Dissertation- TURN IT IN
BBA Dissertation- TURN IT INBBA Dissertation- TURN IT IN
BBA Dissertation- TURN IT INRashmi Rajpal
 
Market research report
Market research reportMarket research report
Market research reportSunam Pal
 
Ess grade 11 summative october 2012
Ess grade 11 summative october 2012Ess grade 11 summative october 2012
Ess grade 11 summative october 2012GURU CHARAN KUMAR
 
Computer systems servicing cbc
Computer systems servicing cbcComputer systems servicing cbc
Computer systems servicing cbcHanzel Metrio
 
Nofri anten
Nofri antenNofri anten
Nofri antendiwangsa
 

What's hot (7)

LinkingCorpStrategy _ 31_12_2014_Final
LinkingCorpStrategy _ 31_12_2014_FinalLinkingCorpStrategy _ 31_12_2014_Final
LinkingCorpStrategy _ 31_12_2014_Final
 
BBA Dissertation- TURN IT IN
BBA Dissertation- TURN IT INBBA Dissertation- TURN IT IN
BBA Dissertation- TURN IT IN
 
Market research report
Market research reportMarket research report
Market research report
 
Ess grade 11 summative october 2012
Ess grade 11 summative october 2012Ess grade 11 summative october 2012
Ess grade 11 summative october 2012
 
It project development fundamentals
It project development fundamentalsIt project development fundamentals
It project development fundamentals
 
Computer systems servicing cbc
Computer systems servicing cbcComputer systems servicing cbc
Computer systems servicing cbc
 
Nofri anten
Nofri antenNofri anten
Nofri anten
 

Similar to Figure 2.2.doc.doc

Smart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaSmart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaHendrik Drachsler
 
A study on recommendation intention among Malaysian private universities’ und...
A study on recommendation intention among Malaysian private universities’ und...A study on recommendation intention among Malaysian private universities’ und...
A study on recommendation intention among Malaysian private universities’ und...Rahman Karimiyazdi
 
Exploring high impact scholarship
Exploring high impact scholarshipExploring high impact scholarship
Exploring high impact scholarshipEdaham Ismail
 
Cover+ D A F T A R I S I
Cover+ D A F T A R  I S ICover+ D A F T A R  I S I
Cover+ D A F T A R I S IDiana Wijayanti
 
Interdisciplinarity report draft v0 8 21th apr 2010
Interdisciplinarity report draft v0 8 21th apr 2010Interdisciplinarity report draft v0 8 21th apr 2010
Interdisciplinarity report draft v0 8 21th apr 2010grainne
 
MANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECTMANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECTERICK MAINA
 
The role of transnational ethnic on socio economic integration in the horn of...
The role of transnational ethnic on socio economic integration in the horn of...The role of transnational ethnic on socio economic integration in the horn of...
The role of transnational ethnic on socio economic integration in the horn of...Mohamed Aden Farah
 
Online Education and Learning Management Systems
Online Education and Learning Management SystemsOnline Education and Learning Management Systems
Online Education and Learning Management SystemsMorten Flate Paulsen
 
Teacher and student perceptions of online
Teacher and student perceptions of onlineTeacher and student perceptions of online
Teacher and student perceptions of onlinewaqasfarooq33
 
PhD Thesis_Digital Media Advertising Attribution
PhD Thesis_Digital Media Advertising AttributionPhD Thesis_Digital Media Advertising Attribution
PhD Thesis_Digital Media Advertising AttributionYunkun Zhao, PhD
 
Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...
Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...
Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...Dr. Mustafa Değerli
 
Big data performance management thesis
Big data performance management thesisBig data performance management thesis
Big data performance management thesisAhmad Muammar
 

Similar to Figure 2.2.doc.doc (20)

Smart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao ParganaSmart Speaker as Studying Assistant by Joao Pargana
Smart Speaker as Studying Assistant by Joao Pargana
 
A study on recommendation intention among Malaysian private universities’ und...
A study on recommendation intention among Malaysian private universities’ und...A study on recommendation intention among Malaysian private universities’ und...
A study on recommendation intention among Malaysian private universities’ und...
 
How does Project Risk Management Influence a Successful IPO Project.doc
How does Project Risk Management Influence a Successful IPO Project.docHow does Project Risk Management Influence a Successful IPO Project.doc
How does Project Risk Management Influence a Successful IPO Project.doc
 
Exploring high impact scholarship
Exploring high impact scholarshipExploring high impact scholarship
Exploring high impact scholarship
 
Cover+ D A F T A R I S I
Cover+ D A F T A R  I S ICover+ D A F T A R  I S I
Cover+ D A F T A R I S I
 
dissertaion-new 8722742
dissertaion-new 8722742dissertaion-new 8722742
dissertaion-new 8722742
 
Interdisciplinarity report draft v0 8 21th apr 2010
Interdisciplinarity report draft v0 8 21th apr 2010Interdisciplinarity report draft v0 8 21th apr 2010
Interdisciplinarity report draft v0 8 21th apr 2010
 
MANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECTMANAGEMENT RESEARCH PROJECT
MANAGEMENT RESEARCH PROJECT
 
The role of transnational ethnic on socio economic integration in the horn of...
The role of transnational ethnic on socio economic integration in the horn of...The role of transnational ethnic on socio economic integration in the horn of...
The role of transnational ethnic on socio economic integration in the horn of...
 
Online Education and Learning Management Systems
Online Education and Learning Management SystemsOnline Education and Learning Management Systems
Online Education and Learning Management Systems
 
MScDissertation
MScDissertationMScDissertation
MScDissertation
 
Teacher and student perceptions of online
Teacher and student perceptions of onlineTeacher and student perceptions of online
Teacher and student perceptions of online
 
Research handbook
Research handbookResearch handbook
Research handbook
 
AC_2014_V6
AC_2014_V6AC_2014_V6
AC_2014_V6
 
NGO
NGONGO
NGO
 
Vekony & Korneliussen (2016)
Vekony & Korneliussen (2016)Vekony & Korneliussen (2016)
Vekony & Korneliussen (2016)
 
PhD Thesis_Digital Media Advertising Attribution
PhD Thesis_Digital Media Advertising AttributionPhD Thesis_Digital Media Advertising Attribution
PhD Thesis_Digital Media Advertising Attribution
 
Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...
Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...
Mustafa Degerli - 2010 - Dissertation Review - IS 720 Research Methods in Inf...
 
Big data performance management thesis
Big data performance management thesisBig data performance management thesis
Big data performance management thesis
 
Marketing & association banking fyp
Marketing & association banking  fypMarketing & association banking  fyp
Marketing & association banking fyp
 

More from butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Figure 2.2.doc.doc

1. Introduction

The class of finite state automata (FA) is studied from a machine learning perspective, which involves the learning issues and the properties of that particular class. This report is a survey of the learning methods studied and employed in learning FA. We give an overview of learning in general in Section 2.1 and the issues of learning in Section 2.2, with application to learning FA in Section 2.3. The two important frameworks employed extensively in machine learning, learning in the limit and PAC learning, are explained in Section 2.4. The complexity of learning FA itself has been studied and the results are given in Section 2.5.

The learning methods surveyed are divided into two main chapters in which various learning algorithms are studied and compared. The usual non-probabilistic learning is discussed in Chapter 3, with the motivation towards probabilistic learning in Section 3.4, before probabilistic learning is discussed in Chapter 4. A summary of the methods surveyed is given in each chapter, and the conclusion, with related work in machine learning, is in Chapter 5.

There are six algorithms discussed, and each is referred to as L1-L6 throughout this report, corresponding to the following algorithms:
• L1: [Angluin 87]
• L2: [Kearns et al 94]
• L3: [Porat et al 91]
• L4: [Rivest et al 87]
• L5: [Angluin 87; Natarajan 91]
• L6: [Stolcke et al 94]

We follow the standard definition of FA as studied in automata theory [Hopcroft et al 79; Trakhtenbrot 73] and give the following terminology and notation, used for any FA M (a minimal encoding of this notation in code is sketched after the list):
• set of states, Q: the finite set of states q in the FA
• initial state, q0: the start state for all input strings
• final state: the state reached after reading an input string
• accepting state: a final state which accepts a string, i.e. the string is recognised by M
• rejecting state: a final state which rejects a string, i.e. the string is not recognised by M
• transition, δ(q,a): the move from a state q on a symbol a from the alphabet
• alphabet, A: the finite set of symbols; the binary alphabet {0,1} is used throughout
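The notation above can be captured by a very small data structure. The following Python sketch is our own illustration (it is not part of the report's surveyed algorithms); state labels may be any hashable values, and the example automaton is the two-state FA for non-empty strings that reappears in Chapter 3.

```python
class DFA:
    """A minimal encoding of the FA notation introduced above."""

    def __init__(self, states, alphabet, delta, q0, accepting):
        self.states = set(states)        # Q: the finite set of states
        self.alphabet = set(alphabet)    # A: the finite alphabet, here {'0', '1'}
        self.delta = delta               # transition function: (state, symbol) -> state
        self.q0 = q0                     # initial state
        self.accepting = set(accepting)  # accepting (final) states

    def final_state(self, string):
        """State reached from q0 after reading the input string."""
        q = self.q0
        for a in string:
            q = self.delta[(q, a)]
        return q

    def accepts(self, string):
        """True iff the string is recognised by the automaton."""
        return self.final_state(string) in self.accepting


# Example: the two-state DFA that accepts all non-empty strings over {0,1}.
NONEMPTY = DFA(
    states={'q0', 'q1'},
    alphabet={'0', '1'},
    delta={('q0', '0'): 'q1', ('q0', '1'): 'q1',
           ('q1', '0'): 'q1', ('q1', '1'): 'q1'},
    q0='q0',
    accepting={'q1'},
)
```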
2. Learning

2.1 Learning in General

Learning in general means the ability to carry out a task with improvement from previous experience. It involves a teacher and a learner. The learning process usually takes place in an environment which constrains the communication between the learner and teacher (how the teacher is to teach or train the learner, and how the learner is to receive input from the teacher) and the elements or tokens of information that are communicated between them: a class of objects (i.e. a concept) and a description of a subclass (i.e. an object).

Example 1(a): Environment for learning a class of vehicles
The environment in which the learning process takes place involves a teacher giving descriptions of ships and a learner drawing a conclusion of what a ship looks like from the descriptions received. The teacher describes ships (i.e. the subclass of vehicles) by providing pictures (i.e. using pictorial means) of ships, and the learner responds (i.e. communicates with the teacher) through some form of visual mechanism (i.e. by detecting shapes or colours of objects in pictures) to analyse the pictures received from the teacher. This environment only allows the teacher and learner to communicate through pictures, whereas in another environment other forms of encoding of descriptions may be used (e.g. tables of attributes: width, length, windows, engine capacity, etc.).

[Figure 2.1: Finite representation of a possibly infinite class A of m elements cs, 1 ≤ s ≤ m, where m may be finite or infinite, by another finite class B with p elements ri, 1 ≤ i ≤ p, where p is some finite number.]

The learner is to learn an unknown subclass from the class with help (i.e. some form of training) from the teacher, who provides descriptions of the unknown subclass. Since the subclass to be learnt may be infinite in size, a finite representation is needed to represent the probable subclasses hypothesised during learning. The task of the learner is to hypothesise a (finite) representation of the unknown subclass, as shown in Figure 2.1, where a class A over a possibly infinite set of elements is represented by a finite class B. Thus learning the class A is to learn its class B representation. In Example 1(a) above, the learner is to produce a hypothesis of the ship subclass. A finite representation for the hypothesis of the unknown subclass chosen to be learnt is necessary, as not every subclass can be finitely presented or described (i.e. by presenting all of its elements to the learner) by the teacher, as shown in the class of vehicles above, where the subclass (i.e. ships) is infinite. Note that there are finitely many
different 'types' of ships, just as there are finitely many different 'types' of vehicles: in both cases the ships and vehicles themselves are infinite in number, but the 'types' are finite.

Instances from the class used to describe a particular subclass are called examples. These examples are usually classified by a teacher (i.e. a human, or some operation or program available in the environment) with respect to the particular subclass being learnt, as positive (a member of the subclass) or negative (a non-member of the subclass) examples. A set of classified or labelled examples is called the example space. Only a subset of the example space is used by the learner each time an unknown subclass is to be learnt. This subset, used in training the learner, is known as the training set. Each example space contains information (implicit properties or rules that may be infinite) relevant to distinguishing one subclass from another in the given class. The constraints in the environment also determine which type of examples (i.e. positive only, negative only or both) can be provided by the teacher to form the example space. For instance, it may not be possible to collect negative examples, and the teacher is then restricted to positive examples only, which may be only a partial set (i.e. not all members of the unknown subclass are known, even to the teacher).

The learner, or learning algorithm, is therefore required to learn the implicit properties or rules from the information given (built into what is called experience) about a particular subclass. The properties learnt are stored in the learner's hypothesis (i.e. the conclusion or explanation) drawn of the subclass. An infinite number of hypotheses, in any form of representation (i.e. decision tree, propositional logic expression, finite automaton, etc.), could be produced that hold the properties obtained from the information received. This results in searching a large hypothesis space. It should be noted that the hypothesis space could be expressed in the same descriptive language used to describe the unknown subclass: in Example 1(a), if the class of vehicles is represented in the form of propositional logic expressions, then the hypothesis may be the exact propositional logic expression that represents the unknown subclass chosen (i.e. ships), or some other form of representation that is equivalent to the propositional logic expressions used.

A set of criteria is necessary to limit (reduce) the size of the search space. Given a reduced hypothesis space that satisfies the set of criteria, the learning goal is then needed for selecting and justifying a hypothesis from the hypothesis space as the finite representation of the unknown subclass. Together with other knowledge about the rules for manipulating the descriptive language, the set of criteria and the learning goal form what is called background knowledge, which guides the learner in the learning process.

Example 1(b): Learning process of Example 1(a)
Suppose that the hypothesis for ships takes the form of a collection of a finite number of attributes of a ship (i.e. size, engine capacity, shape, weight, anchor and other properties of a ship). The criteria for the hypothesis space could include only hypotheses that fulfil, say, five out of six attributes used, and the learning goal could be to select a hypothesis that satisfies the criteria with the simplest data structure of some form and that can successfully identify, say, the subsequent ten examples correctly.
There could be an infinite number of attributes used, but the criteria in the background knowledge reduce the hypothesis space.
Thus, the learning scenario (Figure 2.2) consists of a given class, C, of subclasses and an example space, T, from which the training set, t, is drawn. Examples in T are used to describe an unknown subclass, c, in C. The aim of the learning algorithm, L, is to produce a hypothesis, h, from a hypothesis space, H, using information from t and satisfying the conditions set out in the background knowledge. L is to build an h that is equivalent to c. Ideally, h is to be exactly the same as c, or h is the exact representation of c. Due to the incompleteness of the t received (i.e. the teacher usually does not have complete information regarding c), h is usually taken to be equivalent to c to some extent expressed in the background knowledge. In both cases, learning relies on information contained in t and given by the teacher.

L: T → H, where T is the example space, the collection of training sets t for a subclass, c, in C.
L(t) = h (≡R c), with t ∈ T, h ∈ H, c ∈ C,
where ≡R is the equivalence relation specified by the learning goal, used in selecting the hypothesis, h, using t. The selected h contains learnt properties or rules of c that are obtained through the information from t.

[Figure 2.2: The learning scenario of a learner or learning algorithm, L, with a given environment. The environment contains the class C, the example space T from which the teacher draws t, and the hypothesis space H; the background knowledge comprises the set of criteria, the learning goal and the type of representation (the descriptive language for H).]
2.2 The Issues in Learning

The algorithms used in learning are 'ways' of achieving the learning goal under the set of criteria in the background knowledge. 'Ways' here are methods of constructing a hypothesis from the information in the set of examples. As shown in Figure 2.2, the learning algorithm, L, has two distinct phases in the learning process:

Phase 1: forming a hypothesis, h, from the set of examples, t. [shown as the arrow from T to H in Figure 2.2]
Phase 2: selecting and justifying h as a finite representation of the unknown subclass, c. [shown as the arrow from H to C in Figure 2.2]

The nature (or design) of L, and the feasibility of the learning problem itself, is determined by the following factors:

1. The example space, T
Usually considered arbitrary, in that various kinds of information (training sets) can be used to describe c.

2. The classification of the training set, t, usually by a teacher or an operation carried out in the environment with respect to a particular subclass, c.
• Noisy examples are considered, where the teacher may classify instances wrongly.
• The type of examples to be presented (i.e. positive only, negative only or both).

3. The presentation of t to L
Whether elements from t are fed into L one by one, in small groups or as a whole batch, and whether the elements are presented in any particular order (i.e. in lexicographic order or shortest length first).

4. The size of t
Intuitively, a small t is needed for learning by an efficient and 'intelligent' learner or learning algorithm. In machine learning, the size of t contributes to the computational complexity of a learning algorithm: the larger t is, the longer (or more complicated) the computation.

5. The choice of representation for the hypothesis space, H
This involves issues of how much information should be, and can be, captured by a particular choice of representation. A rich descriptive language as representation means a more complex computation and larger resource (i.e. memory storage) requirements, whereas a simple form of representation may not capture sufficient information to learn.

6. The selection criteria for a hypothesis, h, and its justification as an equivalent of c.

All of the above except the last factor constitute a major part of the design of an algorithm in machine learning, exhibited in Phase 1 of L. The last factor, and also the choice of representation for H, are usually vital in Phase 2 of the learning process, where evaluation is carried out by human experts or by some known mechanism such as statistical confirmation or analysis. The learner, L, is said to be able to learn a class in the given environment if it can learn (i.e. by producing a hypothesis that satisfies both the criteria and the learning goal in the background knowledge provided a priori) any subclass chosen from the class.
2.3 Learning Finite Automata (FA)

This report investigates the learning process in a particular environment setting (Figure 2.3):
- Teacher: the source of the example space, T, where the description of the unknown subclass, c, is in the form of labelled strings.
- Learner: learns by receiving information in the form of labelled strings drawn from T, following the rules set out in the environment constraints.
- c: the unknown regular language or FA.

Two almost similar environments for learning are shown in Figure 2.3, differing in the class contents. The first environment (Figure 2.3(a)) consists of:
- C1: the class of all languages.
- H1: the hypothesis space of finite automata (FAs), as the finite representation of regular languages (i.e. a subclass of C1).
- T: the examples, which are labelled strings of the languages.
- Criteria: the FA accepts all examples (i.e. strings, which may or may not be only positive strings) received from the training set, t.
- Goal: to produce an FA (i.e. the selected hypothesis) that is equivalent to (i.e. that accepts) c.

The other learning environment (Figure 2.3(b)) can be obtained by refining C1 to be the class of regular languages only, with the hypothesis space, H2, being the minimum deterministic finite automata (DFAs). The environment shown in Figure 2.3(b) has more constraints added, as the teacher is to provide descriptions using only regular languages, compared with C1, where the teacher is able to provide descriptions using other languages as well (i.e. context-free languages). This report is concerned with the learnability of finite automata (FA) using minimum DFAs as the hypothesis space. Both environments, with C1 and C2 as classes, use the same set of examples, T, which is a set of strings, and the training set, t, is a set of classified strings with respect to a particular subclass of languages, c. For consistency throughout the report, the alphabet, A, for FAs will be set to the binary set {0,1}.

[Figure 2.3: (a) c' is the subclass of regular languages, c, and H1 is the class of FAs, with the criteria for H1 being deterministic and minimal in size (number of states). (b) c is a particular subset of the regular languages and H2 is the class of minimum DFAs itself, where no criteria are needed.]
2.4 Learning Framework

Given an environment with a class of objects which describes 'what is to be learnt', the two phases, Phase 1 and Phase 2, of the learning process bring us two fundamental questions:
- 'how do we learn?'
- 'when do we know we have learnt?'

The former is dealt with in Phase 1. The latter, in Phase 2, was studied by Gold [Gold 67] and Valiant [Valiant 84], resulting in two major learning frameworks being proposed: identification in the limit by Gold, and probably approximately correct (PAC) learning by Valiant.

2.4.1 Identification in the limit

[Gold 67] states that learning should be a continuous process, with the learner (or learning algorithm), L, having the possibility of changing or refining its guess (i.e. hypothesis) each time new information from the training set, t, is presented. The learner, L, is only required to have all its guesses after some finite time be the same and correct with respect to the information seen so far. Hence, the hypothesis, h, obtained after a finite time will remain the same and correct for subsequent information. The hypothesis, h, is then said to represent the unknown subclass, c, described by t in the limit, completing Phase 2 of the learning process.

This learning framework, identification in the limit, consists of three items as formulated by Gold:

1. A class of objects
A class, C, is specified (or given) to the learner in the environment, where the form of communication between the teacher and learner is also specified. An object, c, from C will be chosen for the learner to identify. [In the context of this report, the unknown object (or subclass), c, is an FA and the class C consists of FAs.]

2. A method of information presentation
Information about the unknown chosen object is presented to the learner. The training set, t, consists of either positive only, positive and negative, or noisy examples as information describing c. [t is just a set of labelled strings drawn from the example space, T, provided by the teacher, and the type of t depends on T: all positive strings, all negative strings or a combination of both.]

3. A naming relation
This basically enables the learner to identify the unknown object, c, by specifying its name(1), h. There is a function, f, for L to map the names to the objects in C. Here, an object, c, can have several names (hypotheses), where guesses (or hypotheses) are made under f. [L is to build an FA as the hypothesis, h, for an unknown regular language, and h could be any of the several DFAs (or TMs) that accept the unknown regular language.]

(1) In [Gold 67], a name is defined as a Turing Machine (TM). Since the language identified by an FA is also identifiable by a TM, it is sufficient to say that every FA has a TM.
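The requirement that the learner's guesses eventually stabilise on a correct hypothesis can be written compactly as follows. This is a standard formulation rather than a quotation from [Gold 67]; h_n is our notation for the guess produced by L after the n-th piece of information from t.

```latex
\exists\, n_0 \;\; \forall n \ge n_0 : \quad h_n = h_{n_0}
\qquad \text{and} \qquad h_{n_0} \equiv c .
```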
2.4.2 PAC view

Another learning framework is Probably Approximately Correct (PAC) learning. This was first proposed by Valiant [Valiant 84] and uses a stochastic setting in the learning process. The learner is required to build an (approximately correct) hypothesis that has a minimal error probability after being trained using the training set, t, constituting Phase 1 of the learning process. Phase 2 under this framework requires the learner to have a high level of confidence that the hypothesis, h, is approximately correct as a representation of the subclass, c. The training set, t, is considered 'good enough' with a high confidence level. This is appropriate because t generally does not consist of all the positive examples needed to learn c.

The PAC framework relies on two parameters, the accuracy (ε) and the confidence limit (δ). A fixed but unknown distribution is assumed over the example space, T, from which training sets, t, are drawn at random. Intuitively, PAC learning seems like a passive type of learning, with the learner learning only through observation of the given data or information. However, [Angluin 88] and [Natarajan 91] showed that PAC learning can also be carried out by an active learner using queries: equivalence, membership, subset, superset, exhaustiveness and disjointness queries [Angluin 87].

Given a real number, δ, between 0 and 1 and a real number, ε, also between 0 and 1, there is a minimum sample size (i.e. size of the training set, t) such that for any unknown subclass, c, with a fixed but unknown distribution on the example space, T: with probability at least (1 - δ), the hypothesis h will classify at most a fraction ε of a test set wrongly, where the test set is another subset of T, different from t, used to test the validity of h.

PAC learning is desirable as a good approximation to c, since in most cases it is computationally difficult to build an exact hypothesis, and [Angluin 88] and [Natarajan 91] have shown that PAC learning can be easily applied to other non-stochastic learning frameworks.

2.4.3 Comparison

Both frameworks have distinct criteria and goals for learning, which deal with Phase 2 of the learning process (Table 2.1). However, they both suggest learning by building tentative hypotheses from pieces of information, in the form of strings from the training set, t (Figure 2.4). Each of those tentative hypotheses is a new 'experience' (i.e. a modified hypothesis with slight changes, or a totally new hypothesis) as new information is received from t. The final hypothesis, h', taken to represent the subclass, c, may be totally different from the previous hypotheses.

            Identification in the limit                    PAC learning
Goal        The same hypothesis (guess) after a            P(error of h with respect to c <= ε) >= 1 - δ on a
            finite time, for all subsequent                sufficiently large sample, t, where P is the probability
            information received.                          function, h the hypothesis, c the unknown subclass, and
                                                           δ and ε the parameters needed.
Criteria    The hypothesis (guess) made must be            The hypothesis, h, has minimal error, ε, with respect
            consistent (correct) with the information      to T.
            seen so far.

Table 2.1: Comparison between the identification in the limit and PAC learning frameworks.
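The goal entry for PAC learning in Table 2.1 can be stated more precisely as follows. This is the standard formulation; the error notation err_D is introduced here for illustration and is not the report's own. h_t is the hypothesis built by L from a random training set t of sufficient size m, and D is the fixed but unknown distribution over the example space T.

```latex
\Pr_{t \sim D^{m}}\!\big[\, \mathrm{err}_D(h_t) \le \varepsilon \,\big] \;\ge\; 1-\delta,
\qquad\text{where}\qquad
\mathrm{err}_D(h) \;=\; \Pr_{x \sim D}\big[\, h(x) \ne c(x) \,\big].
```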
[Figure 2.4: A learning scenario with the learning algorithm, L, making several tentative hypotheses (i.e. h1, h2, h3) in H from a sequence of labelled examples (i.e. t1, t2, t3), in an environment with oracles.]

Recent studies [Kearns et al 94; Rivest et al 88; Porat 91] are carried out under Gold's proposed learning framework, as it is more natural to human learning. We can always change our perception (hypothesis) each time new information is received while still being consistent with the previous information. We never know (or predict) when we have finished learning (which is a perpetual process in humans).

2.4.4 Other variations of learning framework

There are two other learning frameworks mentioned by Gold in [Gold 67]:

1. Finite identification
The learner stops the presentation of information after a finite number of examples and identifies the subclass, c. The learner is to know when it has acquired a sufficient number of examples and is therefore able to identify c.

2. Fixed-time identification
A fixed finite time(2) is specified a priori (i.e. usually as background knowledge) and independently of the unknown object presented, at which the learner stops learning and identifies the unknown object.

These two frameworks seem to ask too much of the learner, who is 'forced' to identify the subclass, c, by outputting a hypothesis, h, after some predicted factor or condition is achieved. In finite identification, the learner must be able to predict the number of examples needed to learn, and to stop learning once the predicted number of examples has been presented. On the other hand, fixed-time identification requires the learner to know in some way 'when' it is able to stop learning. Learning, as mentioned earlier, is to identify or distinguish the ambiguous lines separating each subclass in a learning environment. Being able to tell exactly when to stop learning (i.e. being able to predict those lines) means that there is no need for learning to start in the first place.

(2) Time is taken, throughout the report, to correspond to the computational complexity and the termination of a successful learning algorithm.
2.5 Results on learning finite automata

The complexity and learnability of finite automaton identification have received extensive research [Gold 67; Angluin 87; Vazirani et al 88]. The computational complexity is considered here with respect to the size of the hypothesis space (minimum DFAs) searched and the size of the training set (examples) required. Complexity results that have dealt with computational efficiency are as follows:

1. Identification in the limit and the learnability model [Gold 67]: Gold classifies the classes of languages that are learnable in the limit into three categories of information presentation (Table 2.2). Learning from positive only examples is proven to be NP-complete.

2. Inferring a consistent DFA or NFA of size within a factor (1+1/8) of the minimum consistent DFA is NP-complete, given positive and negative examples [Li et al 87].

3. There is an efficient learning algorithm to find the minimum DFA consistent with given positive and negative data and access to membership and equivalence queries [Angluin 87], using an observation table as a representation of the FA.

4. Learning FA by experimentation(3) (as in 3 above) [Kearns et al 94], using a classification tree as a representation of the FA, in polynomial time.

5. State characterisation and Data Matrix Agreement are introduced for the problem of automaton identification [Gold 78].

6. Inferring minimum DFAs and regular sets from positive and negative examples only is NP-complete [Gold 67, 78; Angluin 78].

  Learnability model                                  Class of languages
  Anomalous text                                      Recursively enumerable; recursive
  Informant (using positive and negative              Primitive recursive; context sensitive; context free;
  examples/instances)                                 regular; superfinite
  Text (using positive only examples/instances)       Finite cardinality

Table 2.2: Learnability and non-learnability of languages [Gold 67], where the superfinite class is the class of all finite languages together with one infinite regular language.

These results show that inferring a DFA directly from examples alone is NP-hard, and that other learning methods are employed to successfully learn FA. The methods used in successfully learning FA are surveyed in the following chapters.

(3) Experimentation: a form of learning where the learner is able to experiment with chosen strings (i.e. selected by the learner and not taken from the training set provided) during training.
3. Non-Probabilistic Learning for FA

In building a hypothesis, h, for an unknown FA, c, the learning algorithm, L, usually receives information (i.e. labelled strings) describing c from a training set, t. L is to build an h that is equivalent to c given the information it has received so far. Ideally, h is to be exactly the same as c. In practice, however, as c is unknown, the teacher usually does not have the complete information required to build the exact FA, and h is then taken to be an approximation to c to some extent specified in the background knowledge (i.e. approximately equivalent to c, or a probably approximately correct h, rather than the usual exact h). Learning relies on L making several guesses based on the information provided by the teacher, in the following 'ways' discussed in this chapter:
a) learning with queries, Section 3.1
b) learning without queries, Section 3.2
c) learning with homing sequences, Section 3.3

L makes guesses about c through a number of tentative hypotheses (i.e. tentative FAs), M', built from the information received. Each guess is a refinement or modification of the previous guess (hypothesis), in which new properties of the FA (i.e. the characteristics and elements of the FA) are discovered. A guess made by L is also called a conjecture. The learner will produce several conjectures until the learning goal is achieved, that is, a final conjecture is accepted as the FA equivalent to c.

All information received and all properties learnt through the modifications are kept in a data structure. A modification to the data structure is called an update, and a new hypothesis is built based on the updated data structure. Hence, the data structure has several roles:
a) a representation of the properties (to be learnt) of an FA:
• the finite number of states
• the transitions (representing the transition function)
• the set of distinguishing strings
• the accepting and rejecting states
b) a record of the modifications made (i.e. updates):
• incorporating more information received: strings in t
• updating the properties learnt
c) a reference for building the next tentative FA, M', after each update.

The data structures used by the learner in this chapter are briefly explained below; detailed explanations of the updates are given in the relevant sections indicated in brackets:

1. Observation table (see Section 3.1.1)
A two-dimensional table, as in Figure 3.5, where the rows correspond to the states and the columns correspond to the set of distinguishing strings of the FA. The entries in the table are the values '0' and '1', corresponding to the transition function of the FA leading to a rejecting or an accepting state respectively.
        e1   e2   ...
  s1     1    0
  s2     0    0
  s3     .    .
  ...
  States = {s1, s2, ...}; distinguishing strings = {e1, e2, ...};
  transition function δ(q,x) = 0 (q.x reaches a rejecting state) or 1 (q.x reaches an accepting state),
  for some string x from state q.

Figure 3.5: Observation table representing the elements of an FA: states (rows), distinguishing strings (columns) and transition function (table entries).

2. Classification tree (see Section 3.1.2)
A binary classification tree where the leaves correspond to the states of the FA and the distinguishing strings are represented by the internal nodes (and root) of the tree, as shown in Figure 3.6. The left and right paths from an internal node correspond to the transition function of the FA leading to a rejecting and an accepting state respectively.

[Figure 3.6: Classification tree representing the elements of an FA: states (leaves), distinguishing strings (internal nodes, including the root) and transition function (the left and right paths). States = {s1, s2, s3, ...}, distinguishing strings = {d1, d2, ...}; for some string x from state q, δ(q,x) follows the left path if q.x reaches a rejecting state and the right path if it reaches an accepting state.]

3. minword(q) (see Section 3.2.1)
A string used to reach a state q of an FA from the initial state q0. The set of minword(q) thus corresponds to the states of an FA, as shown in Figure 3.7. A small sketch of how such a set can be computed follows the figure.

[Figure 3.7: The sets of minword(q) for two FAs. (a) Four minwords, minword(q0) = λ, minword(q1) = 0, minword(q2) = 1 and minword(q3) = 01, representing the states of the FA that accepts all strings with an even number of 0's and 1's. (b) Two minwords, minword(q0) = λ and minword(q1) = 0, representing the states of the FA that accepts all non-empty strings.]
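If minword(q) is taken to be the shortest string reaching q from q0 (as in the examples of Figure 3.7), the whole set can be computed by a breadth-first search over the automaton. The sketch below is our own illustration, reusing the DFA sketch from the Introduction; it is not part of L3 itself.

```python
from collections import deque

def minwords(dfa):
    """Shortest string reaching each reachable state of a DFA from its initial
    state.  Symbols are tried in sorted order, so the first string found for a
    state is shortest and lexicographically least among the shortest ones."""
    result = {dfa.q0: ''}            # minword(q0) is the empty string (lambda)
    queue = deque([dfa.q0])
    while queue:
        q = queue.popleft()
        for a in sorted(dfa.alphabet):
            nxt = dfa.delta[(q, a)]
            if nxt not in result:    # first (hence shortest) string found for nxt
                result[nxt] = result[q] + a
                queue.append(nxt)
    return result

# For the DFA accepting all non-empty strings (Figure 3.7(b)):
#   minwords(NONEMPTY)  ->  {'q0': '', 'q1': '0'}
```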
3.1 Learning with queries

Additional information regarding the unknown c can be requested by L by asking queries [Angluin 88]. The queries can be equivalence queries, membership queries, subset queries, superset queries, disjointness queries and exhaustiveness queries. Two of the six queries are used by the following two algorithms, L1 and L2 (see Sections 3.1.1 and 3.1.2), in learning c:

1. Membership queries
The teacher returns a yes/no answer when the learner presents an input string, x, of its choice in the query, depending upon whether x is accepted by the unknown FA, c.

2. Equivalence queries
The teacher returns a yes answer if the conjecture, M', is equivalent to c; otherwise it returns a counterexample, y, which is a string in the symmetric difference of M' and c.

Hence, L has access to some oracle (which could be the teacher or some operation available in the environment), creating an active interaction between the learner and teacher in the learning process (Figure 3.8). The two queries require a pair of oracles, where each oracle is used in a separate stage of learning:
a) Phase 1 of learning: updating the data structure used to construct the conjecture, M'.
b) Phase 2 of learning: confirming M' as a finite representation of c (i.e. deciding when to stop learning).

[Figure 3.8: Learning with additional information obtained through access to oracles in the environment.]
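The two oracles can be sketched as follows. This is our own illustration, assuming the teacher internally holds the target DFA c (encoded with the DFA sketch from the Introduction); in general the oracle could be a human or any other procedure. The equivalence check walks the product of the two automata and returns a shortest string in their symmetric difference as the counterexample.

```python
from collections import deque

class Teacher:
    """Answers the membership and equivalence queries used by L1 and L2."""

    def __init__(self, target):
        self.target = target                 # the unknown FA c (assumed known to the teacher)

    def member(self, string):
        """Membership query: is the string accepted by the unknown FA c?"""
        return self.target.accepts(string)

    def equivalent(self, conjecture):
        """Equivalence query: (True, None) if the conjecture accepts the same
        language as c, otherwise (False, y) for a shortest counterexample y.
        Implemented as a breadth-first search over the product automaton."""
        start = (self.target.q0, conjecture.q0)
        seen = {start}
        queue = deque([(start, '')])
        while queue:
            (qc, qm), word = queue.popleft()
            if (qc in self.target.accepting) != (qm in conjecture.accepting):
                return False, word           # word is in the symmetric difference
            for a in sorted(self.target.alphabet):
                nxt = (self.target.delta[(qc, a)], conjecture.delta[(qm, a)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + a))
        return True, None
```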
3.1.1 L1: by Dana Angluin [Angluin 87]

The observation table (e.g. Figure 3.5) is the data structure used to store the information and the learnt properties of the unknown FA, c. All rows and columns are represented by strings, based on the information from the training set t and the set of distinguishing strings learnt, respectively. Each row is viewed as a vector with attribute values of '0' and '1' (i.e. the '0' and '1' table entries in each row, corresponding to each column) representing a state in c. Thus, the string representing each row also represents a state in c. A string is said to represent a state q when it can be used to reach q from the initial state q0. The vectors are used to distinguish the rows, and thus to distinguish the states in c.

Alternatively, each row can be viewed as a set of distinguishing strings e, where each e represents a column in the table. The table entry ('1' or '0') in a row depends on whether e (for the corresponding column) is or is not a distinguishing string for the row (the state represented). There may be rows with the same vector (i.e. with the same set of distinguishing strings), and by the Myhill-Nerode theorem of equivalence classes, these rows are said to be equivalent to each other, that is, they represent the same equivalence class x. Thus, we use the alternative view of a row above when referring to the distinct states represented by these rows. The distinct state, that is, the equivalence class x, is represented by the distinct row vector. In Figure 3.9 there are only two distinct rows, s1 and s2, with vectors '0' and '1' and strings λ and 0 respectively. The rest of the rows have the same vector '1' as row s2. Thus, there are only two distinct states, represented by the sets of strings {λ} and {0, 1, 00, 01} respectively. The sets of distinguishing strings are φ and {λ} for the two distinct states respectively.

             e1 = λ
  s1 = λ       0
  s2 = 0       1
  s3 = 1       1
  s4 = 00      1
  s5 = 01      1
  Rows: s1, ..., s5; columns: e1; training set t = {-λ, +0, +1, +00, +01};
  distinguishing strings = {λ}; states: s1, s2.

Figure 3.9: Observation table with five rows representing two distinct states, with strings from t.

We now specify the three main elements of the observation table O, as shown in Figure 3.10, used by the learner L1 during learning to represent the properties and information of c:

1. A non-empty prefix-closed set of strings, S (a prefix-closed set is one where every prefix of each member is also an element of the set). This set starts with the null string, λ. The rows of the observation table are each represented by strings in S ∪ S.A. There are two distinct divisions of rows in O: the upper division of the table (i.e. the shaded rows in Figure 3.10) is represented by the strings in S, and the lower division is represented by the strings in S.A. Each row in the upper division is the particular state reachable through some s ∈ S from the initial state q0.
The rows in the lower division of O therefore represent the next-states reached through transitions a ∈ A from the rows in the upper division. Thus, S represents the states discovered (learnt) by the learner in the course of learning.

2. A non-empty suffix-closed set of strings, E (a suffix-closed set is one where every suffix of each member is also an element of the set). This set also starts with the initial null string, λ. The columns of the observation table are represented by the strings in this set. The vector of each row corresponds to a collection of strings from E. Thus the distinct subsets of E, represented by the distinct row vectors, are used to identify the distinct states represented by the strings in S ∪ S.A. In Figure 3.10, the subsets φ, {λ} ⊆ E (represented by the vectors '0' and '1' respectively) identify the two distinct states represented by {λ} and {0, 1, 00, 01} in S ∪ S.A. Thus, E represents the characteristics of the states, learnt through subsets of the strings in E.

3. A mapping function, T: (S ∪ S.A).E → {0,1}, where T(x.e) = '1' if the string x.e ∈ c and '0' otherwise, with x ∈ S ∪ S.A. This mapping function thus represents the transition function of the FA, δ(q0, x.e). (A compact encoding of the triple (S, E, T) in code is sketched after the figure.)

              E
              λ
  S      λ    0         S = {λ, 0}
         0    1         E = {λ}
  S.A    1    1         S ∪ S.A = {λ, 0, 1, 00, 01}
         00   1         table entries: T(x.e), where x ∈ S ∪ S.A, e ∈ E
         01   1

Figure 3.10: Observation table O with the upper division (the rows of S) and the lower division of rows from the set S.A.
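A compact encoding of (S, E, T) follows, as our own sketch of the definitions above (the class and method names are ours, and the teacher is the membership oracle sketched in Section 3.1).

```python
class ObservationTable:
    """The triple (S, E, T) of Section 3.1.1.  Rows are indexed by strings in
    S or S.A, columns by strings in E, and T is filled by membership queries."""

    def __init__(self, alphabet, teacher):
        self.alphabet = sorted(alphabet)
        self.teacher = teacher
        self.S = {''}          # prefix-closed, starts with the null string
        self.E = {''}          # suffix-closed, starts with the null string
        self.T = {}            # T[x + e] = 1 if x.e is accepted by c, else 0
        self.fill()

    def rows(self):
        """All row labels: S together with S.A."""
        return self.S | {s + a for s in self.S for a in self.alphabet}

    def fill(self):
        """Complete the table entries with membership queries."""
        for x in self.rows():
            for e in self.E:
                if x + e not in self.T:
                    self.T[x + e] = 1 if self.teacher.member(x + e) else 0

    def row(self, x):
        """The vector of a row: the entries of x against every column in E."""
        return tuple(self.T[x + e] for e in sorted(self.E))
```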
Each of the following two properties of O, closed and consistent, is used by L1 as a guide for carrying out updates (i.e. the extension of rows and columns) during learning:

a) Closed
As the lower division of O contains the next-states of the states in the upper division on taking transitions on symbols in A, the row vectors in the lower division must also exist in the upper division; this is the closed property of O. Thus, for every string s' in S.A there is an s in S where both strings, s' and s, have the same vector. As shown in Figure 3.10, the vectors in the lower division of O already exist in the upper division, so all next-states are existing states.

b) Consistent
Each pair of rows with the same vector (i.e. with the same subset of distinguishing strings) should represent the same state. The next-state vectors obtained from this pair of rows should then be the same vector, representing the same next-state reached; this is the consistent property of O. Thus, for any pair of strings s1, s2 in S with row(s1) = row(s2), we require row(s1.a) = row(s2.a) for all a in A. As shown in Figure 3.11, the rows represented by the strings λ and 11, which represent the same distinct state with vector (1 0), are consistent because they move into next-states having the same vector.

         λ    0
  λ      1    0    <- previous state
  0      0    1
  1      0    0    <- next-state of λ on '1'
  11     1    0    <- previous state
  00     1    0
  01     0    0
  10     0    0
  110    0    1
  111    0    0    <- next-state of 11 on '1'

Figure 3.11: An observation table which is consistent: the two rows λ and 11, represented by the same row vector (1 0), have the same row vector (0 0) representing the same next-state reached from both rows of the upper division.

The observation table O is updated by extending the rows and columns (discovering more states and the characteristics of each state) using membership queries and equivalence queries, as shown in Figure 3.12 and Figure 3.13. An update is carried out in two circumstances:

  T1       λ                              T2       λ
  λ        0                              λ        0
  0        1   <- row vector not in       0        1
  1        1      the upper division      1        1
                                          00       1
  T(λ.λ) = 0, T(0.λ) = 1, T(1.λ) = 1      01       1
  S ∪ S.A = {λ, 0, 1}                     S = {λ, 0}, E = {λ},
  make closed: S ∪ {0}                    S ∪ S.A = {λ, 0, 1, 00, 01}

Figure 3.12: (a) Observation table T1, not closed: the row for the string 0 has a vector not represented in the upper division. (b) The closed table T2 obtained by extending T1, with the new row 0 added to the upper division to represent the newly discovered state.

a) When either the closed or the consistent property of O does not hold:

• O is not closed when a vector in the lower division is not represented in the upper division. A new state is then said to be discovered (learnt), as it is a next-state that is not an existing state. In Figure 3.12(a), the row for the string 0 in the lower division has a vector that is not represented in the upper division, indicating that its next-state is not an existing state. O is then updated by S ∪ {s'}, where s' ∈ S.A. Figure 3.12(b) shows the updated O with the new string (row) in S, giving a new row in the upper division representing the newly learnt state.
[Adding s' to S still maintains the prefix-closed property of the set, as s' is an element of S appended with an input letter from the alphabet.]

Note: Membership queries are used to complete the table entries whenever E or S is extended. The queries are made on the strings in (S ∪ S.A).E, where a yes answer from the teacher means a '1' entry in O and a no answer means a '0' entry.

• O is inconsistent when two rows with the same vector have a pair of different next-state vectors. This indicates that one of the pair of strings s, s' in S actually represents a different (newly discovered) state, not among the existing states (rows). In Figure 3.13(a), a pair of rows with the same vector lead to different next-states on the transition '1' in O1. O is then updated by E ∪ {a.e}, where a is the transition symbol which brought the two states to different next-states and e is the element of E at which the next-state vectors differ (i.e. at one of the attributes). Figure 3.13(b) shows the updated O2, with an extra column represented by the string '1', the transition symbol which brought the pair of rows to different rows; the element e of the previous E at which the difference is seen is λ. All the table entries resulting from this additional column are filled in using membership queries on the new (S ∪ S.A).E.

[The suffix-closed property of E is also maintained with a.e added to E, where e is a previous suffix element added to the set before a.e.]

[Figure 3.13: (a) O1 is inconsistent: a pair of rows with the same vector, representing the same state, have different next-state vectors. (b) The updated O2, in which the newly learnt state is represented by a new row with a new vector in the upper division of the new table.]

b) When a counterexample y is returned from an equivalence query:
S is extended to include all the prefixes of y. Thus, the upper division of the table is extended with new strings, and membership queries are used to fill in all the new entries.

We now have the questions of "when is a tentative M' built from the data structure?" and "how is a tentative M' built from the data structure?". A tentative M' (Figure 3.14(a)) is built only when the observation table O has both the closed and the consistent property: all upper rows with the same vector lead to rows with the same vector, and all vectors of the lower division are represented in the upper division. (The two tests and the corresponding updates are sketched in code below.)
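The two tests and the three updates just described can be written as operations on the table sketch above. Again this is our own illustration of the description, not the report's code; the function names are ours.

```python
def is_closed(tbl):
    """Every lower-division row (S.A) must have a vector that already appears
    in the upper division (S); returns (True, None) or (False, s')."""
    upper = {tbl.row(s) for s in tbl.S}
    for s in tbl.S:
        for a in tbl.alphabet:
            if tbl.row(s + a) not in upper:
                return False, s + a
    return True, None

def is_consistent(tbl):
    """Rows of S with equal vectors must lead to equal next-state vectors;
    returns (True, None) or (False, a.e) for the column that must be added."""
    for s1 in sorted(tbl.S):
        for s2 in sorted(tbl.S):
            if s1 < s2 and tbl.row(s1) == tbl.row(s2):
                for a in tbl.alphabet:
                    for e in sorted(tbl.E):
                        if tbl.T[s1 + a + e] != tbl.T[s2 + a + e]:
                            return False, a + e
    return True, None

def make_closed(tbl, s_prime):
    tbl.S.add(s_prime)        # promote the new row to the upper division
    tbl.fill()

def make_consistent(tbl, new_column):
    tbl.E.add(new_column)     # add the distinguishing string a.e to E
    tbl.fill()

def add_counterexample(tbl, y):
    # every prefix of the counterexample is added to S (keeping S prefix-closed)
    for i in range(len(y) + 1):
        tbl.S.add(y[:i])
    tbl.fill()
```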
A closed and consistent O answers the latter question of how M' is built: the table is used to build a tentative deterministic FA (DFA), M', with each distinct vector (i.e. each distinct row) in the upper division representing a state in M'. M' is then completed by adding transitions on all symbols in A from every state. The next-state is determined by looking up the row represented by the string s.a (i.e. the string obtained by taking transition a from row s) and finding the state whose vector matches it.

In Figure 3.14(b), the conjecture M' is built from the closed and consistent observation table O in Figure 3.14(a). The states in M' are the distinct vectors in the upper division, each shared among the strings representing the rows of O with that vector. M' is the minimum DFA accepting all non-empty strings, and the equivalence query on M' returns a yes answer.

Figure 3.14: (a) The closed and consistent observation table O, with S = {λ, 0} and E = {λ}, for the unknown FA c recognising the set of all non-empty strings; the rows are the elements of S ∪ S.A and the columns are the elements of E. (b) The conjecture M' constructed from the closed and consistent O. The final state is the one whose row has vector (1) (bold arrow); λ, being the first row in the table, is always the initial state, a non-accepting state in this case. The next-state transitions are given by the rows {0, 1, 00, 01}.

Each conjecture M' is then presented to the teacher in the form of an equivalence query. If the guess is correct, no counterexample is returned and M' is the minimum DFA equivalent to c, as in Figure 3.14(b), where the equivalence query on M' returns a yes. L1 then stops learning and outputs M' as its hypothesis: a minimum DFA representing the unknown FA. If, however, a counterexample is returned, the observation table is updated (all prefixes of the counterexample are added to S), and further updates follow if the extended table is not closed and/or not consistent. The next conjecture is built only when both properties are satisfied. Membership queries are used to fill in the new entries for the rows obtained from the extended S; the counterexample and its prefixes are thus the learner's own choice of strings for membership queries.

The minimality of the number of states in the conjectured DFAs is maintained by the closed and consistent properties of the observation table. Through the closed and consistent tests on every updated table, two rows with the same vector are treated as belonging to the same equivalence class in the sense of the Myhill-Nerode Theorem (i.e. a class x of strings with the same behaviour on a set of distinguishing strings). Building a conjecture only from a closed and consistent observation table, and taking only the distinct vectors as the distinct states of the DFA being built, therefore always yields a minimum DFA.
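As an illustration of this construction step, the following sketch builds the conjecture DFA from a closed and consistent table, reusing row, S, A, E and T from the previous sketch; build_conjecture is again an illustrative name of ours, and the DFA is returned simply as a (states, start, accepting, delta) tuple.

def build_conjecture(S, A, E, T):
    """Construct the conjecture DFA from a closed and consistent observation table."""
    states = {row(s, E, T) for s in S}               # one state per distinct upper vector
    start = row("", E, T)                            # λ is always the first row
    accepting = {q for q in states if q[E.index("")] == 1}   # '1' entry in the λ column
    delta = {(row(s, E, T), a): row(s + a, E, T)     # next-state: look up the row s.a
             for s in S for a in A}
    return states, start, accepting, delta

Consistency is what makes delta well defined here: two upper rows with the same vector necessarily lead to rows with the same vector on every symbol, so the transition does not depend on which representative string is used.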
"How to start learning?" This question brings us to the important role of the null string λ, with which both S and E start as their first element. This string not only leads us to discovering the initial state q0 (the first row in the table) but also serves as the distinguishing string used to decide which of the distinct vectors are accepting and which are rejecting states. Being the first element of E, it allows the learner to ask, through membership queries, whether the string labelling each row is accepted or rejected by c. A row that has λ among its distinguishing strings and a '1' entry in the λ column must therefore represent an accepting state, as the string represented by that row is itself accepted. In Figure 3.14(a), the vector (1) for the row with string '0' represents an accepting state, since '0' is accepted at the column represented by λ, as indicated by the '1' entry.

The learning process thus starts with S and E each containing a single element (the null string) and an initial table with one column and three rows (one row for λ in the upper division and two next-state rows in the lower division).

Another illustration is shown in Figure 3.15, with the learner trying to learn the FA that accepts all strings with an even number of 0's and an even number of 1's. The initial table is constructed as O0, which is not closed. L1 updates the table until an equivalence query terminates learning with a yes answer for the conjecture M1, after five observation tables (O0 to O4) and two conjectures. The examples required by the learner are obtained through membership queries and counterexamples, both drawn from the training set t of positive and negative examples (i.e. the accepted and rejected strings behind the '1' and '0' entries).

In the run of Figure 3.15, O0 is not closed; making it closed (S ∪ {0}) gives O1 and the first conjecture M0, whose equivalence query returns the counterexample y = 010. Adding the prefixes of y gives O2, which is then made consistent twice, with E ∪ {0.λ} giving O3 and E ∪ {1.λ} giving O4. O4 is closed and consistent, with S = {λ, 0, 01, 010} and E = {λ, 0, 1}, and the equivalence query on the resulting conjecture M1 returns yes.
Figure 3.15: Running example of learning the unknown FA that accepts the set of all strings with an even number of 0's and an even number of 1's.
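Putting the pieces together, the overall L1 loop can be sketched as follows, reusing is_closed, is_consistent and build_conjecture from the earlier sketches and assuming two hypothetical oracles standing in for the teacher: member(w), answering a membership query, and equivalent(M), answering an equivalence query with None for yes or with a counterexample string otherwise.

def prefixes(w):
    """All prefixes of w, from λ up to w itself."""
    return [w[:i] for i in range(len(w) + 1)]

def l1(A, member, equivalent):
    S, E, T = [""], [""], {}

    def fill():
        # membership queries for every entry of (S ∪ S.A).E not yet in the table
        for s in S:
            for a in [""] + list(A):
                for e in E:
                    w = s + a + e
                    if w not in T:
                        T[w] = member(w)

    fill()
    while True:
        closed, new_row = is_closed(S, A, E, T)
        if not closed:
            S.append(new_row)              # make the table closed
            fill()
            continue
        consistent, new_col = is_consistent(S, A, E, T)
        if not consistent:
            E.append(new_col)              # make the table consistent
            fill()
            continue
        M = build_conjecture(S, A, E, T)
        y = equivalent(M)                  # None means the teacher answers yes
        if y is None:
            return M
        for p in prefixes(y):              # add all prefixes of the counterexample to S
            if p not in S:
                S.append(p)
        fill()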
3.1.2 L2: by Kearns and Vazirani [Kearns et al 94]

This algorithm uses the same principles as L1 (i.e. membership and equivalence queries, and positive and negative examples), but the data structure used to construct the tentative FA, M', is a classification tree, as shown in Figure 3.16. The leaves of the classification tree represent the states of c learnt (known) so far, and the nodes represent the distinguishing strings required to distinguish (discover) the states in c. Every node and leaf is represented by a string, based on the information received from counterexamples and from membership queries on chosen strings.

Figure 3.16: Classification tree T1, with three nodes (the root d1 = λ, d2 = 0 and d3 = 1) representing three distinguishing strings and four leaves (s1 = λ, s2 = 0, s3 = 1, s4 = 01) each representing an equivalence class; distinguishing strings = {λ, 0, 1}, training set t = {+λ, -0, -1, -01}.

The Myhill-Nerode Theorem is also adopted by L2: the tree maintains the set of distinguishing strings that distinguish between the equivalent states represented as leaves. Each leaf can be viewed as an equivalence class x, containing a set of (representative) strings having the same behaviour with respect to c and the set of distinguishing strings. Each node is thus the distinguishing string separating the leaves in its right subtree from those in its left subtree: concatenated with the node's string, the former are accepted and the latter rejected. In Figure 3.16, the node d3, represented by the string 1, distinguishes between the leaves 1 and 01, in its right and left subtrees respectively, with respect to the FA that accepts all strings with an even number of 0's and 1's.

The next-state that a string x reaches on transition symbol a is determined by traversing the tree with the string xa, starting from the root, until a leaf s is reached. At each node d visited, the next path to take depends on whether the string xad is accepted or rejected by c: the right path is taken if xad is accepted by the unknown FA, and the left path otherwise. The leaf s reached is the equivalence class to which xa belongs; thus xa is said to represent the state represented by s. Membership queries are used here to decide which path to take, with xad being a string of the learner's choice. In Figure 3.16, the string 01 traversed through T1 ends up in leaf s4, since the string 011 is rejected at node d3 and the left path is taken to leaf s4. In T2 of Figure 3.17, however, the same string 01 is accepted at the root d1, and the traversal reaches the right leaf s1. The string 01 therefore represents the state s4 in T1 and the state s1 in T2, with respect to the respective FAs being learnt.

Figure 3.17: Classification tree T2, with one node (the root d1 = λ) and two leaves, s2 = λ on the left and s1 = 01 on the right; distinguishing strings = {λ}, training set t = {-λ, +01}.
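To make the traversal concrete, the following minimal sketch shows one possible representation of the classification tree and of the traversal that locates the state a given string belongs to, assuming the same hypothetical membership oracle member(w); the class Node and the function sift are illustrative names rather than the report's.

class Node:
    def __init__(self, string, left=None, right=None):
        self.string = string          # a distinguishing string (internal node)
                                      # or an access string (leaf)
        self.left = left              # subtree of strings s with member(s + d) False
        self.right = right            # subtree of strings s with member(s + d) True

    def is_leaf(self):
        return self.left is None and self.right is None

def sift(tree, x, member):
    """Return the leaf (access string node) representing the state reached by x."""
    node = tree
    while not node.is_leaf():
        d = node.string
        node = node.right if member(x + d) else node.left
    return node

Here sift plays for L2 the role that the row vectors play for L1: using only membership queries, it assigns any string to the equivalence class of one of the known access strings.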
The classification tree T maintains two main elements, representing both the properties learnt of c and the information received from the training set:

1. a set of access strings, S

The initial set contains only one string, the null string λ. The leaves of T are each represented by a string in S, so the leaves together represent the states of the unknown FA discovered so far. The leaves in the left subtree of the root are all the s in S that are rejected by c, and the leaves in the right subtree of the root are the strings that are accepted by c. S is thus subdivided into two subsets corresponding to rejecting and accepting states (i.e. leaves). In Figure 3.17, S is the set of strings representing the leaves, and the two subsets for T2 are {λ} and {01}, the negative and positive examples from t respectively.

2. a suffix-closed set of distinguishing strings, D

The initial D starts with the null string λ and is used to distinguish each pair of access strings in S. The strings in D represent the nodes of T. Each node d has exactly two children, distinguishing a pair of strings in S: the right subtree of d consists of the strings s for which s.d is accepted by the unknown FA, and the left subtree of those for which s.d is rejected. In Figure 3.17, the root is the distinguishing string for the leaves in its right and left subtrees, where the strings λ and 01 representing the leaves also represent the rejecting and accepting states respectively.

The classification tree is used to build every conjecture M' except the initial conjecture (the learner's first guess), M0. There are only two possible initial conjectures to choose from as M0, shown in Figure 3.18(a) and Figure 3.18(b), each having a single start state with all transitions to itself. This initial guess depends on a membership query on the null string λ: M0 either accepts or rejects the set of all strings, depending on whether λ is accepted by the unknown M, i.e. the initial state is an accepting state if λ is accepted by M and a rejecting state otherwise.

Figure 3.18: (a) M0 accepting all strings, when t provides the positive example +λ. (b) M0 accepting the empty set, when t provides the negative example -λ. In each case the corresponding tree T0 is an incomplete tree, to be completed with the counterexample returned by an equivalence query on M0.
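The choice of the initial guess can be sketched in the same style, again assuming the member(w) oracle; initial_conjecture is an illustrative name, and the conjecture is represented as the same (states, start, accepting, delta) tuple used in the earlier sketches.

def initial_conjecture(A, member):
    """The learner's first guess M0: one state looping on every symbol.
    It accepts everything if λ is accepted by the unknown FA, and nothing otherwise."""
    accept_all = member("")                     # membership query on the null string λ
    states = {""}
    delta = {("", a): "" for a in A}            # all transitions back to the single state
    accepting = {""} if accept_all else set()
    return states, "", accepting, delta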
As in L1, for every conjecture M' produced, an equivalence query on M' is presented by the learner, and L2 terminates if no counterexample is returned. The learner's first guess is thus simply whether all strings are accepted or rejected by the unknown M. The first counterexample supplies the remaining, unrepresented leaf y in the incomplete T0 of the initial conjecture, as in Figure 3.18, where y is one of the leaves of each tree. Each subsequent counterexample y returned is analysed using the divergence concept (see below) together with the current classification tree T.

In Figure 3.19(a), the unknown FA accepts all non-empty strings, so the initial M0 is the DFA accepting the empty set. An equivalence query on M0 returns the counterexample string +01. As λ is rejected at the root, the first classification tree T0, shown in Figure 3.19(b), has λ as the left child of the root and the counterexample 01 as the right child.

Figure 3.19: Learning an unknown FA that accepts the set of all non-empty strings. (a) The initial conjecture M0 (S = {λ}, D = {λ}), whose equivalence query returns the counterexample y = 01. (b) The classification tree T0, with two leaves from S = {λ, 01} and one node (the root, λ) from D = {λ}. (c) The conjecture M' constructed from the classification tree in (b), with the first leaf, λ, as the initial state and the second leaf, represented by the string 01, as the other state in M'. As 01 is a leaf in the right subtree of the root, the final state is the one represented by the leaf 01; the equivalence query on M' returns yes.
In Figure 3.19(b), T0 is then used to build a tentative deterministic FA (DFA), M' (Figure 3.19(c)), with the leaves λ and 01 representing the states of M'; all states in M' are labelled with the leaves of T0. The next-state transitions for every state in M' are obtained by traversing T0 with the string representing each state appended with a transition symbol a; the strings used for the next-state transitions are {0, 1, 010, 011}. The next equivalence query returns a yes, which terminates learning, as in Figure 3.19(c).

For every counterexample y, each prefix of y is analysed to determine the prefix that leads to different states when it is tested on both the current T and the conjecture M' that returned y in the equivalence query. Each test yields a state: a leaf from T and a state from M' respectively. Since M' is built by taking all the leaves of T as the states of the conjecture, the two tests should point to the same state for every string if T and M' are equivalent. A counterexample therefore indicates that somewhere along the string y = y1…yn (with n input symbols), at one of the prefixes y1…ym, M' and T diverge into different paths leading to a pair of different states, sM and sT, obtained from M' and T respectively. The divergence point is the immediately preceding prefix y1…ym-1.

The current tree T is then used to trace the common ancestor of sM and sT, that is, the node d that distinguishes the leaves represented by sM and sT. Both d and the divergence point y1…ym-1 are used to update the classification tree. Figure 3.20 shows how the divergence point is found from a counterexample y.

Figure 3.20: The unknown FA accepts all strings with an even number of 0's and 1's. The conjecture M', built from the tree T' with leaves λ, 0 and 01, is returned with the counterexample y = 11 from the equivalence query. Each prefix of y is traversed in both T' and M': for the prefix 1, both reach 01; for the prefix 11, M' reaches sM = 01 while T' reaches sT = λ, so divergence occurs at 11 and the divergence point is the prefix 1. The common ancestor of the leaves 01 and λ, found from T', is the root λ.
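A minimal sketch of this divergence analysis is given below, reusing sift from the earlier sketch and assuming the conjecture's transitions are available as the delta dictionary built for it (keyed by access string and input symbol); find_divergence is an illustrative name of ours.

def find_divergence(tree, delta, start, y, member):
    """Return (m, s_prev, s_T, s_M): the first position m at which the leaf
    reached in T and the state reached in M' differ for the prefix y[:m],
    together with the leaf s_prev on which they still agreed for y[:m-1]."""
    s_prev = start                                  # state/leaf for the empty prefix
    state_M = start
    for m in range(1, len(y) + 1):
        state_M = delta[(state_M, y[m - 1])]        # state of the prefix y[:m] in M'
        leaf_T = sift(tree, y[:m], member).string   # leaf of the prefix y[:m] in T
        if leaf_T != state_M:
            return m, s_prev, leaf_T, state_M       # divergence at the prefix y[:m]
        s_prev = leaf_T
    return None                                     # no divergence: cannot happen for
                                                    # a genuine counterexample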
As the learner acquires more information about the properties of c, the new information (regarding new states) is recorded in the tree by extending its nodes and leaves. These updates use only the information from equivalence queries, that is, the counterexamples. Each counterexample is analysed for its divergence point, and the results of the divergence analysis are the common ancestor d and the prefix y1…ym-1 representing the divergence point. The tree is then updated using d and y1…ym-1 as follows:

a) a new access string, y1…ym-1 (a prefix of y), to add to S

A new state, represented by y1…ym-1, has been discovered, and a new leaf is added to T as S is extended. [S is extended to include the prefix of the counterexample representing the newly discovered state of c, shown in Figure 3.21 by the shaded leaf 1.]

b) a new distinguishing string, a.d, to add to D, where

d is the common ancestor in T of sT and sM;
a is the input letter ym (the mth symbol of y) that leads the divergence point y1…ym-1 to sT and sM, so that the prefix y1…ym = (y1…ym-1).a;
sT and sM are the states reached by the prefix y1…ym in T and M' respectively.

[D is extended when a counterexample is returned and a prefix of the counterexample (the one added to S) is identified as the point of divergence. The new distinguishing string is a.d, where d is the common ancestor node and a is the input letter leading from the divergence point to the two different states reached, as shown by the shaded node in Figure 3.21.]

The extension replaces the leaf reached by the divergence point y1…ym-1 in the current tree (the leaf on which T and M' still agreed) with a new internal node labelled a.d. That leaf's access string and the new access string y1…ym-1 become the children of the new node; their left or right positions depend on whether each of them, concatenated with a.d, is accepted, which is determined by membership queries. The suffix-closed property of D maintains reachability from the other states to the final state(s) each time a new state (leaf) is discovered (i.e. added to S). Figure 3.21 shows how the tree T' of Figure 3.20 is updated and used to build a new conjecture M".

Figure 3.21: The tree T", updated from T' in Figure 3.20, has a new node and a new leaf (shown shaded) added once the divergence point is found in the previous counterexample. The new conjecture M" built from T" is presented in an equivalence query and no counterexample is returned.
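Under the same assumptions, the update itself can be sketched as follows, reusing Node and sift from the earlier sketches and the values returned by find_divergence; find_leaf and lca_string are small helpers of our own, not names from the report.

def find_leaf(node, s):
    """Locate the leaf labelled with the access string s (simple recursive search)."""
    if node.is_leaf():
        return node if node.string == s else None
    return find_leaf(node.left, s) or find_leaf(node.right, s)

def lca_string(node, s1, s2):
    """The distinguishing string at the common ancestor of the leaves s1 and s2."""
    for child in (node.left, node.right):
        if child is not None and find_leaf(child, s1) and find_leaf(child, s2):
            return lca_string(child, s1, s2)
    return node.string

def update_tree(tree, y, m, s_old, s_T, s_M, member):
    new_access = y[:m - 1]               # the divergence point y1...ym-1, added to S
    a = y[m - 1]                         # the mth symbol of y
    d = lca_string(tree, s_T, s_M)       # common ancestor of the two diverging leaves
    new_dist = a + d                     # the new distinguishing string, added to D
    node = find_leaf(tree, s_old)        # the leaf on which T and M' still agreed
    node.string = new_dist               # it becomes an internal node labelled a.d
    old_leaf, new_leaf = Node(s_old), Node(new_access)
    # a membership query decides which child goes right (accepted) and which left (rejected)
    if member(new_access + new_dist):
        node.left, node.right = old_leaf, new_leaf
    else:
        node.left, node.right = new_leaf, old_leaf
    return new_access, new_dist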
We illustrate another run of L2, learning the FA that accepts the strings with an even number of 0's and 1's, in Figure 3.22 below; the prefix at which divergence occurs in each counterexample is shown as ym. The run proceeds as follows: the initial conjecture M0 receives the counterexample y = 01, giving T0 with S = {λ, 01} and D = {λ}; M1 receives y = 00 (ym = 00), giving T1 with S = {λ, 01, 0} and D = {λ, 0}; M2 receives y = 11 (ym = 11), giving T2 with S = {λ, 01, 0, 1} and D = {λ, 0, 1}; finally, the equivalence query on M3 returns yes.

Figure 3.22: Running example of learning the unknown FA that accepts the set of strings with an even number of 0's and 1's.

The access strings in S are therefore prefixes of counterexamples, and the size of S grows by one with each counterexample returned (so, counting the initial access string λ, it is one more than the number of counterexamples). L2 maintains S so that each string represents a distinct state of the minimum DFA for M; the size of S is thus at most the number of states of the minimum DFA for M at any point during the learning process. Hence each counterexample produces a new access string, which immediately leads to a new conjecture containing the newly discovered state.
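Tying the sketches together, the overall L2 loop might look as follows, reusing Node, sift, initial_conjecture, find_divergence and update_tree from above together with the hypothetical member(w) and equivalent(M) oracles; leaves and build_from_tree are again illustrative helpers of ours.

def leaves(node):
    """All access strings at the leaves of the classification tree."""
    if node.is_leaf():
        return [node.string]
    return leaves(node.left) + leaves(node.right)

def build_from_tree(tree, A, member):
    """Build a conjecture DFA whose states are the leaves of the tree."""
    S = leaves(tree)
    delta = {(s, a): sift(tree, s + a, member).string for s in S for a in A}
    accepting = {s for s in S if member(s)}   # the leaves in the root's right subtree
    return S, "", accepting, delta            # λ, the first access string, is the start state

def l2(A, member, equivalent):
    M0 = initial_conjecture(A, member)
    y = equivalent(M0)
    if y is None:
        return M0
    # the first counterexample completes the incomplete T0: λ and y become the two
    # leaves under the root λ, on the rejecting (left) and accepting (right) sides
    left, right = (y, "") if member("") else ("", y)
    tree = Node("", Node(left), Node(right))
    while True:
        M = build_from_tree(tree, A, member)
        y = equivalent(M)
        if y is None:
            return M
        S, start, accepting, delta = M
        m, s_old, s_T, s_M = find_divergence(tree, delta, start, y, member)
        update_tree(tree, y, m, s_old, s_T, s_M, member)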