Unit 7
Learning

Concept of Learning; Rote Learning; Learning by Taking Advice; Learning in Problem Solving; Learning by Induction; Explanation-Based Learning; Learning Automation; Learning in Neural Networks

Learning Objectives

After reading this unit you should appreciate the following:

• Concept of Learning
• Learning Automation
• Genetic Algorithm
• Learning by Induction
• Neural Networks
• Learning in Neural Networks
• Back Propagation Network

Concept of Learning

One of the most often heard criticisms of AI is that machines cannot be called intelligent until they are able to learn to do new things and to adapt to new situations, rather than simply doing as they are told to do. There can be little question that the ability to adapt to new surroundings and to solve new problems is an important characteristic of intelligent entities. Can we expect to see such abilities in programs? Ada Augusta, one of the earliest philosophers of computing, wrote that the Analytical Engine "has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform."
Several AI critics have interpreted this remark as saying that computers cannot learn. In fact, it does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in such a way that its performance gradually improves.

Rather than asking in advance whether it is possible for computers to "learn," it is much more enlightening to try to describe exactly what activities we mean when we say "learning" and what mechanisms could be used to enable us to perform those activities. Simon has proposed that learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.

As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play tennis, the better you get. At the other end of the spectrum lies knowledge acquisition. As we have seen, many AI programs draw heavily on knowledge as their source of power. Knowledge is generally acquired through experience, and such acquisition is the focus of this chapter.

Knowledge acquisition itself includes many different activities. Simple storing of computed information, or rote learning, is the most basic learning activity. Many computer programs, e.g., database systems, can be said to "learn" in this sense, although most people would not call such simple storage learning. However, many AI programs are able to improve their performance substantially through rote-learning techniques, and we will look at one example in depth, the checker-playing program of Samuel.

Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving. The advice may need to be first operationalized.

People also learn through their own problem-solving experience. After solving a complex problem, we remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily. In contrast to advice taking, learning from problem-solving experience does not usually involve gathering new knowledge that was previously unavailable to the learning program. That is, the program remembers its experiences and generalizes from them, but does not add to the transitive closure of its knowledge, in the sense that an advice-taking program would, i.e., by receiving stimuli from the outside world.
In large problem spaces, however, efficiency gains are critical. Practically speaking, learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn through problem-solving experience may be able to come up with qualitatively better solutions in the future.

Another form of learning that does involve stimuli from the outside is learning from examples. We often learn to classify things in the world without being given explicit rules. For example, adults can differentiate between cats and dogs, but small children often cannot. Somewhere along the line, we induce a method for telling cats from dogs, based on seeing numerous examples of each. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher.

AI researchers have proposed many mechanisms for doing the kinds of learning described above. In this chapter, we discuss several of them. But keep in mind throughout this discussion that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise definition of learning that distinguishes it from other problem-solving tasks.

Rote Learning

When a computer stores a piece of data, it is performing a rudimentary form of learning. After all, this act of storage presumably allows the program to perform better in the future (otherwise, why bother?). In the case of data caching, we store computed values so that we do not have to recompute them later. When computation is more expensive than recall, this strategy can save a significant amount of time. Caching has been used in AI programs to produce some surprising performance improvements. Such caching is known as rote learning.
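The idea of data caching can be made concrete in a few lines of code. The sketch below is illustrative only: it assumes an expensive scoring routine (here a stand-in called static_eval) and memoizes its results keyed by position, so that a position seen again is scored by lookup rather than by recomputation.

    # hypothetical illustration of rote learning as caching (memoization)
    cache = {}

    def static_eval(position):
        # stand-in for an expensive static evaluation function
        return sum(position)

    def evaluate(position):
        """Return a score for position, recomputing only on a cache miss."""
        key = tuple(position)                    # positions must be hashable to index the cache
        if key not in cache:
            cache[key] = static_eval(position)   # expensive computation, done only once
        return cache[key]

    print(evaluate([1, 2, 3]))   # computed
    print(evaluate([1, 2, 3]))   # recalled from the cache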
Figure 7.1: Storing Backed-Up Values

We mentioned one of the earliest game-playing programs, Samuel's checkers program. This program learned to play checkers well enough to beat its creator. It exploited two kinds of learning: rote learning and parameter (or coefficient) adjustment. Samuel's program used the minimax search procedure to explore checkers game trees. As is the case with all such programs, time constraints permitted it to search only a few levels in the tree. (The exact number varied depending on the situation.) When it could search no deeper, it applied its static evaluation function to the board position and used that score to continue its search of the game tree. When it finished searching the tree and propagating the values backward, it had a score for the position represented by the root of the tree. It could then choose the best move and make it. But it also recorded the board position at the root of the tree and the backed-up score that had just been computed for it, as shown in Figure 7.1(a).

Now consider a situation as shown in Figure 7.1(b). Instead of using the static evaluation function to compute a score for position A, the stored value for A can be used.
This creates the effect of having searched an additional several ply, since the stored value for A was computed by backing up values from exactly such a search.

Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-solving capabilities. But even it shows the need for some capabilities that will become increasingly important in more complex learning systems. These capabilities include:

1. Organized storage of information - In order for it to be faster to use a stored value than it would be to recompute it, there must be a way to access the appropriate stored value quickly. In Samuel's program, this was done by indexing board positions by a few important characteristics, such as the number of pieces. But as the complexity of the stored information increases, more sophisticated techniques are necessary.

2. Generalization - The number of distinct objects that might potentially be stored can be very large. To keep the number of stored objects down to a manageable level, some kind of generalization is necessary. In Samuel's program, for example, the number of distinct objects that could be stored was equal to the number of different board positions that can arise in a game. Only a few simple forms of generalization were used in Samuel's program to cut down that number. All positions are stored as though White is to move. This cuts the number of stored positions in half. When possible, rotations along the diagonal are also combined. Again, though, as the complexity of the learning process increases, so too does the need for generalization.

At this point, we have begun to see one way in which learning is similar to other kinds of problem solving. Its success depends on a good organizational structure for its knowledge base.

Student Activity 7.1

Before reading the next section, answer the following questions.
1. Discuss the role of AI in learning.
2. What do you mean by skill refinement and knowledge acquisition?
3. Would it be reasonable to apply the rote learning procedure to chess? Why?
If your answers are correct, then proceed to the next section.

Learning by Taking Advice
A computer can do very little without a program for it to run. When a programmer writes a series of instructions into a computer, a rudimentary kind of learning is taking place. The programmer is a sort of teacher, and the computer is a sort of student. After being programmed, the computer is now able to do something it previously could not. Executing the program may not be such a simple matter, however. Suppose the program is written in a high-level language like LISP. Some interpreter or compiler must intervene to change the teacher's instructions into code that the machine can execute directly.

People process advice in an analogous way. In chess, the advice "fight for control of the center of the board" is useless unless the player can translate the advice into concrete moves and plans. A computer program might make use of the advice by adjusting its static evaluation function to include a factor based on the number of center squares attacked by its own pieces.

Mostow describes a program called FOO, which accepts advice for playing hearts, a card game. A human user first translates the advice from English into a representation that FOO can understand. For example, "Avoid taking points" becomes:

    (avoid (take-points me) (trick))

FOO must operationalize this advice by turning it into an expression that contains concepts and actions FOO can use when playing the game of hearts. One strategy FOO can follow is to UNFOLD an expression by replacing some term by its definition. By UNFOLDing the definition of avoid, FOO comes up with:

    (achieve (not (during (trick) (take-points me))))

FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition of trick:

    (achieve (not (during
                    (scenario
                      (each p1 (players) (play-card p1))
                      (take-trick (trick-winner)))
                    (take-points me))))

In other words, the player should avoid taking points during the scenario consisting of (1) players playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which steps could cause one to take points.
It rules out step 1 on the basis that it knows of no intersection of the concepts take-points and play-card. But step 2 could affect taking points, so FOO UNFOLDs the definition of take-points:

    (achieve (not (there-exists c1 (cards-played)
                    (there-exists c2 (point-cards)
                      (during (take (trick-winner) c1) (take me c2))))))

This advice says that the player should avoid taking point-cards during the process of the trick-winner taking the trick. The question for FOO now is: under what conditions does (take me c2) occur during (take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes that points will be taken if me = trick-winner and c2 = c1. It transforms the advice into:

    (achieve (not (and (have-points (cards-played))
                       (= (trick-winner) me))))

This means, "Do not win a trick that has points." We have not travelled very far conceptually from "avoid taking points," but it is important to note that the current vocabulary is one that FOO can understand in terms of actually playing the game of hearts. Through a number of other transformations, FOO eventually settles on:

    (achieve (>= (and (in-suit-led (card-of me))
                      (possible (trick-has-points)))
                 (low (card-of me))))

In other words, when playing a card that is the same suit as the card that was played first, if the trick possibly contains points, then play a low card. At last, FOO has translated the rather vague advice "avoid taking points" into a specific, usable heuristic. FOO is able to play a better game of hearts after receiving this advice. A human can watch FOO play, detect new mistakes, and correct them through yet more advice, such as "play high cards when it is safe to do so." The ability to operationalize knowledge is critical for systems that learn from a teacher's advice. It is also an important component of explanation-based learning.
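Operationalization by UNFOLDing can be pictured as a small term-rewriting step. The code below is not FOO; it is a hypothetical sketch in which advice is held as nested tuples, a single made-up definition of avoid is stored as a (parameters, body) pair, and one unfold step replaces the operator by the body of its definition with the actual arguments substituted for the formal parameters.

    # hypothetical sketch of one UNFOLD step over advice held as nested tuples
    DEFS = {
        # (formal parameters, body) -- a toy stand-in for FOO's real definitions
        "avoid": (("event", "interval"),
                  ("achieve", ("not", ("during", "interval", "event")))),
    }

    def substitute(expr, bindings):
        """Replace formal parameters in expr by the actual arguments."""
        if isinstance(expr, tuple):
            return tuple(substitute(e, bindings) for e in expr)
        return bindings.get(expr, expr)

    def unfold(expr):
        """Replace the outermost operator by its definition, if one is known."""
        op, *args = expr
        if op in DEFS:
            params, body = DEFS[op]
            return substitute(body, dict(zip(params, args)))
        return expr

    advice = ("avoid", ("take-points", "me"), ("trick",))
    print(unfold(advice))
    # -> ('achieve', ('not', ('during', ('trick',), ('take-points', 'me'))))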
Learning in Problem Solving

In the last section, we saw how a problem solver could improve its performance by taking advice from a teacher. Can a program get better without the aid of a teacher? It can, by generalizing from its own experiences.

Learning by Parameter Adjustment

Many programs rely on an evaluation procedure that combines information from several sources into a single summary statistic. Game-playing programs do this in their static evaluation functions, in which a variety of factors, such as piece advantage and mobility, are combined into a single score reflecting the desirability of a particular board position. Pattern classification programs often combine several features to determine the correct category into which a given stimulus should be placed. In designing such programs, it is often difficult to know a priori how much weight should be attached to each feature being used. One way of finding the correct weights is to begin with some estimate of the correct settings and then to let the program modify the settings on the basis of its experience. Features that appear to be good predictors of overall success will have their weights increased, while those that do not will have their weights decreased, perhaps even to the point of being dropped entirely.

Samuel's checkers program exploited this kind of learning in addition to the rote learning described above, and it provides a good example of its use. As its static evaluation function, the program used a polynomial of the form

    c1t1 + c2t2 + ... + c16t16

The t terms are the values of the sixteen features that contribute to the evaluation. The c terms are the coefficients (weights) that are attached to each of these values. As learning progresses, the c values will change.

The most important question in the design of a learning program based on parameter adjustment is "When should the value of a coefficient be increased and when should it be decreased?" The second question to be answered is then "By how much should the value be changed?" The simple answer to the first question is that the coefficients of terms that predicted the final outcome accurately should be increased, while the coefficients of poor predictors should be decreased. In some domains, this is easy to do. If a pattern classification program uses its evaluation function to classify an input and it gets the right answer, then all the terms that predicted that answer should have their weights increased. But in game-playing programs, the problem is more difficult. The program does not get any concrete feedback from individual moves.
It does not find out for sure until the end of the game whether it has won. But many moves have contributed to that final outcome. Even if the program wins, it may have made some bad moves along the way. The problem of appropriately assigning responsibility to each of the steps that led to a single outcome is known as the credit assignment problem.

Samuel's program exploits one technique, albeit imperfect, for solving this problem. Assume that the initial values chosen for the coefficients are good enough that the total evaluation function produces values that are fairly reasonable measures of the correct score, even if they are not as accurate as we hope to get them. Then this evaluation function can be used to provide feedback to itself. Move sequences that lead to positions with higher values can be considered good (and the terms in the evaluation function that suggested them can be reinforced).

Because of the limitations of this approach, however, Samuel's program did two other things, one of which provided an additional test that progress was being made and the other of which generated additional nudges to keep the process out of a rut.

When the program was in learning mode, it played against another copy of itself. Only one of the copies altered its scoring function during the game; the other remained fixed. At the end of the game, if the copy with the modified function won, then the modified function was accepted. Otherwise, the old one was retained. If, however, this happened very many times, then some drastic change was made to the function in an attempt to get the process going in a more profitable direction.

Periodically, one term in the scoring function was eliminated and replaced by another. This was possible because, although the program used only sixteen features at any one time, it actually knew about thirty-eight. This replacement differed from the rest of the learning procedure since it created a sudden change in the scoring function rather than a gradual shift in its weights.

This process of learning by successive modifications to the weights of terms in a scoring function has many limitations, mostly arising out of its lack of exploitation of any knowledge about the structure of the problem with which it is dealing and the logical relationships among the problem's components. In addition, because the learning procedure is a variety of hill climbing, it suffers from the same difficulties as do other hill-climbing programs. Parameter adjustment is certainly not a solution to the overall learning problem. But it is often a useful technique, either in situations where very little additional knowledge is available or in programs in which it is combined with more knowledge-intensive methods.
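The flavour of parameter adjustment can be shown with a small sketch. This is not Samuel's procedure; it is a hypothetical illustration in which a linear evaluator c1t1 + ... + cntn is nudged after each training position, with the weights of features that agreed with the observed outcome increased and the others decreased. The step size and feature values are made up.

    # hypothetical sketch of parameter (coefficient) adjustment for a linear evaluator
    def evaluate(weights, features):
        """Score a position as the weighted sum c1*t1 + ... + cn*tn."""
        return sum(c * t for c, t in zip(weights, features))

    def adjust(weights, features, outcome, step=0.1):
        """Reward features that predicted outcome (+1 for a win, -1 for a loss)."""
        return [c + step * outcome * t for c, t in zip(weights, features)]

    weights = [0.5, 0.5, 0.5]            # initial guesses for three features
    position = [2.0, -1.0, 0.0]          # e.g. piece advantage, mobility, ...
    weights = adjust(weights, position, outcome=+1)
    # weights of features that pointed toward the win go up, the others go down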
Learning with Macro-Operators

We saw that rote learning was used in the context of a checker-playing program. Similar techniques can be used in more general problem-solving programs. The idea is the same: to avoid expensive recomputation. For example, suppose you are faced with the problem of getting to the downtown post office. Your solution may involve getting in your car, starting it, and driving along a certain route. Substantial planning may go into choosing the appropriate route, but you need not plan about how to go about starting your car. You are free to treat START-CAR as an atomic action, even though it really consists of several actions: sitting down, adjusting the mirror, inserting the key, and turning the key. Sequences of actions that can be treated as a whole are called macro-operators.

Macro-operators were used in the early problem-solving system STRIPS. After each problem-solving episode, the learning component takes the computed plan and stores it away as a macro-operator, or MACROP. A MACROP is just like a regular operator except that it consists of a sequence of actions, not just a single one. A MACROP's preconditions are the initial conditions of the problem just solved, and its postconditions correspond to the goal just achieved. In its simplest form, the caching of previously computed plans is similar to rote learning.

Suppose we are given an initial blocks world situation in which ON(C, B) and ON(A, Table) are both true. STRIPS can achieve the goal ON(A, B) by devising a plan with the four steps UNSTACK(C, B), PUTDOWN(C), PICKUP(A), STACK(A, B). STRIPS now builds a MACROP with preconditions ON(C, B), ON(A, Table) and postconditions ON(C, Table), ON(A, B). The body of the MACROP consists of the four steps just mentioned. In future planning, STRIPS is free to use this complex macro-operator just as it would use any other operator.

But rarely will STRIPS see the exact same problem twice. New problems will differ from previous problems. We would still like the problem solver to make efficient use of the knowledge it gained from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to accomplish this. The simplest idea for generalization is to replace all of the constants in the macro-operator by variables. Instead of storing the MACROP described in the previous paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(x1, x2), PUTDOWN(x1), PICKUP(x3), STACK(x3, x2), where x1, x2, and x3 are variables. This plan can then be stored with preconditions ON(x1, x2), ON(x3, Table) and postconditions ON(x1, Table), ON(x3, x2). Such a MACROP can now apply in a variety of situations.
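This naive generalization step, replacing every constant by a variable, is easy to sketch. The code below is a hypothetical illustration rather than the STRIPS implementation: plan steps are tuples of symbols, and each distinct constant that is not a fixed symbol such as Table is mapped to a fresh variable x1, x2, and so on.

    # hypothetical sketch of generalizing a MACROP by replacing constants with variables
    FIXED = {"Table"}     # symbols assumed to keep their identity

    def generalize(plan):
        """Replace each distinct constant in the plan with a variable x1, x2, ..."""
        mapping = {}
        general = []
        for op, *args in plan:
            new_args = []
            for a in args:
                if a in FIXED:
                    new_args.append(a)
                else:
                    mapping.setdefault(a, "x%d" % (len(mapping) + 1))
                    new_args.append(mapping[a])
            general.append((op, *new_args))
        return general

    plan = [("UNSTACK", "C", "B"), ("PUTDOWN", "C"), ("PICKUP", "A"), ("STACK", "A", "B")]
    print(generalize(plan))
    # -> [('UNSTACK', 'x1', 'x2'), ('PUTDOWN', 'x1'), ('PICKUP', 'x3'), ('STACK', 'x3', 'x2')]

As the next paragraphs explain, this is too liberal; STRIPS must reprove the generalized plan to discover constants that have to be retained.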
Generalization is not so easy. Sometimes constants must retain their specific values. Suppose our domain included an operator called STACK-ON-B(x), with preconditions that both x and B be clear, and with postcondition ON(x, B). STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK-ON-B(A). Let's generalize this plan and store it as a MACROP. The precondition becomes ON(x3, x2), the postcondition becomes ON(x1, x2), and the plan itself becomes UNSTACK(x3, x2), PUTDOWN(x3), STACK-ON-B(x1).

Now suppose we encounter a slightly different problem. The generalized MACROP we just stored seems well suited to solving this problem if we let x1 = A, x2 = C, and x3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C), PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the postcondition of the MACROP is overgeneralized. This operation is only useful for stacking blocks onto B, which is not what we need in this new example. In this case, this difficulty will be discovered when the last step is attempted. Although we cleared C, which is where we wanted to put A, we failed to clear B, which is where the MACROP is going to try to put it. Since B is not clear, STACK-ON-B cannot be executed. If B had happened to be clear, the MACROP would have executed to completion, but it would not have accomplished the stated goal.

In reality, STRIPS uses a more complex generalization procedure. First, all constants are replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its preconditions. In our example, the preconditions of steps 1 and 2 are satisfied, but the only way to ensure that B is clear for step 3 is to assume that block x2, which was cleared by the UNSTACK operator, is actually block B. Through "re-proving" that the generalized plan works, STRIPS locates constraints of this kind.

It turns out that the set of problems for which macro-operators are critical are exactly those problems with nonserializable subgoals. Nonserializability means that working on one subgoal will necessarily interfere with the previous solution to another subgoal. Recall that we discussed such problems in connection with nonlinear planning. Macro-operators can be useful in such cases, since one macro-operator can produce a small global change in the world, even though the individual operators that make it up produce many undesirable local changes.

For example, consider the 8-puzzle. Once a program has correctly placed the first four tiles, it is difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted.
For many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the 8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are disturbed within the macro itself). This approach contrasts with STRIPS, which learned its MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it takes to solve a single problem without macro-operators.

Learning by Chunking

Chunking is a process similar in flavor to macro-operators. The idea of chunking comes from the psychological literature on memory and problem solving. SOAR also exploits chunking so that its performance can increase with experience. In fact, the designers of SOAR hypothesize that chunking is a universal learning method, i.e., it can account for all types of learning in intelligent systems.

SOAR solves problems by firing productions, which are stored in long-term memory. Some of those firings turn out to be more useful than others. When SOAR detects a useful sequence of production firings, it creates a chunk, which is essentially a large production that does the work of an entire sequence of smaller ones. As in MACROPs, chunks are generalized before they are stored.

Problems like choosing which subgoals to tackle and which operators to try (i.e., search control problems) are solved with the same mechanisms as problems in the original problem space. Because the problem solving is uniform, chunking can be used to learn general search control knowledge in addition to operator sequences. For example, if SOAR tries several different operators, but only one leads to a useful path in the search space, then SOAR builds productions that help it choose operators more wisely in the future. SOAR has used chunking to replicate the macro-operator results described in the last section. In solving the 8-puzzle, for example, SOAR learns how to place a given tile without permanently disturbing the previously placed tiles. Given the way that SOAR learns, several chunks may encode a single macro-operator, and one chunk may participate in a number of macro sequences. Chunks are generally applicable toward any goal state. This contrasts with macro tables, which are structured towards reaching a particular goal state from any initial state.
Also, chunking emphasizes how learning can occur during problem solving, while macro tables are usually built during a preprocessing stage. As a result, SOAR is able to learn within trials as well as across trials. Chunks learned during the initial stages of solving a problem are applicable in the later stages of the same problem-solving episode. After a solution is found, the chunks remain in memory, ready for use in the next problem.

The price that SOAR pays for this generality and flexibility is speed. At present, chunking is inadequate for duplicating the contents of large, directly computed macro-operator tables.

The Utility Problem

PRODIGY employs several learning mechanisms. One mechanism uses explanation-based learning (EBL), a learning method we discuss later in this unit. PRODIGY can examine a trace of its own problem-solving behavior and try to explain why certain paths failed. The program uses those explanations to formulate control rules that help the problem solver avoid those paths in the future. So while SOAR learns primarily from examples of successful problem solving, PRODIGY also learns from its failures.

A major contribution of the work on EBL in PRODIGY was the identification of the utility problem in learning systems. While new search control knowledge can be of great benefit in solving future problems efficiently, there are also some drawbacks. The learned control rules can take up large amounts of memory, and the search program must take the time to consider each rule at each step during problem solving. Considering a control rule amounts to seeing if its postconditions are desirable and seeing if its preconditions are satisfied. This is a time-consuming process. So while learned rules may reduce problem-solving time by directing the search more carefully, they may also increase problem-solving time by forcing the problem solver to consider them. If we only want to minimize the number of node expansions in the search space, then the more control rules we learn, the better. But if we want to minimize the total CPU time required to solve a problem, we must consider this trade-off.

PRODIGY maintains a utility measure for each control rule. This measure takes into account the average savings provided by the rule, the frequency of its application, and the cost of matching it. If a proposed rule has a negative utility, it is discarded (or "forgotten"). If not, it is placed in long-term memory with the other rules. It is then monitored during subsequent problem solving. If its utility falls, the rule is discarded. Empirical experiments have demonstrated the effectiveness of keeping only those control rules with high utility.
Utility considerations apply to a wide range of AI learning systems; similar issues arise, for example, in dealing with large, expensive chunks in SOAR.

Learning by Induction

Classification is the process of assigning, to a particular input, the name of a class to which it belongs. The classes from which the classification procedure can choose can be described in a variety of ways. Their definition will depend on the use to which they will be put.

Classification is an important component of many problem-solving tasks. In its simplest form, it is presented as a straightforward recognition task. An example of this is the question "What letter of the alphabet is this?" But often classification is embedded inside another operation. To see how this can happen, consider a problem-solving system that contains the following production rule:

    If: the current goal is to get from place A to place B, and
        there is a WALL separating the two places
    then: look for a DOORWAY in the WALL and go through it.

To use this rule successfully, the system's matching routine must be able to identify an object as a wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be able to recognize a doorway.

Before classification can be done, the classes it will use must be defined. This can be done in a variety of ways, including the following.

Isolate a set of features that are relevant to the task domain. Define each class by a weighted sum of values of these features. Each class is then defined by a scoring function that looks very similar to the scoring functions often used in other situations, such as game playing. Such a function has the form

    c1t1 + c2t2 + c3t3 + ...

Each t corresponds to a value of a relevant parameter, and each c represents the weight to be attached to the corresponding t. Negative weights can be used to indicate features whose presence usually constitutes negative evidence for a given class.
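A minimal sketch of this kind of class definition appears below, anticipating the weather-prediction example mentioned next. It is illustrative only: each class is a made-up vector of weights, an input is a vector of feature values, and classification picks the class whose weighted sum is largest.

    # hypothetical sketch: classes defined by weighted sums of feature values
    classes = {
        # made-up weights over two features (rainfall, cold-front strength)
        "rainy": [0.9, 0.4],
        "sunny": [-0.8, -0.2],
    }

    def score(weights, features):
        return sum(c * t for c, t in zip(weights, features))

    def classify(features):
        """Return the class whose scoring function gives the largest value."""
        return max(classes, key=lambda name: score(classes[name], features))

    print(classify([0.7, 0.3]))    # -> 'rainy' under these made-up weights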
For example, if the task is weather prediction, the parameters can be such measurements as rainfall and location of cold fronts. Different functions can be written to combine these parameters to predict sunny, cloudy, rainy, or snowy weather.

Alternatively, isolate a set of features that are relevant to the task domain and define each class as a structure composed of those features. For example, if the task is to identify animals, the body of each type of animal can be stored as a structure, with various features representing such things as color, length of neck, and feathers.

There are advantages and disadvantages to each of these general approaches. The statistical approach taken by the first scheme presented here is often more efficient than the structural approach taken by the second. But the second is more flexible and more extensible.

Regardless of the way that classes are to be described, it is often difficult to construct, by hand, good class definitions. This is particularly true in domains that are not well understood or that change rapidly. Thus the idea of producing a classification program that can evolve its own class definitions is appealing. This task of constructing class definitions is called concept learning, or induction. The techniques used for this task must, of course, depend on the way that classes (concepts) are described. If classes are described by scoring functions, then concept learning can be done using the technique of coefficient adjustment. If, however, we want to define classes structurally, some other technique for learning class definitions is necessary. In this section, we present three such techniques.

Winston's Learning Program

Winston describes an early structural concept-learning program. This program operated in a simple blocks world domain. Its goal was to construct representations of the definitions of concepts in the blocks domain. For example, it learned the concepts House, Tent, and Arch shown in Figure 7.2. The figure also shows an example of a near miss for each concept. A near miss is an object that is not an instance of the concept in question but that is very similar to such instances.

The program started with a line drawing of a blocks world structure. It used procedures to analyze the drawing and construct a semantic net representation of the structural description of the object(s). This structural description was then provided as input to the learning program. An example of such a structural description for the House of Figure 7.2 is shown in Figure 7.3(a). Node A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a Brick. Figures 7.3(b) and 7.3(c) show descriptions of the two Arch structures of Figure 7.2.
These descriptions are identical except for the types of the objects on the top; one is a Brick while the other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects marry if they have faces that touch and they have a common edge. The marry relation is critical in the definition of an Arch. It is the difference between the first arch structure and the near miss arch structure shown in Figure 7.2.

Figure 7.2: Some Blocks World Concepts

The basic approach that Winston's program took to the problem of concept formation can be described as follows:

1. Begin with a structural description of one known instance of the concept. Call that description the concept definition.
2. Examine descriptions of other known instances of the concept. Generalize the definition to include them.
3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.

Steps 2 and 3 of this procedure can be interleaved.

Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and differences between structures can be detected. This process must function in much the same way as does any other matching process, such as one to determine whether a given production rule can be applied to a particular problem state.
Because differences as well as similarities must be found, the procedure must perform not just literal but also approximate matching. The output of the comparison procedure is a skeleton structure describing the commonalities between the two input structures. It is annotated with a set of comparison notes that describe specific similarities and differences between the inputs.
To see how this approach works, we trace it through the process of learning what an arch is. Suppose that the arch description of Figure 7.3(b) is presented first. It then becomes the definition of the concept Arch. Then suppose that the arch description of Figure 7.3(c) is presented.
The comparison routine will return a structure similar to the two input structures, except that it will note that the objects represented by the nodes labeled C are not identical. This structure is shown as Figure 7.4. The c-note link from node C describes the difference found by the comparison routine. It notes that the difference occurred in the isa link, and that in the first structure the isa link pointed to Brick, and in the second it pointed to Wedge. It also notes that if we were to follow isa links from Brick and Wedge, these links would eventually merge. At this point, a new description of the concept Arch can be generated. This description could say simply that node C must be either a Brick or a Wedge. But since this particular disjunction has no previously known significance, it is probably better to trace up the isa hierarchies of Brick and Wedge until they merge. Assuming that that happens at the node Object, the Arch definition shown in Figure 7.4 can be built.

Figure 7.3: The Structural Descriptions
Figure 7.4
Figure 7.5: The Arch Description after Two Examples

Next, suppose that the near miss arch is presented. This time, the comparison routine will note that the only difference between the current definition and the near miss is in the does-not-marry link between nodes B and D. But since this is a near miss, we do not want to broaden the definition to include it. Instead, we want to restrict the definition so that it is specifically excluded. To do this, we modify the link does-not-marry, which may simply be recording something that has happened by chance to be true of the small number of examples that have been presented. It must now say must-not-marry. Actually, must-not-marry should not be a completely new link. There must be some structure among link types to reflect the relationships between marry, does-not-marry, and must-not-marry.

Notice how the problem-solving and knowledge representation techniques we covered in earlier chapters are brought to bear on the problem of learning. Semantic networks were used to describe block structures, and an isa hierarchy was used to describe relationships among already known objects. A matching process was used to detect similarities and differences between structures, and hill climbing allowed the program to evolve a more and more accurate concept definition.

This approach to structural concept learning is not without its problems. One major problem is that a teacher must guide the learning program through a carefully chosen sequence of examples. In the next section, we explore a learning technique that is insensitive to the order in which examples are presented.

Example Near Misses:
Goal

Winston's Arch Learning Program models how we humans learn a concept such as "What is an Arch?" This is done by presenting to the "student" (the program) examples and counterexamples (which are, however, very similar to examples). The latter are called "near misses." The program builds (Winston's version of) a semantic network. With every example and with every near miss, the semantic network gets better at predicting what exactly an arch is.

Learning Procedure W

1. Let the description of the first sample, which must be an example, be the initial description.
2. For all other samples:
   - If the sample is a near miss, call SPECIALIZE.
   - If the sample is an example, call GENERALIZE.

Specialize

• Establish a matching between the parts of the near miss and the parts of the model.
• IF there is a single most important difference THEN:
   - If the model has a link that is not in the near miss, convert it into a MUST link.
   - If the near miss has a link that is not in the model, create a MUST-NOT link.
  ELSE ignore the near miss.

Generalize

1. Establish a matching between parts of the sample and the model.
2. For each difference:
   - If the link in the model points to a super/sub-class of the link in the sample, a MUST-BE link is established at the join point.
   - If the links point to classes of an exhaustive set, drop the link from the model.
   - Otherwise create a new common class and a MUST-BE arc to it.
   - If a link is missing in the model or the sample, drop it in the other.
   - If numbers are involved, create an interval that contains both numbers.
   - Otherwise, ignore the difference.

Because the only difference is the missing support, and it is a near miss, that arc must be important.

New Model
This uses SPECIALIZE. (Actually it uses it twice; this is an exception.)

However, the following principles may be applied:

1. The teacher must use near misses and examples judiciously.
2. You cannot learn if you cannot know (=> KR).
3. You cannot know if you cannot distinguish relevant from irrelevant features. (=> Teachers can help here, by near misses.)
4. No-Guessing Principle: When there is doubt about what to learn, learn nothing.
5. Learning conventions called "felicity conditions" help. The teacher knows that there should be only one relevant difference, and the student uses it.
6. No-Altering Principle: If an example fails the model, create an exception. (Penguins are birds.)
7. Martin's Law: You can't learn anything unless you almost know it already.

A simplified sketch of procedure W is given below.
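The sketch below is a much-simplified, hypothetical rendering of procedure W, not Winston's program. A model and a sample are represented as sets of (part, link, value) triples; SPECIALIZE reacts to a near miss by upgrading or forbidding the single differing link, and GENERALIZE reacts to a new example by keeping only the links the example shares. Class hierarchies, intervals, and the other refinements listed above are omitted.

    # hypothetical, much-simplified sketch of Winston's procedure W
    # a description is a set of (part, link, value) triples
    def specialize(model, near_miss):
        only_model = model - near_miss
        only_miss = near_miss - model
        if len(only_model) + len(only_miss) != 1:
            return model                    # no single most important difference: ignore
        if only_model:                      # the model has a link the near miss lacks
            part, link, value = next(iter(only_model))
            return (model - only_model) | {(part, "MUST-" + link, value)}
        part, link, value = next(iter(only_miss))
        return model | {(part, "MUST-NOT-" + link, value)}

    def generalize(model, example):
        return model & example              # crude: keep only what the example shares

    def learn(samples):
        """samples: list of (description, is_example); the first must be an example."""
        model = samples[0][0]
        for description, is_example in samples[1:]:
            model = generalize(model, description) if is_example \
                    else specialize(model, description)
        return model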
Learning Rules from Examples

Given are 8 instances:

1. DECCG
2. AGDBC
3. GCFDC
4. CDFDE
5. CEGEC

(Comment: "These are good")

1. CGCGF
2. DGFCD
3. ECGCD

(Comment: "These are bad")

The following rule perfectly describes the warnings:

(M(1,C) ∧ M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F)) ∨
(M(1,D) ∧ M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨
(M(1,E) ∧ M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D)) -> Warning

Now we eliminate the first conjunct from all 3 disjuncts:

(M(2,G) ∧ M(3,C) ∧ M(4,G) ∧ M(5,F)) ∨
(M(2,G) ∧ M(3,F) ∧ M(4,C) ∧ M(5,D)) ∨
(M(2,C) ∧ M(3,G) ∧ M(4,C) ∧ M(5,D)) -> Warning

Now we check whether any of the good strings 1-5 would cause a warning. Strings 1, 4, and 5 are out at the first position (actually it is the second position now); strings 2 and 3 are out at the next (third) position. Thus, we do not really need the first conjunct in the Warning rule!

We keep repeating this dropping process until we are left with

M(5,F) ∨ M(5,D) -> Warning

This rule contains all the knowledge contained in the examples!
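The dropping procedure just carried out by hand is easy to express in code. The sketch below is a hypothetical illustration of exactly that procedure: it starts from one disjunct per bad string, tries to delete one position at a time in a fixed left-to-right order, and keeps a deletion only if no good string would then trigger a warning.

    # hypothetical sketch of the dropping-conditions procedure described above
    good = ["DECCG", "AGDBC", "GCFDC", "CDFDE", "CEGEC"]
    bad = ["CGCGF", "DGFCD", "ECGCD"]

    def warns(rule, string):
        """rule: list of disjuncts, each a dict {position: required letter}."""
        return any(all(string[p] == ch for p, ch in d.items()) for d in rule)

    # start with one disjunct per bad string, mentioning every position
    rule = [{p: s[p] for p in range(5)} for s in bad]

    for p in range(5):                                    # left-to-right order
        candidate = [{q: ch for q, ch in d.items() if q != p} for d in rule]
        if not any(warns(candidate, g) for g in good):    # still rejects every good string?
            rule = candidate

    print(rule)   # -> [{4: 'F'}, {4: 'D'}, {4: 'D'}], i.e. M(5,F) ∨ M(5,D) -> Warning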
Note that this elimination approach is order dependent! Working from right to left you would get:

(M(1,C) ∧ M(2,G)) ∨ (M(1,D) ∧ M(2,G)) ∨ M(1,E) -> Warning

Therefore the resulting rule is order dependent! There are N! (factorial!) orders in which we could try to delete features. Also, there is no guarantee that the rules will work correctly on new instances. In fact, rules are not even guaranteed to exist!

To deal with the N! problem, try to find "higher-level features". For instance, if two features always occur together, replace them by one feature.

Student Activity 7.2

Before reading the next section, answer the following questions.
1. What do you mean by learning by induction? Explain with an example.
2. Describe learning by macro-operators.
3. Explain why a good knowledge representation is useful in reasoning.
If your answers are correct, then proceed to the next section.

Explanation-Based Learning

The previous section illustrated how we can induce concept descriptions from positive and negative examples. Learning complex concepts using these procedures typically requires a substantial number of training instances. But people seem to be able to learn quite a bit from single examples. Consider a chess player who, as Black, has reached the position shown in Figure 7.6. The position is called a "fork" because the white knight attacks both the black king and the black queen. Black must move the king, thereby leaving the queen open to capture. From this single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece x attacks both the opponent's king and another piece y, then piece y will be lost. We don't need to see dozens of positive and negative examples of fork positions in order to draw these conclusions. From just one experience, we can learn to avoid this trap in the future and perhaps to use it to our own advantage.
What makes such single-example learning possible? The answer, not surprisingly, is knowledge. The chess player has plenty of domain-specific knowledge that can be brought to bear, including the rules of chess and any previously acquired strategies. That knowledge can be used to identify the critical aspects of the training example. In the case of the fork, we know that the double simultaneous attack is important while the precise position and type of the attacking piece is not.

Much of the recent work in machine learning has moved away from the empirical, data-intensive approach described in the last section toward this more analytical, knowledge-intensive approach. A number of independent studies led to the characterization of this approach as explanation-based learning. An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept. The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.

Figure 7.6: A Fork Position in Chess

Mitchell et al. and DeJong and Mooney both describe general frameworks for EBL programs and give general learning algorithms. We can think of EBL programs as accepting the following as input:

1. A Training Example - what the learning program "sees" in the world
2. A Goal Concept - a high-level description of what the program is supposed to learn
3. An Operationality Criterion - a description of which concepts are usable
4. A Domain Theory - a set of rules that describe relationships between objects and actions in a domain
From this, EBL computes a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion. Let's look more closely at this specification. The training example is a familiar input: it is the same thing as the example in the version space algorithm. The goal concept is also familiar, but in previous sections we have viewed the goal concept as an output of the program, not an input. The assumption here is that the goal concept is not operational, just like the high-level card-playing advice. An EBL program seeks to operationalize the goal concept by expressing it in terms that a problem-solving program can understand. These terms are given by the operationality criterion. In the chess example, the goal concept might be something like "bad position for Black," and the operationalized concept would be a generalized description of situations similar to the training example, given in terms of pieces and their relative positions. The last input to an EBL program is a domain theory, in our case the rules of chess. Without such knowledge, it is impossible to come up with a correct generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. It has two steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune away all the unimportant aspects of the training example with respect to the goal concept. What is left is an explanation of why the training example is an instance of the goal concept. This explanation is expressed in terms that satisfy the operationality criterion. The next step is to generalize the explanation as far as possible while still describing the goal concept.

Following our chess example, the first EBL step chooses to ignore White's pawns, king, and rook, and constructs an explanation consisting of White's knight, Black's king, and Black's queen, each in their specific positions. Operationality is ensured: all chess-playing programs understand the basic concepts of piece and position. Next, the explanation is generalized. Using domain knowledge, we find that moving the pieces to a different part of the board is still bad for Black. We can also determine that other pieces besides knights and queens can participate in fork attacks.

In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not pursue this example further. Instead, let's look at a simpler case. Consider the problem of learning the concept Cup. Unlike the arch-learning program, we want to be able to generalize from a single example of a cup. Suppose the example is:

Training Example:

owner(Object23, Ralph) ∧ has-part(Object23, Concavity12) ∧
is(Object23, Light) ∧ color(Object23, Brown) ∧ ...

Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in this unit we have seen several methods for isolating relevant features. All these methods require many positive and negative examples. In EBL we instead rely on domain knowledge, such as:

Domain Knowledge:

is(x, Light) ∧ has-part(x, y) ∧ isa(y, Handle) → liftable(x)
has-part(x, y) ∧ isa(y, Bottom) ∧ is(y, Flat) → stable(x)
has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) → open-vessel(x)

We also need a goal concept to operationalize:

Goal Concept: Cup
x is a Cup if x is liftable, stable, and an open-vessel.

Operationality Criterion: The concept definition must be expressed in purely structural terms (e.g., Light, Flat, etc.).

Given a training example and a functional description, we want to build a general structural description of a cup. The first step is to explain why Object23 is a cup. We do this by constructing a proof, as shown in Figure 7.7. Standard theorem-proving techniques can be used to find such a proof. Notice that the proof isolates the relevant features of the training example; nowhere in the proof do the predicates owner and color appear. The proof also serves as a basis for a valid generalization. If we gather up all the assumptions and replace constants with variables, we get the following description of a cup:

has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) ∧
has-part(x, z) ∧ isa(z, Bottom) ∧ is(z, Flat) ∧
has-part(x, w) ∧ isa(w, Handle) ∧ is(x, Light)

This definition satisfies the operationality criterion and could be used by a robot to classify objects.
Figure 7.7: An Explanation

Simply replacing constants by variables worked in this example, but in some cases it is necessary to retain certain constants. To catch these cases, we must reprove the goal. This process, which we saw earlier in our discussion of learning in STRIPS, is called goal regression.

As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are examples needed at all? We could have operationalized the goal concept Cup without reference to an example, since the domain theory contains all of the requisite information. The answer is that examples help to focus the learning on relevant operationalizations. Without an example cup, EBL is faced with the task of characterizing the entire range of objects that satisfy the goal concept. Most of these objects will never be encountered in the real world, and so the result will be overly general.

Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn with very primitive relations. Instead, they create incomplete and inconsistent domain theories. For example, returning to chess, such a theory might include concepts like "weak pawn structure." Getting EBL to work in ill-structured domain theories is an active area of research.

EBL shares many features of all the learning methods described in earlier sections. Like concept learning, EBL begins with a positive example of some concept. As in learning by advice taking, the goal is to operationalize some piece of knowledge. And EBL techniques, like the techniques of chunking and macro-operators, are often used to improve the performance of problem-solving engines. The major difference between EBL and other learning methods is that EBL programs are built to take advantage of domain knowledge. Since learning is just another kind of problem solving, it should come as no surprise that there is leverage to be found in knowledge.
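The two EBG steps can be sketched in code for the Cup example. The sketch below is a hypothetical toy, not a theorem prover and not Mitchell et al.'s algorithm: the three domain rules given above are hard-coded as functions that return the leaf facts they use, the explanation is the union of those facts, and the generalization step naively replaces the specific object names (made-up names such as Handle16) by variables. Note that owner and color never enter the explanation.

    # hypothetical toy sketch of explanation-based generalization for the Cup example
    def liftable(x, facts):
        for (_, _, y) in [f for f in facts if f[:2] == ("has-part", x)]:
            if ("isa", y, "Handle") in facts and ("is", x, "Light") in facts:
                return {("is", x, "Light"), ("has-part", x, y), ("isa", y, "Handle")}

    def stable(x, facts):
        for (_, _, y) in [f for f in facts if f[:2] == ("has-part", x)]:
            if ("isa", y, "Bottom") in facts and ("is", y, "Flat") in facts:
                return {("has-part", x, y), ("isa", y, "Bottom"), ("is", y, "Flat")}

    def open_vessel(x, facts):
        for (_, _, y) in [f for f in facts if f[:2] == ("has-part", x)]:
            if ("isa", y, "Concavity") in facts and ("is", y, "Upward-Pointing") in facts:
                return {("has-part", x, y), ("isa", y, "Concavity"),
                        ("is", y, "Upward-Pointing")}

    def explain_cup(x, facts):
        """Step 1: the explanation is the union of the leaf facts each rule used."""
        parts = [liftable(x, facts), stable(x, facts), open_vessel(x, facts)]
        return set().union(*parts) if all(parts) else None

    def generalize(proof):
        """Step 2 (naive): replace each specific object name by a variable."""
        names = {}
        def var(term):
            if any(ch.isdigit() for ch in term):      # e.g. Object23, Handle16
                names.setdefault(term, "?x%d" % (len(names) + 1))
                return names[term]
            return term
        return {tuple(var(t) for t in fact) for fact in proof}

    facts = {("owner", "Object23", "Ralph"), ("color", "Object23", "Brown"),
             ("is", "Object23", "Light"),
             ("has-part", "Object23", "Handle16"), ("isa", "Handle16", "Handle"),
             ("has-part", "Object23", "Bottom5"), ("isa", "Bottom5", "Bottom"),
             ("is", "Bottom5", "Flat"),
             ("has-part", "Object23", "Concavity12"), ("isa", "Concavity12", "Concavity"),
             ("is", "Concavity12", "Upward-Pointing")}

    print(generalize(explain_cup("Object23", facts)))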
Learning Automation

The theory of learning automata was first introduced in 1961 (Tsetlin, 1961). Since that time these systems have been studied intensely, both analytically and through simulations (Lakshmivarahan, 1981). Learning automata systems are finite-state adaptive systems which interact iteratively with a general environment. Through a probabilistic trial-and-error response process they learn to choose or adapt to a behaviour that produces the best response. They are, essentially, a form of weak, inductive learners.

In Figure 7.8, we see that the learning model for learning automata has been simplified to just two components, an automaton (learner) and an environment. The learning cycle begins with an input to the learning automata system from the environment. This input elicits one of a finite number of possible responses, and the environment then provides some form of feedback to the automaton in return. The automaton uses this feedback to alter its stimulus-response mapping structure so as to improve its behaviour in a more favourable way.

As a simple example, suppose a learning automaton is being used to learn the best temperature control setting for your office each morning. It may select any one of ten temperature range settings at the beginning of each day (Figure 7.9). Without any prior knowledge of your temperature preferences, the automaton randomly selects a first setting using the probability vector corresponding to the temperature settings.

Figure 7.8: Learning Automaton Model
Figure 7.9: Temperature Control Model

Since the probability values are uniformly distributed, any one of the settings will be selected with equal likelihood. After the selected temperature has stabilized, the environment may respond with a simple good-bad feedback response. If the response is good, the automaton will modify its probability vector by rewarding the probability corresponding to the good setting with a positive increment and reducing all other probabilities proportionately to maintain the sum equal to 1. If the response is bad, the automaton will penalize the selected setting by reducing the probability corresponding to the bad setting and increasing all other values proportionately. This process is repeated each day until the good selections have high probability values and all bad choices have values near zero. Thereafter, the system will always choose the good settings. If, at some point in the future, your temperature preferences change, the automaton can easily readapt.
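The reward/penalty update just described is easy to state in code. The sketch below is a hypothetical illustration with a made-up learning rate; it is not taken from the works cited in this section. A chosen setting is rewarded or penalized and the vector is renormalized so that the probabilities continue to sum to 1.

    import random

    # hypothetical sketch of the probability-vector update described above
    def choose(p):
        """Pick a setting index at random according to the probability vector p."""
        return random.choices(range(len(p)), weights=p)[0]

    def update(p, chosen, good, rate=0.1):
        """Reward or penalize the chosen setting, then renormalize so sum(p) == 1."""
        p = p[:]
        p[chosen] += rate if good else -rate * p[chosen]
        total = sum(p)
        return [x / total for x in p]

    p = [0.1] * 10                        # ten settings, initially equally likely
    setting = choose(p)
    p = update(p, setting, good=True)     # the environment said the setting was good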
Learning automata have been generalized and studied in various ways. One such generalization has been given the special name of collective learning automata (CLA). CLAs are standard learning automata systems except that feedback is not provided to the automaton after each response. In this case, several collective stimulus-response actions occur before feedback is passed to the automaton. It has been argued (Bock, 1976) that this type of learning more closely resembles that of human beings, in that we usually perform a number or group of primitive actions before receiving feedback on the performance of such actions, such as solving a complete problem on a test or parking a car. We illustrate the operation of CLAs with an example of learning to play the game of Nim in an optimal way.

Nim is a two-person zero-sum game in which the players alternate in removing tokens from an array that initially has nine tokens. The tokens are arranged into three rows with one token in the first row, three in the second row, and five in the third row (Figure 7.10).

Figure 7.10: Nim Initial Configuration

The first player must remove at least one token but not more than all the tokens in any single row. Tokens can only be removed from a single row during each player's move. The second player responds by removing one or more tokens remaining in any row. Players alternate in this way until all tokens have been removed; the loser is the player forced to remove the last token.

We will use the triple (n1, n2, n3) to represent the state of the game at a given time, where n1, n2 and n3 are the numbers of tokens in rows 1, 2, and 3, respectively. We will also use a matrix to determine the moves made by the CLA for any given state. The matrix of Figure 7.11 has column headings which correspond to the state of the game when it is the CLA's turn to move, and row headings which correspond to the new game state after the CLA has completed a move. Fractional entries in the matrix are transition probabilities used by the CLA to execute each of its moves. Asterisks in the matrix represent invalid moves.

Beginning with the initial state (1, 3, 5), suppose the CLA's opponent removes two tokens from the third row, resulting in the new state (1, 3, 3). If the CLA then removes all three tokens from the second row, the resultant state is (1, 0, 3). Suppose the opponent now removes all remaining tokens from the third row. This leaves the CLA with a losing configuration of (1, 0, 0).

Figure 7.11: CLA Internal Representation of Game States

At the start of the learning sequence, the matrix is initialized such that the elements in each column are equal (uniform) probability values.
  • 33. from the state (1, 3, 4), each column element under this state that corresponds to a valid move has been given the same (uniform) probability value.
The CLA selects moves probabilistically using the probability values in each column. So, for example, if the CLA had the first move, any row intersecting the first column that does not contain an asterisk would be chosen with probability 1/9. This choice then determines the new game state from which the opponent must select a move. The opponent might have a similar matrix to record game states and choose moves. A complete game is played before the CLA is given any feedback, at which time it is informed whether its responses were good or bad. This is the collective feature of the CLA.
If the CLA wins a game, all moves made by the CLA during that game are rewarded by increasing the probability value in each column corresponding to the winning move. All non-winning probabilities in those columns are reduced equally to keep the sum in each column equal to 1. If the CLA loses a game, the moves leading to that loss are penalized by reducing the probability values corresponding to each losing move. All other probabilities in the columns having a losing move are increased equally to keep the column totals equal to 1.
After a number of games have been played by the CLA, the matrix elements that correspond to repeated wins will increase toward one, while all other elements in those columns will decrease toward zero. Consequently, the CLA will choose the winning moves more frequently and thereby improve its performance.
Simulated games between a CLA and various types of opponents have been performed and the results plotted (Bock, 1985). It was shown, for example, that two CLAs playing against each other required about 300 games before each learned to play optimally. Note, however, that convergence to optimality can be accomplished with fewer games if the opponent always plays optimally (or poorly), since, in such a case, the CLA will repeatedly lose (win) and quickly reduce (increase) the losing (winning) move elements to zero (one). It is also possible to speed up the learning process through the use of other techniques such as learned heuristics.
Learning systems based on the learning automaton or CLA paradigm are fairly general for applications in which a suitable state representation scheme can be found. They are also quite robust learners. In fact, it has been shown that an LA will converge to an optimal distribution under fairly general conditions if the feedback is accurate with probability greater than 0.5 (Narendra
  • 34. LEARNING 211and Thathachar, 1974). Of course, the rate of convergence is strongly dependent on the reliabilityof the feedback.Learning automata are not very efficient learners as was noted in the game-playing exampleabove. They are, however, relatively easy to implement, provided the number of states is not toolarge. When the number of states becomes large, the amount of storage and the computationrequired to update the transition matrix becomes excessive.Potential applications for learning automata include adaptive telephone routing and control. Suchapplications have been studied using simulation programs (Narendra et al., 1977).Genetic AlgorithmsGenetic algorithm learning methods are based on models of natural adaptation and evolution.These learning systems improve their performance through processes that model populationgenetics and survival of the fittest. They have been studied since the early 1960s (Holland, 1962,1975).In the field of genetics, a population is subjected to an environment that places demands on themembers. The members that adapt well are selected for mating and reproduction. The offspringof these better performers inherit genetic traits from both their parents. Members of this secondgeneration of offspring, which also adapt well, are then selected for mating and reproduction andthe evolutionary cycle continues. Poor performers die off without leaving offspring. Goodperformers produce good offspring and they, in turn, perform well. After some number ofgenerations, the resultant population will have adapted optimally or at least very well to theenvironment.Genetic algorithm systems start with a fixed size population of data structures that are used toperform some given tasks. After requiring the structures to execute the specified tasks somenumber of times, the structures are rated on their performance, and a new generation of datastructures is then created. Mating the higher performing structures to produce offspring createsthe new generation. These offspring and their parents are then retained for the next generationswhile the poorer performing structures are discarded. The basic cycle is illustrated in Figure 7.12.Mutations are also performed on the best performing structures to insure that the full space ofpossible structures is reachable. This process is repeated for a number of generations until theresultant population consists of only the highest performing structures.
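The basic generate-rate-reproduce cycle of Figure 7.12 can be sketched as follows. The bit-string representation, population size, and the toy fitness function are placeholder choices for illustration; only the select-mate-mutate loop itself reflects the cycle described above.

```python
import random

POP_SIZE, STRING_LEN = 30, 8

def fitness(bits):
    # Placeholder utility rating; a real system would score task performance.
    return sum(bits)

def crossover(a, b):
    # Concatenate the head of one parent to the tail of the other.
    point = random.randint(1, STRING_LEN - 1)
    return a[:point] + b[point:]

def mutate(bits, rate=0.01):
    # Occasionally flip a bit so every structure in the space stays reachable.
    return [1 - b if random.random() < rate else b for b in bits]

def next_generation(population):
    # Keep the better-performing half as parents and breed offspring from them.
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: POP_SIZE // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    return parents + children

population = [[random.randint(0, 1) for _ in range(STRING_LEN)] for _ in range(POP_SIZE)]
for _ in range(50):                   # run some number of generations
    population = next_generation(population)
```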
  • 35. 212 ARTIFICIAL INTELLIGENCEData structures that make up the population can represent rules or any other suitable types ofknowledge structure. To illustrate the genetic aspects of the problem, assume for simplicity thatthe populations of structures are fixed-length binary strings such as the eight-bit string 11010001.An initial population of these eight-bit strings would be generated randomly or with the use ofheuristics at time zero. These strings, which might be simple condition an action rules, would thenbe assigned some tasks to perform (like predicting the weather based on certain physical andgeographic conditions or diagnosing a fault in a piece of equipment). Figure 7.12: Genetic AlgorithmAfter multiple attempts at executing the tasks, each of the participating structures would be ratedand tagged with a utility value, u, commensurate with its performance. The next population wouldthen be generated using the higher performing structures as parents and the process would berepeated with the newly produced generation. After many generations the remaining populationstructures should perform the desired tasks well.Mating between two strings is accomplished with the crossover operation which randomly selectsa bit position in the eight-bit string and concatenates the head of one parent to the tail of thesecond parent to produce the offspring. Suppose the two parents are designated as xxxxxxxx andyyyyyyyy respectively, and suppose the third bit position has been selected as the crossover point(at the position of the colon in the structure xxx: xxxxx). After the crossover operation is applied,
  • 36. LEARNING 213two offspring are then generated, namely xxxyyyyy and yyyxxxxx. Such offspring and theirparents are then used to make up the next generation of structures.A second genetic operation often used is called inversion. Inversion is a transformation applied toa single string. A bit position is selected at random, and when applied to a structure; the inversionoperation concatenates the tail of the string to the head of the same string. Thus, if the sixthposition were selected (x1x2x3x4x5x6x7x8), the inverted string would be x7x8x1x2x3x4x5x6.A third operator, mutation, is used to insure that all locations of the rule space are reachable, thatevery potential rule in the rule space is available for evaluation. This insures that the selectionprocess does not get caught in a local minimum. For example, it may happen that use of thecrossover and inversion operators will only produce a set of structures that are better than alllocal neighbors but not optimal in a global sense. This can happen since crossover and inversionmay not be able to produce some undiscovered structures. The mutation operators can overcomethis by simply selecting any bit position in a string at random and changing it. This operator istypically used only infrequently to prevent random wandering in the search space.The genetic paradigm is best understood through an example. To illustrate similarities betweenthe learning automation paradigm and the genetic paradigm we use the same learning task of theprevious section, namely learning to play the game of Nim optimally. We use a slightly differentrepresentation scheme here since we want a population of structures that are easily transformed.To do this, we let each member of the population consist of a pair of triplets augmented with autility value u, ((n1, n2, n3) (m1, m2, m3) u), where the first pair is the game state presented to thegenetic algorithm system prior to its move, and the second triple is the state after the move. The uvalues represent the worth or current utility of the structure at any given time.Before the game begins, the genetic system randomly generates an initial population of K triple-pair members. The population size K is one of the important parameters that must be selected.Here, we simply assume it is about 25 or 30, which should be more than the number of movesneeded for any optimal play. All members are assigned an initial utility value of 0. The learningprocess then proceeds as follows:1. The environment presents a valid triple to the genetic system.2. The genetic system searches for all population triple pairs that have a first triple that matches the input triple. From those that match, the first one found having the highest utility value u is selected and the second triple is returned as the new game state. If no match is found, the
  • 37. 214 ARTIFICIAL INTELLIGENCE genetic system randomly generates a triple which represents a valid move, returns this as the new state, and stores the triple pairs, the input, and newly generated triple as an addition to the population.3. The above two steps are repeated until the game is terminated, in which case the genetic system is informed whether a win or loss occurred. If the system wins, each of the participating member moves has its utility value increased. If the system loses, each participating member has its utility value decreased.4. The above steps are repeated until a fixed number of games have been played. At this time a new generation is created.The new generation is created from the old population by first selecting a fraction (say one half) ofthe members having the highest utility values. From these, offspring are obtained by applicationof appropriate genetic operators.The three operators, crossover, inversion, and mutation, randomly modify the parent moves togive new offspring move sequences. (The best choice of genetic operators to apply in thisexample is left as an exercise). Each offspring inherits a utility value from one of the parents.Population members having low utility values are discarded to keep the population size fixed.This whole process is repeated until the genetic system has learned all the optimal moves. Thiscan be determined when the parent population ceases to change or when the genetic systemrepeatedly wins.The similarity between the learning automaton and genetic paradigms should be apparent fromthis example. Both rely on random move sequences, and the better moves are rewarded whilethe poorer ones are penalized.Neural NetworkNeural networks are large networks of simple processing elements or nodes that processinformation dynamically in response to external inputs. The nodes are simplified models ofneurons. The knowledge in a neural network is distributed throughout the network in the form ofinternode connections and weighted links that form the inputs to the nodes. The link weightsserve to enhance or inhibit the input stimuli values that are then added together at the nodes. If
  • 38. LEARNING 215the sum of all the inputs to a node exceeds some threshold value T, the node executes andproduces an output that is passed on to other nodes or is used to produce some output response.In the simplest case, no output is produced if the total input is less than T. In more complexmodels, the output will depend on a nonlinear activation function.Neural networks were originally inspired as being models of the human nervous system. They aregreatly simplified models to be sure (neurons are known to be fairly complex processors). Evenso, they have been shown to exhibit many "intelligent" abilities, such as learning, generalization,and abstraction.A single node is illustrated in Figure 7.13. The inputs to the node are the values X 1, X2 . . . , Xnwhich typically take on values of -1, 0, 1, or real values within the range (—1,1). The weights w1,w2,. . . ., wn, correspond to the synaptic strengths of a neuron. They serve to increase or decreasethe effects of the corresponding x, input values. The sum of the products x i × wi i = 1,2,. . . , n,serve as the total combined input to the node. If this sum is large enough to exceed the thresholdamount T, the node fires, and produces an output y, an activation function value placed on thenode’s output links. The output may then be the input to other nodes or the final output responsefrom the network. Figure 7.13: Model of a single neuron (node)Figure 7.14 illustrates three layers of a number of interconnected nodes. The first layer serves asthe input layer, receiving inputs from some set of stimuli. The second layer (called the hiddenlayer layer) receive inputs from the first layer and produces a pattern of inputs to the third layer,the output layer. The. pattern of outputs from the final layer are the networks responses to theinput stimuli patterns. Input links to layer j(j=1, 2, 3) have weights wij for I=1, 2,…….,n.General multiplayer networks having n nodes (number of rows) in each of m layers (number ofcolumns of nodes) will have weights represented as an n x m matrix W. Using this representation,nodes having no interconnecting links will have a weight value of zero. Networks consisting of
  • 39. more than three layers would, of course, be correspondingly more complex than the network depicted in Figure 7.14.
A neural network can be thought of as a black box that transforms the input vector x to the output vector y, where the transformation performed is the result of the pattern of connections and weights, that is, of the values in the weight matrix W.
Consider the vector (dot) product x · w = Σi xi wi. There is a geometric interpretation for this product: it is equivalent to projecting one vector onto the other in n-dimensional space. This notion is depicted in Figure 7.15 for the two-dimensional case. The magnitude of the product is given by x · w = |x| |w| cos θ, where |x| denotes the norm, or length, of the vector x. Note that this product is a maximum when both vectors point in the same direction, that is, when θ = 0. The product is a minimum when both point in opposite directions, that is, when θ = 180 degrees. This illustrates how the vectors in the weight matrix W influence the inputs to the nodes in a neural network.
Figure 7.14: A Multilayer Neural Network
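A single node's weighted-sum behaviour, and the dot-product view of it, can be written directly. The threshold value and the sample vectors below are arbitrary illustrations, not values from the text.

```python
import math

def node_output(x, w, threshold=0.0):
    # The node fires (outputs 1) when the weighted sum of its inputs exceeds T.
    total = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if total > threshold else 0

def cos_theta(x, w):
    # x . w = |x| |w| cos(theta): largest when the vectors point the same way.
    dot = sum(xi * wi for xi, wi in zip(x, w))
    return dot / (math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in w)))

print(node_output([1, -1, 1], [0.5, 0.2, 0.4]))   # 1, since the sum 0.7 exceeds T = 0
print(cos_theta([1.0, 0.0], [1.0, 0.0]))          # 1.0  (theta = 0 degrees)
print(cos_theta([1.0, 0.0], [-1.0, 0.0]))         # -1.0 (theta = 180 degrees)
```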
  • 40. Figure 7.15: Vector Multiplication is Like Vector Projection
Learning pattern weights
The interconnections and weights W in the neural network store the knowledge possessed by the network. These weights must be preset or learned in some manner. When learning is used, the process may be either supervised or unsupervised. In the supervised case, learning is performed by repeatedly presenting the network with an input pattern and a desired output response. The training examples then consist of the vector pairs (x, y), where x is the input pattern and y is the desired output response pattern. The weights are adjusted until the actual output response y′ matches the desired response y, that is, until the difference D = y − y′ is near zero.
One of the simpler supervised learning algorithms uses the following formula to adjust the weights W:
Wnew = Wold + a × D × x / |x|²
where 0 < a < 1 is a learning constant that determines the rate of learning. When the difference D is large, the adjustment to the weights W is large, but when the actual response y′ is close to the target response y, the adjustment will be small. When the difference D is near zero, the training process terminates, at which point the network will produce the correct response for the given input patterns x.
In unsupervised learning, the training examples consist of the input vectors x only. No desired response y is available to guide the system. Instead, the learning process must find the weights w with no knowledge of the desired output response.
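A minimal sketch of that supervised update rule for a single linear output unit follows. The training pairs, learning constant, and stopping tolerance are invented for illustration.

```python
def train_weights(pairs, w, a=0.5, tol=1e-3, max_passes=1000):
    # Nudge the weights by a * D * x / |x|^2, where D is the difference between
    # the desired response y and the actual response y'.
    for _ in range(max_passes):
        worst = 0.0
        for x, y in pairs:
            actual = sum(xi * wi for xi, wi in zip(x, w))
            d = y - actual
            norm_sq = sum(xi * xi for xi in x)
            w = [wi + a * d * xi / norm_sq for wi, xi in zip(w, x)]
            worst = max(worst, abs(d))
        if worst < tol:               # D near zero for every pattern: stop training
            break
    return w

# Learn weights reproducing y = 2*x1 - x2 from four sample patterns.
pairs = [([1, 0], 2), ([0, 1], -1), ([1, 1], 1), ([2, 1], 3)]
print(train_weights(pairs, w=[0.0, 0.0]))
```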
  • 41. 218 ARTIFICIAL INTELLIGENCEA neural net expert systemAn example of a simple neural network diagnostic expert system has been described by StephenGallant (1988). This system diagnoses and recommends treatments for acute sarcophagaldisease. The system is illustrated in Figure 7.16. From six symptom variables u 1, u2,……., u6, oneof two possible diseases can be diagnosed, u7 or ug. From the resultant diagnosis, one of threetreatments, u9, u10, or u11 can then be recommended. Figure 7.16: A Simple Neural Network Expert System.When a given symptom is present, the corresponding variable is given a value of +1 (true).Negative symptoms are given an input value of -1 (false), and unknown symptoms are given thevalue 0. Input symptom values are multiplied by their corresponding weights W i0 Numbers withinthe nodes are initial bias weights Wi0 and numbers on the links are the other node input weights.When the sum of the weighted products of the inputs exceeds 0, an output will be present on thecorresponding node output and serve as an input to the next layer of nodes.
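The threshold behaviour of one diagnostic node can be sketched as below. The bias and connection weights here are hypothetical, chosen only to be consistent with the worked example in the next paragraph; the real network's weights come from Gallant's training procedure.

```python
def diagnostic_node(bias, weights, inputs):
    # Inputs are +1 (true), -1 (false) or 0 (unknown). The node's conclusion is
    # +1 when the bias plus the weighted sum of inputs exceeds 0, otherwise -1.
    total = bias + sum(w * u for w, u in zip(weights, inputs))
    return 1 if total > 0 else -1

# Hypothetical weights for the superciliosis node u7: bias 0, and weights
# 2, -2, 3 on the symptoms u1, u2, u3.
u1, u2, u3 = 1, -1, -1        # swollen feet; no red ears; no hair loss
print(diagnostic_node(0, [2, -2, 3], [u1, u2, u3]))   # +1: superciliosis indicated
```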
  • 42. As an example, suppose the patient has swollen feet (u1 = +1) but not red ears (u2 = −1) nor hair loss (u3 = −1). This gives a value of u7 = +1 (since 0 + (2)(1) + (−2)(−1) + (3)(−1) = 1), suggesting the patient has superciliosis.
When it is also known that the other symptoms of the patient are false (u5 = u6 = −1), it may be concluded that namatosis is absent (u8 = −1), and therefore that birambio (u10 = +1) should be prescribed while placibin should not be prescribed (u9 = −1). In addition, it will be found that posiboost should also be prescribed (u11 = +1).
The training algorithm added the intermediate triangular-shaped nodes. These additional nodes are needed so that weight assignments can be made which permit the computations to work correctly for all training instances.
Deductions can be made just as well when only partial information is available. For example, when a patient has swollen feet and suffers from hair loss, it may be concluded that the patient has superciliosis, regardless of whether or not the patient has red ears. This is so because the unknown variable cannot force the sum to become negative.
A system such as this can also explain how or why a conclusion was reached. For example, when inputs and outputs are regarded as rules, an output can be explained as the conclusion to a rule. If placibin is true, the system might explain why with a statement such as:
Placibin is TRUE due to the following rule: IF Placibin Allergy (u6) is FALSE, and Superciliosis is TRUE THEN Conclude Placibin is TRUE.
Expert systems based on neural network architectures can be designed to possess many of the features of other expert system types, including explanations for how and why, and confidence estimation for variable deduction.
Learning in Neural Networks
The perceptron, an invention of Rosenblatt [1962], was one of the earliest neural network models. A perceptron models a neuron by taking a weighted sum of its inputs and sending the output 1 if the sum is greater than some adjustable threshold value (otherwise it sends 0). Figure 7.17
  • 43. shows the device. Notice that in a perceptron, unlike a Hopfield network, connections are unidirectional.
The inputs (x1, x2, ..., xn) and connection weights (w1, w2, ..., wn) in the figure are typically real values, both positive and negative. If the presence of some feature xi tends to cause the perceptron to fire, the weight wi will be positive; if the feature xi inhibits the perceptron, the weight wi will be negative. The perceptron itself consists of the weights, the summation processor, and the adjustable threshold processor. Learning is a process of modifying the values of the weights and the threshold. It is convenient to implement the threshold as just another weight w0, as in Figure 7.18. This weight can be thought of as the propensity of the perceptron to fire irrespective of its inputs. The perceptron of Figure 7.18 fires if the weighted sum is greater than zero.
A perceptron computes a binary function of its input. Several perceptrons can be combined to compute more complex functions, as shown in Figure 7.19.
Figure 7.17: A Neuron and a Perceptron
Figure 7.18: Perceptron with Adjustable Threshold Implemented as Additional Weight
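In code, folding the threshold into the extra weight w0 looks like the sketch below; the AND-gate weights are just one illustrative choice.

```python
def perceptron(weights, inputs):
    # weights[0] is w0, the threshold treated as the weight on a constant input
    # of 1; the unit fires when the full weighted sum is greater than zero.
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else 0

# A two-input unit that fires only when both inputs are 1 (a logical AND).
and_weights = [-1.5, 1.0, 1.0]
print([perceptron(and_weights, [a, b]) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
```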
  • 44. Figure 7.19: A Perceptron with Many Inputs and Many Outputs
Such a group of perceptrons can be trained on sample input-output pairs until it learns to compute the correct function. The amazing property of perceptron learning is this: whatever a perceptron can compute, it can learn to compute. We demonstrate this in a moment. At the time perceptrons were invented, many people speculated that intelligent systems could be constructed out of perceptrons (see Figure 7.20).
Since the perceptrons of Figure 7.19 are independent of one another, they can be separately trained. So let us concentrate on what a single perceptron can learn to do. Consider the pattern classification problem shown in Figure 7.21. This problem is linearly separable, because we can draw a line that separates one class from the other. Given values for x1 and x2, we want to train a perceptron to output 1 if it thinks the input belongs to the class of white dots and 0 if it thinks the input belongs to the class of black dots. We have no explicit rule to guide us; we must induce a rule from a set of training instances. We now see how perceptrons can learn to solve such problems.
First, it is necessary to take a close look at what the perceptron computes. Let x be an input vector (x1, x2, ..., xn). Notice that the weighted summation function g(x) and the output function o(x) can be defined as:
g(x) = Σ (i = 0 to n) wi xi
o(x) = 1 if g(x) > 0, and 0 if g(x) ≤ 0
Consider the case where we have only two inputs (as in Figure 7.22). Then:
  • 45. g(x) = w0 + w1 x1 + w2 x2
Figure 7.20: An Early Notion of an Intelligent System Built from Trainable Perceptrons
If g(x) is exactly zero, the perceptron cannot decide whether to fire. A slight change in inputs could cause the device to go either way. If we solve the equation g(x) = 0, we get the equation for a line:
x2 = −(w1 / w2) x1 − w0 / w2
The location of the line is completely determined by the weights w0, w1, and w2. If an input vector lies on one side of the line, the perceptron will output 1; if it lies on the other side, the perceptron will output 0. A line that correctly separates the training instances corresponds to a perfectly functioning perceptron. Such a line is called a decision surface. In perceptrons with many inputs, the decision surface will be a hyperplane through the multidimensional space of possible input vectors. The problem of learning is one of locating an appropriate decision surface.
We present a formal learning algorithm later. For now, consider the informal rule: If the perceptron fires when it should not fire, make each wi smaller by an amount proportional to xi. If the perceptron fails to fire when it should fire, make each wi larger by a similar amount.
Suppose we want to train a three-input perceptron to fire only when its first input is on. If the perceptron fails to fire in the presence of an active x1, we will increase w1 (and we may increase other weights). If the perceptron fires incorrectly, we will end up decreasing weights that are not
  • 46. w1. (We will never decrease w1, because undesired firings only occur when x1 is 0, which forces the proportional change in w1 also to be 0.) In addition, w0 will find a value based on the total number of incorrect firings versus incorrect misfirings. Soon, w1 will become large enough to overpower w0, while w2 and w3 will not be powerful enough to fire the perceptron, even in the presence of both x2 and x3.
Figure 7.21: A Linearly Separable Pattern Classification Problem
Now let us return to the functions g(x) and o(x). While the sign of g(x) is critical to determining whether the perceptron will fire, the magnitude is also important. The absolute value of g(x) tells how far a given input vector x lies from the decision surface. This gives us a way of characterizing how good a set of weights is. Let w be the weight vector (w0, w1, ..., wn), and let X be the subset of training instances misclassified by the current set of weights. Then define the perceptron criterion function, J(w), to be the sum of the distances of the misclassified input vectors from the decision surface:
J(w) = Σ (x in X) | Σ (i = 0 to n) wi xi | = Σ (x in X) | w · x |
To create a better set of weights than the current set, we would like to reduce J(w). Ultimately, if all inputs are classified correctly, J(w) = 0.
How do we go about minimizing J(w)? We can use a form of local-search hill climbing known as gradient descent. For our current purposes, think of J(w) as defining a surface in the space of all possible weights. Such a surface might look like the one in Figure 7.22.
In the figure, weight w0 should be part of the weight space but is omitted here because it is easier to visualize J in only three dimensions. Now, some of the weight vectors constitute solutions, in that a perceptron with such a weight vector will classify all its inputs correctly. Note that there are
  • 47. an infinite number of solution vectors. For any solution vector ws, we know that J(ws) = 0. Suppose we begin with a random weight vector w that is not a solution vector. We want to slide down the J surface. There is a mathematical method for doing this: we compute the gradient of the function J(w). Before we derive the gradient function, we reformulate the perceptron criterion function to remove the absolute value sign:
J(w) = Σ (x in X) of:  w · x  if x is misclassified as a negative example,  −w · x  if x is misclassified as a positive example
Figure 7.22: Adjusting the Weights by Gradient Descent, Minimizing J(w)
Recall that X is the set of misclassified input vectors. Now, here is ∇J, the gradient of J(w) with respect to the weight space:
∇J(w) = Σ (x in X) of:  x  if x is misclassified as a negative example,  −x  if x is misclassified as a positive example
The gradient is a vector that tells us the direction to move in the weight space in order to reduce J(w). In order to find a solution weight vector, we simply change the weights in the direction of
  • 48. the gradient, recompute J(w), recompute the new gradient, and iterate until J(w) = 0. The rule for updating the weights at time t + 1 is:
w(t+1) = w(t) + η∇J
Or in expanded form:
w(t+1) = w(t) + η Σ (x in X) of:  x  if x is misclassified as a negative example,  −x  if x is misclassified as a positive example
η is a scale factor that tells us how far to move in the direction of the gradient. A small η will lead to slower learning, while too large an η may cause the weights to overshoot a solution. Taking η to be a constant gives us what is usually called the "fixed-increment perceptron-learning algorithm":
1. Create a perceptron with n+1 inputs and n+1 weights, where the extra input x0 is always set to 1.
2. Initialize the weights (w0, w1, ..., wn) to random real values.
3. Iterate through the training set, collecting all examples misclassified by the current set of weights.
4. If all examples are classified correctly, output the weights and quit.
5. Otherwise, compute the vector sum S of the misclassified input vectors, where each vector has the form (x0, x1, ..., xn). In creating the sum, add to S a vector x if x is an input for which the perceptron incorrectly fails to fire, but add the vector −x if x is an input for which the perceptron incorrectly fires. Multiply the sum by a scale factor η.
6. Modify the weights (w0, w1, ..., wn) by adding the elements of the vector S to them. Go to step 3.
The perceptron-learning algorithm is a search algorithm. It begins in a random initial state and finds a solution state. The search space is simply all possible assignments of real values to the weights of the perceptron, and the search strategy is gradient descent.
So far, we have seen two search methods employed by neural networks: gradient descent in perceptrons and parallel relaxation in Hopfield networks. It is important to understand the relation
  • 49. 226 ARTIFICIAL INTELLIGENCEbetween the two. Parallel relaxation is a problem-solving strategy, analogous to state spacesearch in symbolic AI. Gradient descent is a learning strategy, analogous to techniques such asversion spaces. In both symbolic and connectionist AI, learning is viewed as a type of problemsolving, and this is why search is useful in learning. But the ultimate goal of learning is to get asystem into a position where it can solve problems better. Do not confuse learning algorithms withothers.The perceptron convergence theorem, due to Rosenblatt [1962], guarantees that the perceptronwill find a solution state, i.e., it will learn to classify any linearly separable set of inputs. In otherwords, the theorem shows that in the weight space, there are no local minima that do notcorrespond to the global minimum. Figure 7.23 shows a perceptron learning to classify theinstances of Figure 7.21. Remember that every set of weights specifies some decision surface, inthis case some two-dimensional line. In the figure, k is the number of passes through the trainingdata, i.e., the number of iterations of steps 3 through 6 of the fixed-increment perceptron-learningalgorithm.The introduction of perceptrons in the late 1950s created a great deal of excitement, here was adevice that strongly resembled a neuron and for which well-defined learning algorithms wasavailable. There was much speculation about how intelligent systems could be constructed fromperceptron building blocks. In their book Perceptrons, Minsky and Papert put an end to suchspeculation by analyzing the computational capabilities of the devices. They noticed that while theconvergence theorem guaranteed correct classification of linearly separable data, most problemsdo not supply such nice data. Indeed, the perceptron is incapable of learning to solve some verysimple problems. Figure 7.23: A Perceptron Learning to Solve a Classification Problem
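The fixed-increment algorithm above translates almost step for step into code. The training set (the logical OR function, which is linearly separable) and the value of η are illustrative choices, and a real run may need a different epoch limit.

```python
import random

def train_perceptron(examples, n, eta=0.5, max_epochs=100):
    # examples: list of (inputs, target) pairs with targets 0 or 1.
    # Steps 1-2: n+1 weights, with the extra input x0 fixed at 1.
    w = [random.uniform(-1, 1) for _ in range(n + 1)]
    for _ in range(max_epochs):
        s = [0.0] * (n + 1)                         # step 5: sum of misclassified vectors
        mistakes = 0
        for x, target in examples:                  # step 3: scan the training set
            xv = [1] + list(x)
            fired = 1 if sum(wi * xi for wi, xi in zip(w, xv)) > 0 else 0
            if fired != target:
                mistakes += 1
                sign = 1 if target == 1 else -1     # add x or -x depending on the error
                s = [si + sign * xi for si, xi in zip(s, xv)]
        if mistakes == 0:                           # step 4: everything classified correctly
            return w
        w = [wi + eta * si for wi, si in zip(w, s)] # step 6: move along the summed vector
    return w

or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(or_examples, n=2))
```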
  • 50. The perceptron cannot learn a linear decision surface to separate these different outputs, because no such decision surface exists. No single line can separate the 1 outputs from the 0 outputs. Minsky and Papert gave a number of problems with this property, including telling whether a line drawing is connected, and separating figure from ground in a picture. Notice that the deficiency here is not in the perceptron-learning algorithm, but in the way the perceptron represents knowledge.
If we could draw an elliptical decision surface, we could encircle the two "1" outputs in the XOR space. However, perceptrons are incapable of modeling such surfaces. Another idea is to employ two separate line-drawing stages. We could draw one line to isolate the point (x1 = 1, x2 = 1) and then another line to divide the remaining three points into two categories. Using this idea, we can construct a "multilayer" perceptron (a series of perceptrons) to solve the problem, as shown in the sketch that follows this section's discussion.
Note how the output of the first perceptron serves as one of the inputs to the second perceptron, with a large, negatively weighted connection. If the first perceptron sees the input (x1 = 1, x2 = 1), it will send a massive inhibitory pulse to the second perceptron, causing that unit to output 0 regardless of its other inputs. If either of the inputs is 0, the second perceptron gets no inhibition from the first perceptron, and it outputs 1 if either of the inputs is 1.
Backpropagation Networks
As suggested by Figure 7.20 and the Perceptrons critique, the ability to train multilayer networks is an important step in the direction of building intelligent machines from neuronlike components. Let's reflect for a moment on why this is so. Our goal is to take a relatively amorphous mass of neuronlike elements and teach it to perform useful tasks. We would like it to be fast and resistant to damage. We would like it to generalize from the inputs it sees. We would like to build these neural masses on a very large scale, and we would like them to be able to learn efficiently. Perceptrons got us part of the way there, but we saw that they were too weak computationally. So we turn to more complex, multilayer networks.
What can multilayer networks compute? The simple answer is: anything! Given a set of inputs, we can use summation-threshold units as simple AND, OR, and NOT gates by appropriately setting the threshold and connection weights. We know that we can build any arbitrary combinational circuit out of those basic logical units. In fact, if we are allowed to use feedback loops, we can build a general-purpose computer with them.
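The two-stage XOR construction described above can be written out explicitly: one unit detects the (1, 1) case and sends a large inhibitory signal to a second, OR-like unit. The particular weights below are one choice of many that work.

```python
def step(total):
    return 1 if total > 0 else 0

def xor(x1, x2):
    both = step(x1 + x2 - 1.5)                  # first perceptron: fires only on (1, 1)
    return step(x1 + x2 - 2.0 * both - 0.5)     # second: OR, strongly inhibited by the first

print([xor(a, b) for a in (0, 1) for b in (0, 1)])   # [0, 1, 1, 0]
```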
  • 51. 228 ARTIFICIAL INTELLIGENCEThe major problem is learning. The knowledge representation system employed by neural nets isquite opaque: the nets must learn their own representations because programming them by handis impossible. Perceptrons had the nice property that whatever they could compute, they couldlearn to compute. Does this property extend to multilayer networks? The answer is yes, sort of.Backpropagation is a step in that direction.It will be useful to deal first with a subclass of multiplayer networks namely fully connected,layered, feedforward networks. A sample of such a network is shown in Figure 7.24. In this figure,xi, hi, and oi represent unit activation levels of input, hidden, and output units. Weights onconnections between the input and hidden layers are denoted here by w1 ij, while weights onconnections between the hidden and output layers are denoted by w2ij. This network has threelayers, although it is possible and sometimes useful to have more. Each unit in one layer isconnected in the forward direction to every unit in the next layer. Activations flow from the inputlayer through the hidden layer, then on to the output layer. As usual, the knowledge of thenetwork is encoded in the weights on connections between units. In contrast to the parallelrelaxation method used by Hopfield nets, backpropagation networks perform a simplercomputation. Because activations flow in only one direction, there is no need for an iterativerelaxation process. The activation levels of the units in the output layer determine the output ofthe network.The existence of hidden units allows the network to develop complex feature detectors, or internalrepresentations. Figure 7.25 shows the application of a three layer network to the problem ofrecognizing digits. The two-dimensional grid containing the numeral “7” forms the input layer. Asingle hidden unit might be strongly activated by a horizontal line in the input, or perhaps adiagonal. The important thing to note is that the behaviour of these hidden units is automaticallylearned, not preprogrammed. In Figure 7.25, the input grid appears to be laid out in twodimensions, but the fully connected network is unaware of this 2-D structure. Because thisstructure can be important, many networks permit their hidden units to maintain only localconnections to the input layer (e.g., a different 4 by 4 subgrid for each hidden unit).The hope in attacking problems like handwritten character recognition is that the neural networkwill not only learn to classify the inputs it is trained on but that it will generalize and be able toclassify inputs that it has not yet seen. We return to generalization in the next section.A reasonable question at this point is: “All neural nets seem to be able to do is classification. HardAI problems such as planning, natural language parsing, and theorem proving are not simply
  • 52. LEARNING 229classification tasks, so how do connectionist models address these problems?” Most of theproblems we see in this chapter are indeed classification problems, because these are theproblems that neural networks are best suited to handle at present. A major limitation of currentnetwork formalisms is how they deal with phenomena that involve time. This limitation is lifted tosome degree in work on recurrent networks but the problems are still severe. Hence, weconcentrate on classification problems for now. Figure 7.24: A Multilayer NetworkLet’s now return to backpropagation networks. The unit in a backpropagation network requires aslightly different activation function from the perception. A backpropagation unit still sums up itsweighted inputs, but unlike the perception, it produces a real value between 0 and 1 as output,based on a sigmoid (or S-shaped) function, which is continuous and differentiable, as required bythe backpropagation algorithm. Let sum be the weighted sum of the inputs to a unit. The equationfor the unit’s output is given by: 1 output = 1 + e − sumNotice that if the sum is 0, the output is 0.5 (in contrast to the perception, where it must be either0 or 1). As the sum gets larger, the output approaches 1. As the sum gets smaller, on the otherhand, the output approaches 0.
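The sigmoid activation is one line of code; comparing it with the perceptron's step function makes the contrast of Figure 7.26 concrete. The sample sums below are arbitrary.

```python
import math

def step(total):
    # Perceptron activation: a hard 0/1 decision.
    return 1 if total > 0 else 0

def sigmoid(total):
    # Backpropagation-unit activation: smooth and differentiable, output in (0, 1).
    return 1.0 / (1.0 + math.exp(-total))

for s in (-4, -1, 0, 1, 4):
    print(s, step(s), round(sigmoid(s), 3))   # sigmoid(0) = 0.5; tails approach 0 and 1
```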
  • 53. 230 ARTIFICIAL INTELLIGENCE Figure 7.25: Using a Multilayer Network to Learn to Classify Handwritten Digits 1.0 1.0 0.5 0.5 0 0 Figure 7.26: The Stepwise Activation of the Perceptron and the Sigmoid Activation Function of the Backpropagation Unit (right)Like a perception, a backpropagation network typically starts out with a random set of weights.The network adjusts its weights each time it sees an input-output pair. Each pair requires twostates: a forward pass and a backward pass. The forward pass involves presenting a sampleinput to the network and letting activations flow until they reach the output layer. During thebackward pass, the network’s actual output (from the forward pass) is compared with the targetoutput and error estimates are computed for the output units. The weights connected to theoutput units can be adjusted in order to reduce those errors. We can then use the error estimatesof the output units to derive error estimates for the units in the hidden layers. Finally, errors arepropagated back to the connections stemming from the input units.Unlike the perception-learning algorithm of the last section, the backpropagation algorithm usuallyupdates its weights incrementally, after seeing each input-output pair. After it has seen all the
  • 54. LEARNING 231input-output pairs (and adjusted its weights that many times), we say that one epoch has beencompleted. Training a backpropagation network usually requires many epochs.Refer back to Figure 7.25 for the basic structure on which the following algorithm is based.Algorithm: BackpropagationGiven: A set of input-output vector pairs.Compute: A set of weights for a three-layer network that maps inputs onto corresponding outputs.1. Let A be the number of units in the input layer, as determined by the length of the training input vectors. Let C be the number of units in the output layer. Now choose B, the number of units in the hidden layer. As shown in Figure 7.25, the input and hidden layers each have an extra unit used for thresholding; therefore, the units in these layers will sometimes be indexed by the ranges (0,…..,A) and 0,…..,B). We denote the activation levels of the units in the input layer by xj, in the hidden layer by hj, and in the output layer by 0j. Weights connecting the input layer to the hidden layer are denoted by w1 ij, where the subscript 0 indexes the input units and j indexes the hidden the hidden units. Likewise, weights connecting the hidden layer to the output layer are denoted by w2 ij, with I indexing to hidden units and j indexing output units.2. Initialize the weights in the network. Each weight should be set randomly to a number between – 0.1 and 0.1. w 1 ij = random ( −0 . 1 , 0 . 1 ) for alli =0 ,....., A j =1 ,..., , B w 2 ij = random ( − 0 . 1 , 0 . 1 ) for alli = 0 ,...., B, j = 1 ,....., C3. Initialize the activations of the thresholding units. The values of these thresholding units should never change. x 0 =1 .0 h0 = 1 . 04. Choose an input-output pair. Suppose the input vector is x 1 and the target output vector is y i . Assign activation level to the input units.5. Propagate the activations from the units in the input layer to the units in the hidden layer using the activation function.
  • 55.
hj = 1 / (1 + e^(−Σ (i = 0 to A) w1ij xi))   for all j = 1, ..., B
Note that i ranges from 0 to A. w1 0j is the thresholding weight for hidden unit j (its propensity to fire irrespective of its inputs). x0 is always 1.0.
6. Propagate the activations from the units in the hidden layer to the units in the output layer.
oj = 1 / (1 + e^(−Σ (i = 0 to B) w2ij hi))   for all j = 1, ..., C
Again, the thresholding weight w2 0j for output unit j plays a role in the weighted summation. h0 is always 1.0.
7. Compute the errors of the units in the output layer, denoted δ2j. Errors are based on the network's actual output (oj) and the target output (yj).
δ2j = oj (1 − oj)(yj − oj)   for all j = 1, ..., C
8. Compute the errors of the units in the hidden layer, denoted δ1j.
δ1j = hj (1 − hj) Σ (i = 1 to C) δ2i w2ji   for all j = 1, ..., B
9. Adjust the weights between the hidden layer and output layer. The learning rate is denoted η; its function is the same as in perceptron learning. A reasonable value of η is 0.35.
Δw2ij = η · δ2j · hi   for all i = 0, ..., B and j = 1, ..., C
10. Adjust the weights between the input layer and the hidden layer.
Δw1ij = η · δ1j · xi   for all i = 0, ..., A and j = 1, ..., B
11. Go to step 4 and repeat. When all the input-output pairs have been presented to the network, one epoch has been completed. Repeat steps 4 to 10 for as many epochs as desired.
The algorithm generalizes straightforwardly to networks of more than three layers. For each extra hidden layer, insert a forward propagation step between steps 6 and 7, an error computation step between steps 8 and 9, and a weight adjustment step between steps 10 and 11. Error computation for hidden units should use the equation in step 8, but with i ranging over the units in the next layer, not necessarily the output layer.
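A compact sketch of steps 1 to 11 for a three-layer network follows, using the XOR function as sample training data. The layer sizes, learning rate, and epoch count are illustrative choices, and with so few hidden units a run can occasionally stall in a local minimum, which is exactly the limitation discussed in the next section.

```python
import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop(pairs, A, B, C, eta=0.35, epochs=5000):
    # Steps 1-2: weight matrices, each with an extra thresholding row (index 0).
    w1 = [[random.uniform(-0.1, 0.1) for _ in range(B)] for _ in range(A + 1)]
    w2 = [[random.uniform(-0.1, 0.1) for _ in range(C)] for _ in range(B + 1)]
    for _ in range(epochs):
        for x, y in pairs:                        # step 4: pick an input-output pair
            xs = [1.0] + list(x)                  # step 3: thresholding unit fixed at 1.0
            # Steps 5-6: forward pass through the hidden and output layers.
            h = [1.0] + [sigmoid(sum(xs[i] * w1[i][j] for i in range(A + 1)))
                         for j in range(B)]
            o = [sigmoid(sum(h[i] * w2[i][j] for i in range(B + 1)))
                 for j in range(C)]
            # Step 7: output-layer errors.
            d2 = [o[j] * (1 - o[j]) * (y[j] - o[j]) for j in range(C)]
            # Step 8: hidden-layer errors (the thresholding unit h[0] gets none).
            d1 = [h[j + 1] * (1 - h[j + 1]) * sum(d2[k] * w2[j + 1][k] for k in range(C))
                  for j in range(B)]
            # Steps 9-10: adjust both weight layers.
            for i in range(B + 1):
                for j in range(C):
                    w2[i][j] += eta * d2[j] * h[i]
            for i in range(A + 1):
                for j in range(B):
                    w1[i][j] += eta * d1[j] * xs[i]
    return w1, w2

# Targets of 0.1 and 0.9 rather than 0 and 1, as suggested in the next section.
xor_pairs = [((0, 0), (0.1,)), ((0, 1), (0.9,)), ((1, 0), (0.9,)), ((1, 1), (0.1,))]
w1, w2 = backprop(xor_pairs, A=2, B=2, C=1)
```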
  • 56. The speed of learning can be increased by modifying the weight-adjustment steps 9 and 10 to include a momentum term α. The weight update formulas become:
Δw2ij(t+1) = η · δ2j · hi + α · Δw2ij(t)
Δw1ij(t+1) = η · δ1j · xi + α · Δw1ij(t)
where hi, xi, δ1j and δ2j are measured at time t + 1, and Δwij(t) is the change the weight experienced during the previous forward-backward pass. If α is set to 0.9 or so, learning speed is improved.
Recall that the activation function has a sigmoid shape. Since infinite weights would be required for the actual outputs of the network to reach 0.0 and 1.0, binary target outputs (the yj's of steps 4 and 7 above) are usually given as 0.1 and 0.9 instead. The sigmoid is required by backpropagation because the derivation of the weight update rule requires that the activation function be continuous and differentiable.
The derivation of the weight update rule is more complex than the derivation of the fixed-increment update rule for perceptrons, but the idea is much the same. There is an error function that defines a surface over weight space, and the weights are modified in the direction of the gradient of the surface. Interestingly, the error surface for multilayer nets is more complex than the error surface for perceptrons. One notable difference is the existence of local minima. Recall the bowl-shaped space we used to explain perceptron learning (Figure 7.22). As we modified weights, we moved toward the bottom of the bowl until we reached it. A backpropagation network, however, may slide down the error surface into a set of weights that does not solve the problem it is being trained on. If that set of weights is at a local minimum, the network will never reach the optimal set of weights. Thus, we have no analogue of the perceptron convergence theorem for backpropagation networks.
There are several methods of overcoming the problem of local minima. The momentum factor α, which tends to keep the weight changes moving in the same direction, allows the algorithm to skip over small minima. Adjusting the shape of a unit's activation function can also have an effect on the network's susceptibility to local minima. In practice, local minima turn out to be rare: in large networks, the high-dimensional weight space provides plenty of degrees of freedom for the algorithm, and the lack of a convergence theorem is not a problem. However, this pleasant feature of backpropagation was not discovered until recently, when digital computers became fast enough to support large-scale simulations of neural networks. The backpropagation algorithm was actually derived independently by a number of researchers in the past, but it was discarded many times because of the potential problems with local minima. In the days before
  • 57. 234 ARTIFICIAL INTELLIGENCEfast digital computers, researches could judge their ideas by proving theorems about them , andthey had no idea that local minima would turn out to be rare in practice. The modern form ofbackpropagation is often credited to Werbos [1974], LeCun [1985], Parker [1985], and Rumelhartet al. [1986].Backpropagation networks are not without real problems, however, with the most serious beingthe slow speed of learning. Even simple tasks require extensive training periods. The XORproblem, for example involves only five units and nine weights, but it can require many, manypasses through the four training cases before the weights coverage , especially if the learningparameters are not carefully tuned . Also, simple backpropagation does not scale up very well.The number of training examples required is superlinear in the size of the network.Since backpropagation is inherently a parallel, distributed algorithm the idea of improving speedby building special–purpose backpropagation hardware is attractive. However, fast new variationsof backpropagation and other learning algorithm appear frequently in the literature e.g., Fahlman[1988]. By the time an algorithm is transformed into hardware and embedded in a computersystem, the algorithm is likely to be obsolete. Student Activity 7.3Answer the following questions.1. Describe Learning Automation with example.2. How neural networks are useful in learning?3. What are perceptions?Summary The most important thing to conclude from the study of automated learning is that learning itself is a problem-solving process. A learning machine is the dream system of AI. the key to intelligent behavior is having a lot of knowledge. Getting all of that knowledge into a computer is a staggering task of acquiring knowledge independently, as people do. We do not yet have programs that can extend themselves indefinitely. But we have discovered some of the reasons for our failure to create such systems.
  • 58. LEARNING 235 If we look at actual learning programs, we find that the more knowledge a program starts with, the more it can learn. This finding is satisfying, in the sense that it corroborates our other discoveries about the power of knowledge. Learning automata systems are finite set adaptive systems which interact iteratively with a general environment Research in machine learning has gone through several cycles of popularity. A learning program needs to acquire new knowledge and new problem-solving abilities, but knowledge and problem solving are topics still under intensive study. Genetic algorithm systems start with a fixed size population of data structures that are used to perform some given tasks. These structures are rated on their performance, and a new generation of data structures is then created If we do not understand the nature of the thing we want to learn. Learning is difficult. Not surprisingly, the most successful learning programs operate in fairly well understood areas (like planning), and not in less well-understood areas (like natural language understanding).Self-assessment QuestionsFill in the blanks (Solved)1. _________ is generally acquired through experience.2. A back propagation produces a real value between ______ and _______.Answers1. Knowledge2. 0, 1True or False (Solved)1. Learning from examples does not involve stimuli from the outside.2. A perception computes a binary function of its input.Answers
  • 59. 236 ARTIFICIAL INTELLIGENCE1. False2. TrueFill in the blanks (Unsolved)1. Catching is known as ___________.2. __________ is a process similar to macro-operators.3. _____________ is a process of assigning, to a particular input, the name of a class to which it belongs.True or False (Unsolved)1. Learning Automata systems are called Rote learning.2. Sequences of actions that can be treated as a whole are called macro-operators.3. An EBL system attempts to learn from a set of examples.Detailed Questions1. Would it be reasonable to apply Samuels rote-learning procedure to chess? Why (not)?2. Consider the problem of building a program to learn a grammar for a language such as English assume that such a program would be provided, as input with a set of pairs each consisting of a sentence and a representation of the meaning of the sentence. This is analogous to the experience of a child who hears a sentence and sees something at the same time. How could such a program be built using the techniques discussed in this unit?