
  1. 1. DARPA MARS Robotic Vision 2020 program proposal: CACI. Cover page. Program Solicitation No.: Mobile Autonomous Robot Software, BAA #02-15. Technical topic areas: (1) structured software modules, (2) learning and adaptation tools, (3) robot self-monitoring, (4) software components, (5) sensor-based algorithms, and (6) behavior software components and architecture structures. Title of Proposal: CACI: Cross-Platform and Cross-Task Integrated Autonomous Robot Software. Submitted to: DARPA/IPTO, ATTN: BAA 02-15, 3701 North Fairfax Drive, Arlington, VA 22203-1714. Technical contact: John Weng, Associate Professor, 3115 Engineering Building, Embodied Intelligence Laboratory, Michigan State University, East Lansing, MI 48824; Tel: 517-353-4388; Fax: 517-432-1061. Administrative contact: Daniel T. Evon, Director, Contract and Grant Administration, 301 Administration Building, Michigan State University, East Lansing, MI 48824, USA; Tel: 517-355-4727; Fax: 517-353-9812. Contractor’s type of business: Other educational. Summary of cost: July 1, 2002 – June 30, 2003: $1,619,139; July 1, 2003 – June 30, 2004: $1,674,006; Total: $3,293,145.
  2. 2. A Innovative Claims for the Proposed Research. The major innovative claims include: (1) Cross-platform integrated software. The proposed software is applicable to different platforms through a uniform API level that gives software “plug-and-play” capability to every plug-and-play compliant robot body. This cross-platform capability is made practical by recent advances in the biologically motivated “developmental methodology” and other related technologies. Although only indoor platforms will be tested, the technology is not limited to indoor platforms. (2) Cross-task integrated software. The team will develop various perceptual and behavioral capabilities for a suite of skills covering a wide range of tasks, performed by robots individually or collectively. This cross-task capability is made practical by a combination of “developmental methodology” and other related technologies. (3) Highly perceptual robots that systematically combine vision, speech, touch and symbolic inputs to perceive and “understand” the environment, including humans, other robots, objects and their changes, for behavior generation. The proposed software will be able to detect, track, classify and interact with objects such as human overseers, bystanders, other robots and other objects. (4) “Robots keep alive” during overseer intervention. The robots continue to be fully “aware” of their operating environment and incrementally improve their performance even when a human intervenes. A human can interact with robots before, during or after task execution. This is made practical by recent advances in robot “developmental” software and a novel integration with other machine learning techniques and the perceptual-frame-based servo-level controller.
(5) Multimodal human interventions in the robot’s physical environment: from wireless remote control, to physical touch and blocking, to auditory and speech commands, to visual cues, to high-level goal alteration. Such a “mode rich” robotic capability is made possible by a novel integration of “automatic generation of representation” in developmental programs with controller-level perceptual-frame-based force control. (6) Digital dense 3-D map construction capability using both laser radars (ladars) and trinocular stereo cameras, for robot access and human use. Using ladar sensing, the proposed effort will use both range and intensity information to achieve real-time multi-layer scene representation and object classification. A multi-look methodology will be developed to produce a 3-D map using a low-resolution pulse ladar sensor. An efficient multiple-path ladar fusion algorithm will be developed to produce multi-layer 3-D scene representation. Using trinocular stereo camera sensing, the proposed effort will use radial image rectification, variable-step-size textured-light-aided trinocular color stereoscopy, very fast sensor evidence cone insertion, multiple-viewpoint color variance sensor model learning, and coarse-to-fine resolution hierarchies. (7) Systematic methodology for quantitative assessment and validation of specific techniques in specific system roles. This team has access to a wide range of robot test beds to support extensive assessment, from a simple mobile platform to the sophisticated Dav humanoid (its combination of mobile and untethered features is unique among all existing humanoids in the US).
  3. 3. B Proposal Roadmap. The main goal: to develop CACI, a Cross-platform And Cross-task Integrated software system that lets multiple perception-based autonomous robots operate effectively in real-world environments and interact with humans and with other robots, as illustrated in Figure 1. Figure 1: Future robot assistants for commanders? This schematic illustration is synthesized with pictures of real SAIL and Dav robots. Tangible benefits to end users: 1. Greatly reduced cost of software development due to the cross-platform nature of the proposed software. Although each different robot platform needs to be trained using the proposed software, the time of interactive training is significantly shorter than directly programming each different robot platform for perception-based autonomy. 2. Greatly enhanced capability of robots to operate semi-autonomously in uncontrolled environments, military and domestic. The proposed software is applicable, in principle, to indoor and outdoor, on-road and off-road, ground-, air-, sea-, and space-based platforms. However, in the proposed effort, we will concentrate on a wide variety of indoor platforms, with some extension to outdoor on-road applications. 3. Greatly increased variety of tasks that robots can execute semi-autonomously: not just navigating according to range information while avoiding collisions, but also detecting, tracking, recognizing, classifying and interacting with humans, robots and other objects. Examples include handing ammunition over to a soldier on request, warning of an incoming threat, and disposing of explosive ordnance. 4. Greatly reduced frequency of required interventions by human overseers. Depending on the tasks executed, the interval between human interventions can be as long as a few minutes.
Critical technical barriers: Most autonomous robots are “action cute but perception weak.” They either operate under human remote control or are programmed to perform a set of pre-designed actions in largely controlled environments, e.g., following a red ball, playing robot soccer or navigating in a known environment. However, their capability of responding to unknown environments (e.g., visual, auditory and touch) is weak. The main elements of the proposed approach:
  4. 4. 1. The proposed project will integrate a set of the most powerful technologies for unknown environments that have been developed in our past efforts in the DARPA MARS program as well as elsewhere, including various Autonomous Mental Development (AMD) techniques, Markov Decision Process (MDP) based machine learning, supervised learning, reinforcement learning, and the new communicative learning (including language acquisition and learning through language). 2. We will also integrate techniques that take advantage of prior knowledge for partially known environments, such as detecting, tracking and recognizing human faces. This allows the robot to perform these more specialized tasks efficiently without requiring a long training process. 3. Our proposed innovative 3-D map construction takes advantage of photographic stereopsis, structured light and ladars for the best quality and the widest applicability possible at this time. 4. The unique integration technology is characterized by a unified architecture for sensors, effectors and internal states and by the “plug-and-play” methodology for various indoor and outdoor on-road robot platforms. The basis for confidence that the proposed approach will overcome the technical barriers: We have successfully tested the proposed individual technologies in the previous DARPA MARS program or other prior projects. The proposed integration is, however, truly challenging. Our integration philosophy is to find the best merging boundary of each individual technology so that the capability of the integrated system is not reduced to an intersection of the individual applicabilities but is instead increased to a multiplication, or at least a summation, of them. The nature of the expected results: 1. Unique: No other group that we know of has produced our scale of robot perception results (vision, audition and touch integrated up to fine subsecond time scales).
No other team has as wide a variety of indoor platforms as ours (e.g., all other humanoids in the US are immobile). 2. Novel: Our AMD approach and the associated human-robot interaction capabilities are truly new, along with other novelties in component techniques. 3. Critical: No autonomous robot for an uncontrolled environment is possible without generating representation from real-world sensory experience, which is the hallmark of AMD. Defense environments are typically uncontrolled, very different from lab settings. The risk if the work is not done: If the proposed work is not done, the ground mobile weapons of the future combat system (FCS) will continue to rely on human operators, putting humans at full risk on the battlefield. Further, miniaturization of mobile weapons is limited by human size if a human operator has to be carried inside. The criteria for evaluating progress include the following major ones: (1) the frequency at which human overseers need to intervene, (2) the scope of tasks that the technology can deal with, (3) the scope of machine perception, (4) the flexibility of human-robot interactions, and (5) the cost of the system. The cost of the proposed effort for each year: Year 1: $1,619,139; Year 2: $1,674,006; Total: $3,293,145.
  5. 5. C Research Objectives. C.1 Problem Description. The research project will address mainly indoor mobile platforms, including non-humanoid and humanoid mobile robots. However, an indoor robot needs to perceive not only indoor scenes, but also outdoor ones. For example, an indoor robot must be able to “look out through a window” to perceive danger from outside. Further, in order to verify the cross-platform, cross-task capability of the proposed software system, the domain of application to be tested will include not only complex indoor environments, but also outdoor flat driving surfaces. However, in the proposed effort, indoor tests will have a higher priority. We will evaluate the power and limitations of the component technologies and the integrated system. C.2 Research Goals. We propose that the following robot capabilities be developed and integrated: • Robotic perception in uncontrolled environments, including vision, audition and touch, for various tasks. For example, detecting and recognizing humans, landmarks, objects, and body parts. • Robotic behaviors based on perception, including visual attention, autonomous navigation with collision avoidance, autonomous object manipulation, and path planning for various tasks. For example, guiding attention to moving parts or humans, navigating autonomously while avoiding obstacles, and picking up objects at one location and delivering them to a destination. • Construction of a 3-D world model and its application. We will integrate both laser-based direct range sensing and stereo-camera-based range sensing. The constructed 3-D map with intensity will be used by a human overseer for virtual walk-through and as an external 3-D digital map for a robot to “read,” similar to a human consulting a map, for planning tasks.
• Human-robot interactions while keeping robot “awareness.” The integrated software enables a human overseer to intervene at any time, to issue a command, to improve an action, or to issue a warning. The real-time software is able to respond to human intervention within a fraction of a second without terminating its “awareness.” The proposed project will also reach the following integration and evaluation goals: • Integration goal: Develop an integration technology that is cross-platform and cross-task. • Evaluation goal: Develop a systematic method suited for quantitative assessment of the power and limitations of specific robot capabilities, including the above four categories of capabilities.
  6. 6. C.3 Expected Impact. Volumetric sensing. Ladar is an all-weather, day-and-night active sensor. Real-time 3-D map generation and exploitation using ladar images can significantly simplify a robot’s route planning, cross-platform cooperation, and information fusion missions. It will also provide critical prior information for robot perception tasks and can significantly reduce communication bandwidth and simplify human-robot interactions. It is a critical enabling technology for cross-platform, cross-task, and cross-environment robot operations. Our robots will navigate employing a dense 3-D awareness of their surroundings, be tolerant of route surprises, and be easily placed by ordinary workers in entirely new routes or work areas. The 3-D maps built by our system, and 2-D plans derived from them, are suitable for presentation to human overseers, who could designate operationally significant locations and objects with point-and-click methods. Human overseers can also walk through the robot’s 3-D experience for better human-robot interactions. Automated detection and tracking of human faces, objects, landmarks, enemies and friends can greatly enhance a robot’s awareness of the environment, which in turn is essential for generating context-sensitive actions. For example, identifying and understanding people dynamically will result in successful interactions. The “robot-keeping-alive” way of human-robot interaction completely changes the way in which humans and robots interact as well as how robots interact with each other. Robots are no longer “dead” during human intervention. Instead, they continuously experience the physical events, including human intervention, to improve their future performance. Autonomous robots will learn through human interactions as well as their own practice.
Multimodal parallel perception will enable, for the first time, autonomous robots to sense and perceive visual, auditory and touch environments concurrently. For the first time, these visual, auditory and touch perceptual capabilities are highly integrated with the online generation of context-appropriate behaviors. In other words, perception and behaviors are not two separate processes. These milestone advances mark the overcoming of major theoretical, methodological and technical challenges in our past work. The integration technology will greatly enhance the overall capability of autonomous robot operation in an uncontrolled real-world environment, indoor and outdoor, in ways that are not possible with existing task-specific and platform-specific technologies. The proposed evaluation technology will provide rigorous quantitative data on the capability of the proposed technologies as well as comparison data with other existing technologies. With a clear understanding of the strengths and limitations, the system proposed here will enable autonomous perception-based robot technology to be available for FCS application by the year 2020.
  7. 7. D Technical Approach. D.1 Detailed Description of Technical Approach. D.1.1 System Architecture. The proposed CACI software framework is illustrated in Figure 2. Figure 2: The software architecture of the CACI system. The robot software contains three coarse layers distinguished by the time scale of their actions: a planning layer working on the minute scale, a perception-behavior layer working on the second scale and a servo control layer working on the millisecond scale. The sensor inputs are available to every layer, depending on the need of each layer. The 3-D range/intensity map is constructed from ladars and trinocular video cameras. It serves as a digital 3-D site model (Chellappa et al. 2001) available to the robot and the human overseer. The state integrator is the place to post state information required for inter-layer communication. It is divided into three sections, one for the state of each layer. Every layer can read the state of other layers from the corresponding area in the state integrator. The action integrator records actions issued from each layer to the next lower layer. A higher layer will issue action commands to be executed by only the next lower layer, but all the layers can read action commands from other layers if needed. Due to our decomposition of layers based on time scales, actions from different layers do not conflict. For example, when the deliberative layer wants the reactive layer to move forward, the reactive layer will try to move forward on the minute scale, although it might temporarily move sideways for a short period in order to avoid an obstacle.
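The state integrator and action integrator described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CACI implementation: the class names, method names, layer identifiers and command strings are all hypothetical, and the only rules taken from the text are that every layer may read every state section and that a layer may command only the next lower layer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical layer identifiers, ordered from slowest (minutes) to fastest (milliseconds).
LAYERS = ["planning", "perception_behavior", "servo_control"]


@dataclass
class StateIntegrator:
    """One section per layer; any layer may read any section."""
    sections: Dict[str, dict] = field(
        default_factory=lambda: {name: {} for name in LAYERS}
    )

    def post(self, layer: str, state: dict) -> None:
        self.sections[layer] = dict(state)

    def read(self, layer: str) -> dict:
        return dict(self.sections[layer])


@dataclass
class ActionIntegrator:
    """Records commands; a layer may command only the next lower layer."""
    log: List[Tuple[str, str, str]] = field(default_factory=list)

    def issue(self, source: str, target: str, command: str) -> None:
        if LAYERS.index(target) != LAYERS.index(source) + 1:
            raise ValueError(f"{source} may only command the next lower layer")
        self.log.append((source, target, command))

    def latest_for(self, target: str) -> Optional[str]:
        # Any layer may read the log; return the newest command for `target`.
        for _source, dst, command in reversed(self.log):
            if dst == target:
                return command
        return None


states = StateIntegrator()
actions = ActionIntegrator()

# Slow layer sets the long-term intent.
actions.issue("planning", "perception_behavior", "move_forward")
# Faster layer reacts to an obstacle without cancelling that intent.
states.post("perception_behavior", {"obstacle_ahead": True})
actions.issue("perception_behavior", "servo_control", "sidestep_left")
```

In this sketch the long-term command and the short-term sidestep coexist in the log because they are addressed to different layers, which mirrors the claim that actions from different time scales do not conflict.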
  8. 8. A major strength of the CACI framework is that it is designed to work not only with cheap, low-dimensional sensors, such as sonar and infrared sensors, but also with high-dimensional, high-data-rate sensors, such as vision, speech and dense range map inputs. It also addresses several other challenges in autonomous robots. For example, sensory perception, multisensory integration, and behavior generation based on distributed sensory information are all extremely complex. Even given an intelligent human overseer, the task of programming mobile robots to operate successfully in unstructured (i.e., unknown) environments is tedious and difficult. One promising avenue towards smarter and easier-to-program robots is to equip them with the ability to learn new concepts, behaviors and their associations. Our pragmatic approach to designing intelligent robots is one where a human designer provides the basic learning mechanism (e.g., the overall software architecture), with many details (features, internal representations, behaviors, coordination among behaviors, values of behaviors under a context) being filled in by additional training. The CACI framework is designed with this approach in mind. The CACI architecture is also suited for distributed control among multiple robots. Each robot perceives the world around it, including the commands from the human overseer or a designated robot leader. It acts according to the perceptual and behavioral skills that it has learned. Collectively, multiple robots acting autonomously can show the desired group perceptual capabilities and context-appropriate behaviors. A centralized control scheme has proven ineffective for multiple robots. D.1.2 Integration approach. The proposed CACI is an integrated software system for a wide variety of robot platforms, a wide variety of environments and a wide variety of tasks.
Such challenging integration is not possible without the methodological breakthroughs that have been achieved and demonstrated recently in MARS PI meetings and other publications. A component technology that has very limited applicability and yet is not equipped with a suitable applicability checker is not suited for integration. The team members have developed systematic technologies that are suited for integration. For perception and perception-based behaviors, the thrust is the methodology of autonomous cognitive and behavioral development. For longer-time behaviors and planning, the technologies to be used include perception-based action chaining (PBAC), the Markov decision process (MDP) and the associated learning methods. 3-D site models are used as external digital maps, external to the robot “brain.” Integrating these three methodologies as well as other well-proven techniques to be outlined in this proposal, the proposed CACI software system will reach our goal of: • cross-platform capability • cross-task capability for perception-based autonomous mobile robot software. The “cross-platform capability” means that the software is applicable to different robot bodies: • indoor and outdoor • on-road and off-road • ground, air, and under-water • earth-bound, space flight and space station • small, human and vehicle size. A particular hardware robot platform is best suited only to a particular type of environment, due to its hardware constraints, including sensors and effectors. A land robot cannot fly; a helicopter cannot dive into water. However, different robot bodies do not mean that their software must also be ad hoc, based on totally different principles. The same word processor can be used to write different articles. The same Windows 2000 operating system (OS) can be used for different computers each with a different combination of
  9. 9. computation resources and peripherals. This is known as “plug-and-play.” Of course, “plug-and-play” for autonomous robot software is much harder. The “plug-and-play,” or cross-platform, capability for autonomous robot software is based on the following well-known basic idea: encapsulate platform-dependent parts behind an application programming interface (API). From the software point of view, each robot contains three types of resources: sensors, effectors and computational resources (including CPU, memory, battery, etc.). From an application software point of view, different robot platforms simply mean different parameters for these three types of resources. For example, a camera class has resolution as a parameter and an arm class has the degrees of freedom as a parameter. The amount of work needed for us to achieve software “plug and play” for robots is large, but it should be smaller than the counterpart for an OS. This is because the OS has already addressed most “plug and play” problems for computational resources, sensors, and effectors. For CACI, “plug and play” needs to be done only at the application program level, which requires definition of object classes, including a camera class, a laser scanner class, a robot limb class, a robot wheel class, etc. A majority of these definitions have been completed and tested in the MARS program by the members of this team. In the proposed work, we will extend this “plug and play” work to more sophisticated robots (e.g., the Dav humanoid) and a wider variety of robots available to the team members. As long as the API of a robot platform is compliant with the “plug and play” specification (i.e., “plug-and-play” compliant, or p-n-p compliant), the same CACI robot software can be used for many different robots through “plug and play.” The “cross-task capability” means that the proposed CACI software is not restricted by a specific task or a few pre-defined tasks.
It is applicable to a wide variety of tasks. The type of tasks that the software can accomplish depends on the three types of resources, the quality of the CACI software design and how the robot is trained (i.e., the five factors). The cross-task capability requires: (1) cross-environments capability, (2) cross-time-scales capability, and (3) cross-goals capability. The environment, time, and goal all vary greatly and are too tedious and too difficult to model by hand in terms of type and the associated parameters. The “cross environments” capability means that the technology is applicable to various immediate worlds. In a typical defense setting, little is known about the environment before task execution. The world around an autonomous robot changes all the time. A path along a corridor that was safe before can become dangerous if a section of the wall along the corridor has been blasted away, exposing the path to enemy forces. The “cross time scales” capability requires the robots to reason at different time scales, from a fraction of a second to several hours, and at different abstraction levels, from micro-steps about how to make a turn to macro-steps about how to reach a destination. The “cross goals” capability implies that the robot must be able to deal with different goals and quickly adapt to new goals with minimal further training. For example, if the task requires the robot to move to a site behind an enemy line, short travel distance should not be the only goal. Avoiding exposure to hostile forces during travel is a much more important goal. Very often, a longer but safer route is preferred to shorter but more dangerous ones. Further, a robot must re-plan when new information arrives that requires an adjustment of the current goal. For example, when the commander says, “hurry up,” a different behavior pattern should be adopted, which may result in a partial modification of the planned route.
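The plug-and-play idea on this slide, in which a platform is described only by parameters of its sensors, effectors and computational resources, can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the class names, the `memory_mb` field, the two example platforms and their parameter values are hypothetical and are not taken from the CACI class definitions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Camera:
    resolution: Tuple[int, int]  # (width, height) in pixels


@dataclass
class Arm:
    degrees_of_freedom: int


@dataclass
class Compute:
    memory_mb: int  # hypothetical parameter for the computational resource


@dataclass
class RobotPlatform:
    """A platform is nothing more than parameters for the three resource types."""
    name: str
    sensors: List[Camera]
    effectors: List[Arm]
    compute: Compute


def sensory_vector_length(platform: RobotPlatform) -> int:
    # Application code sizes its input from the declared parameters only.
    return sum(cam.resolution[0] * cam.resolution[1] for cam in platform.sensors)


def action_vector_length(platform: RobotPlatform) -> int:
    return sum(arm.degrees_of_freedom for arm in platform.effectors)


# Two hypothetical platforms with made-up parameter values.
wheeled = RobotPlatform("wheeled_base", [Camera((320, 240))], [Arm(2)], Compute(512))
humanoid = RobotPlatform(
    "humanoid", [Camera((640, 480)), Camera((640, 480))], [Arm(7), Arm(7)], Compute(2048)
)

for platform in (wheeled, humanoid):
    print(platform.name, sensory_vector_length(platform), action_vector_length(platform))
```

The two functions never branch on the platform name, which is the point of the encapsulation: the same application code runs on either body and differs only in the sizes it reads from the declared resources.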
  10. 10. The “cross-task capability” is much harder to accomplish than the “cross-platform capability.” There is a very limited set of platform resources, in terms of the types of sensors, effectors and computational resources. They can all be well defined in terms of type and the associated parameters. For example, a video camera type has resolution as a parameter. However, robotic tasks involve much wider variation in the environment, the time scales and the goals. Lack of “cross-task” capability is a major reason for the “action cute and perception weak” phenomenon of most existing humanoid robots and mobile autonomous robots in the US and Japan. To be action cute, one way is to program carefully. If the limbs have redundant degrees of freedom, however, direct programming becomes very difficult. Innovative work has been done on learning methods for training robots to perform actions with redundant body parts (e.g., Grupen et al. 2000, Vijayakumar & Schaal 2000). Such action learning through doing is an appropriate methodology that has its root in biological motor development. However, just as cross-platform is not as difficult as cross-task, producing perception-strong robots is much more challenging than producing a sequence of actions that do not require much perception. An unknown environment poses a more challenging problem than a known redundant body, for the following major reasons: • The model of the environment is unknown, while a model of a robot body is known. • The degree of freedom of a little-known environment is much larger than that of a redundant robot body. The former is on the order of millions (sensory elements, e.g., pixels), while the latter is on the order of dozens. Of course, these millions of sensory elements are not totally independent, but we are unsure of their dependency, even when we have a range map.
• An unknown environment changes all the time, but a robot body does not change its structure even though it moves its limbs. Perception is still the bottleneck of autonomous robots after decades of research in perception-based autonomous robots. Recent advances in a new direction called autonomous mental development (AMD) (Weng et al. 2001) provide a powerful tool for dealing with the robot perception bottleneck. Battlefields are uncontrolled environments, from lighting, to weather, to scenes, to objects. Why can human perception deal with uncontrolled environments? As indicated by the neuroscientific studies cited in our recent Science paper (Weng et al. 2001), the following is what a human developmental program does: A. Derive processors with internal representation from physical experience: The representation includes filters, their inter-connections, etc. The physical experience is sensed through sensory signals. B. The processors process sensory signals and generate action outputs: Compute the response of all filters from real-time sensory input (physical experience). The human brain performs (A) and (B) simultaneously and incrementally in real time (Flavell et al. 1993, Kandel et al. 2000). (A) is done incrementally and cumulatively from real-time inputs, while (B) is computed from each input in real time. Traditional approaches (knowledge-based, learning-based, behavior-based and evolutionary) rely on a human programmer to design the representation in (A), and the robot program does only (B). The traditional approaches are task-specific and environment-specific since a human programmer can only competently think of a particular task and a particular environment. Sometimes, human-designed representations do contain some parameters that will be determined by data, and this process is known as machine learning. However, the representation is designed by the human programmer for a specific task in a specific environment.
Therefore, the learning-based approach is still task-specific.
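The distinction between (A), deriving the representation from experience, and (B), computing a response from each input, can be made concrete with a toy online learner. This is only a sketch under strong assumptions: prototype vectors with running-mean updates stand in for the filters and their inter-connections, the `novelty_threshold` parameter and the action labels are hypothetical, and nothing here reproduces the SAIL developmental program.

```python
import math
from typing import List, Optional, Sequence, Tuple


class IncrementalLearner:
    """Toy learner that grows its own prototypes (A) while responding (B)."""

    def __init__(self, novelty_threshold: float) -> None:
        self.novelty_threshold = novelty_threshold  # hypothetical parameter
        self.prototypes: List[List[float]] = []
        self.counts: List[int] = []
        self.actions: List[Optional[str]] = []

    def _nearest(self, x: Sequence[float]) -> Tuple[Optional[int], float]:
        best, best_d = None, math.inf
        for i, p in enumerate(self.prototypes):
            d = math.dist(p, x)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def step(self, x: Sequence[float], taught_action: Optional[str] = None) -> Optional[str]:
        i, d = self._nearest(x)
        # (A) Update the representation from this one input, incrementally.
        if i is None or d > self.novelty_threshold:
            self.prototypes.append(list(x))
            self.counts.append(1)
            self.actions.append(taught_action)
            i = len(self.prototypes) - 1
        else:
            self.counts[i] += 1
            n = self.counts[i]
            self.prototypes[i] = [p + (v - p) / n for p, v in zip(self.prototypes[i], x)]
            if taught_action is not None:
                self.actions[i] = taught_action
        # (B) Respond to the same input using the current representation.
        return self.actions[i]


learner = IncrementalLearner(novelty_threshold=1.0)
learner.step([0.0, 0.0], taught_action="forward")    # trainer supplies the action
learner.step([5.0, 5.0], taught_action="turn_left")
first = learner.step([0.2, -0.1])                     # no teaching: robot responds alone
second = learner.step([4.8, 5.1])
print(first, second, len(learner.prototypes))
```

A traditional learning-based program, in the terms of the text, would fix the prototypes in advance and only tune parameters attached to them; here the set of prototypes itself is produced by the inputs the robot happens to receive.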
  11. 11. With the new developmental approach, a human designs a developmental program that performs (A) and (B). The essence of the new approach is to enable a robot to generate representation automatically, online and in real time, through interactions with the physical environment. Since the developmental program can develop a complete set of filters for any environment while the robot is doing any task, the developmental approach is the only approach that is not task-specific and can deal with any environment. In other words, a properly designed developmental program can learn in any environment for any task. In practice, of course, simple tasks are learned by robots before more sophisticated tasks can be learned effectively, as with a new recruit in the army. A natural question to raise here is: How much training does a robot need before it can execute a series of tasks? According to our experience in the SAIL developmental project, the time spent training the SAIL robot to demonstrate a series of breakthrough capabilities is approximately 5 to 7 hours, much less than the time spent on writing the SAIL developmental program (which is in turn a lot shorter than programming traditional ad hoc methods). The overall time for developing a robot using AMD is much less than with any traditional perception method. Further, virtually no hand-tuning of parameters is required. It is worth noting that we do not require a robot user to train his robot, although it is allowed. Robot training is done in the robot production stage, and it does not need to be done in the robot deployment stage. A robot user receives well-trained robots if he orders them from a commercial company.
Further, the AMD technology is most useful to autonomous robotic systems for FCS, but it is also very useful for the perceptual capabilities of any autonomous system, such as surveillance systems, target detection systems, and intelligent human-computer interfaces. The basic requirements of autonomous mental development for robots include: 1. Autonomously derive the most discriminating features from high-dimensional sensory signals received online in real time (in contrast, the traditional learning approaches use human-defined features, such as colors, which are not sufficient for most tasks). 2. Automatically and incrementally generate and update the representation or model of the world, that is, the feature spaces and their inter-connections (clusters of sensory vectors which form feature subspaces, the basis vectors of the subspaces, etc.); in contrast, the traditional machine learning approaches use a hand-designed world representation and the learning only adjusts predesigned parameters. 3. Real-time online operation with a large memory. For scaling up to a large number of real-world settings and environments at real-time speed, self-organize the representation of perception in a coarse-to-fine way for very fast logarithmic time complexity (e.g., the Hierarchical Discriminant Regression (HDR) tree (Hwang & Weng 2000) used in the SAIL developmental robot). 4. Flexible learning methods: supervised learning, reinforcement learning and communicative learning can all be conducted interactively and concurrently while a robot keeps “alive” during human intervention or instruction. The realization of the above four basic requirements achieves the revolutionary “cross task” capability, which includes “cross environments,” “cross perception-based behaviors,” “cross time scales” and “cross goals” capabilities.
For example, because the representation (or model) of the world is generated through online, real-time interactions between the robot and its environment, instead of being hand designed by a human programmer, the robot software is applicable to any environment. Weng and his coworkers have demonstrated that the SAIL robot can autonomously navigate through both indoor and outdoor environments (Weng et al. 2000 and Zhang et al. 2001) guided by its vision using video cameras, a concrete demonstration of our cross-environment capability. No other robot software that we know of has demonstrated a capability for both indoor and outdoor navigation.
  12. 12. The integrated 3-D map is not used as a robot’s internal “brain” representation, since such a monolithic internal representation is not as good as a distributed representation for robot perception, as discussed by Rodney Brooks (Brooks, 1991). Instead, it is used as a digital map stored outside the robot’s “brain” and accessible to robots and humans. The robot and the human overseer can refer to, index and update the digital 3-D map, for example to retrace a trajectory in order to provide retrotraverse, route replay, “go to point X” and other capabilities. D.1.3 Evaluation approach The “cross task” capability does not mean that our software is able to do any task. As we discussed before, the five factors determine what tasks a robot can do and how well it does them. The proposed CACI system is applicable to various behaviors, including autonomous navigation, collision avoidance, object recognition and manipulation. However, different tasks require different amounts of training. Currently, tasks that can be executed within a few seconds (e.g., 1 to 10 seconds), which include most robot perception tasks, can be trained using AMD methods. For tasks that take more time, such as path planning, AMD may require a considerable amount of training; for these, a hand-designed model, such as an MDP method, is more effective. The proposed CACI system will integrate existing technologies according to the nature of the tasks. The evaluation approach includes the evaluation of the following components: 1. The performance of each component technology. 2. The environmental applicability of each component technology. 3. The effectiveness of integration in terms of the degree of increased capabilities. 4. The limitations of the integrated software. 5. The future directions to go beyond such limitations.
The criteria for evaluating progress include: (1) the frequency at which human overseers need to intervene, (2) the scope of tasks that the technology can deal with, (3) the scope of machine perception, (4) the flexibility of human-robot interactions, and (5) the cost of the system. D.1.4 3-D map generation from ladars The robot platforms that we will experiment with are equipped with ladars for range sensing and 3-D site map integration. Ladar sensors typically collect multiple returns (range gates) and the intensity associated with each valid return. The 3-D position of the surface can be computed from the robot’s position, the laser beam orientation, and the timing of the laser returns. Figure 3 shows examples of a ladar intensity map (left) and a height map (right) generated from single-path overhead ladar data. Better 3-D maps can be generated by fusing multiple paths and applying spatial interpolation. Based on multiple returns of a ladar beam and/or multiple hits within a cell, we can generate multiple-layer 3-D maps (average intensity, min/max intensity, ground level height, canopy top level map, etc.). Figure 3 shows examples of an average intensity map (upper-left), a ground level height map (upper-right), a canopy top level map (lower-left) and a color-coded height map (lower-right).
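The geometry described above can be sketched in a few lines. This is a minimal illustration, not the proposal's mapping software: the function names `ladar_point` and `layer_maps`, the flat grid layout, and the cell size are all assumptions made for the example. It converts one return's round-trip time into a 3-D point and then accumulates hits into per-cell layers (mean intensity, lowest height as ground level, highest height as canopy top).

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def ladar_point(sensor_pos, beam_dir, round_trip_s):
    """3-D surface point for one laser return (hypothetical helper)."""
    d = np.asarray(beam_dir, dtype=float)
    d = d / np.linalg.norm(d)              # unit beam direction
    one_way_range = 0.5 * C * round_trip_s  # the pulse travels out and back
    return np.asarray(sensor_pos, dtype=float) + one_way_range * d

def layer_maps(points, intensities, cell=1.0, shape=(4, 4)):
    """Accumulate hits into per-cell layers: mean intensity, ground, canopy."""
    count = np.zeros(shape)
    isum = np.zeros(shape)
    ground = np.full(shape, np.inf)    # lowest z seen in each cell
    canopy = np.full(shape, -np.inf)   # highest z seen in each cell
    for (x, y, z), inten in zip(points, intensities):
        r, c = int(y // cell), int(x // cell)
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            count[r, c] += 1
            isum[r, c] += inten
            ground[r, c] = min(ground[r, c], z)
            canopy[r, c] = max(canopy[r, c], z)
    mean_i = np.divide(isum, count, out=np.zeros(shape), where=count > 0)
    return mean_i, ground, canopy
```

Cells with no hits keep an infinite ground/canopy value here; a real pipeline would fill them by the spatial interpolation mentioned in the text.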
  13. 13. Once we generate 3-D maps, we can use them as a common reference to which information collected from different sensors and/or different platforms can be registered to form a 3-D site model of the environment (e.g., Chellappa et al. 1997, 2001). Site-model-supported image exploitation techniques can then be employed to perform robot planning and other multi-robot cooperation tasks. For example, given a robot location and orientation, ground-view images can be generated from the site model. Such predicted images are very useful for accurate robot positioning and cross-sensor registration. Multi-layer 3-D representations are efficient for representing 3D scenes of large areas and for movable objects. Figure 4 shows examples of ground-view images projected from a ladar-generated site model for a hypothetical robot view. We expect that such ground views can be generated online at a rate of several frames per second for a 640x480 image size. Figure 3: 3-D maps after fusing multiple paths and post-processing such as spatial interpolation. Shown on the upper-left is an average intensity map; upper-right is a ground level height map; lower-left is a canopy top level height map; and lower-right is a color-coded height map.
  14. 14. Figure 4: Examples of projected ground view images for sensor positioning and data fusion (scene segmentation using both intensity and geometric information). D.1.5 Range map from trinocular video cameras Another sensing modality for 3-D map construction is stereo using parallax disparity. Our stereo system is built around 3D grids of spatial occupancy evidence, a technique we have been developing since 1984, following a prior decade of robot navigation work using a different method. 2D versions of the grid approach found favor in many successful research mobile robots, but seem short of commercial reliability. 3D grids, with at least 1,000 times as much world data, were computationally infeasible until 1992, when we combined increased computer power with a 100x speedup from representational, organizational and coding innovations. In 1996 we wrote a preliminary stereoscopic front end for our fast 3D grid code, and the gratifying results convinced us of the feasibility of the approach, given at least 1,000 MIPS of computer power. From 1999 to 2002, under MARS program funding, we completed a first draft of a complete mapping program implementing many ideas suggested by the earlier results. The system uses trinocular color stereo and textured light to range even blank walls, choosing up to 10,000 range values from each trinocular glimpse. Each stereo range is converted to a ray of evidence added to the grid: generally negative evidence up to the range, and positive evidence at the range. The ray’s evidence pattern is controlled by about a dozen parameters that constitute a sensor model. A learning process adjusts the parameters to minimize the color variance when scene images are projected onto the occupied cells of the resulting grids, which greatly improves the map’s quality. A side effect of the learning process is an average color for visible occupied cells.
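The ray-of-evidence update described above can be illustrated with a toy version. This is a sketch under stated assumptions, not the authors' grid code: the function name `add_ray`, the two-parameter sensor model (`neg`, `pos`), the half-cell sampling step, and the use of grid-cell units for all coordinates are choices made for the example, whereas the real sensor model is said to have about a dozen parameters.

```python
import numpy as np

def in_bounds(grid, idx):
    """True if the integer index tuple lies inside the grid."""
    return all(0 <= i < n for i, n in zip(idx, grid.shape))

def add_ray(grid, origin, direction, rng, neg=-0.2, pos=1.0):
    """Add one range reading to an evidence grid (coordinates in cell units).

    Cells crossed before the measured range receive negative (empty) evidence;
    the cell at the measured range receives positive (occupied) evidence.
    """
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    hit = tuple(np.floor(o + rng * d).astype(int))
    visited = set()
    for t in np.arange(0.0, rng, 0.5):          # sample the ray every half cell
        idx = tuple(np.floor(o + t * d).astype(int))
        if idx != hit and idx not in visited and in_bounds(grid, idx):
            grid[idx] += neg                    # evidence that the cell is empty
            visited.add(idx)
    if in_bounds(grid, hit):
        grid[hit] += pos                        # evidence that the cell is occupied
    return grid
```

Because evidence is additive, repeated readings of the same cell from different glimpses reinforce or cancel each other, which is what lets a tuned sensor model improve map quality.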
Many internal images of our new grids, thus colored, can truly be mistaken for photographs of a real location (see Figure 5), and are clearly superior for navigation planning. This proposal would enable us to extend that start towards a universally convincing demonstration of practical navigation, just as the requisite computing power arrives. Our present good results were obtained from a carefully position-calibrated run of cameras through a 10 meter L-shaped corridor area. The next phase of the project will try to derive equally good maps from image sequences collected by imprecisely traveling robots. We have tested direct sampled convolution and FFT-convolution-based approaches to registering each robot view, encoded as a local grid, to the
  15. 15. global map. Both work on our sample data; the dense FFT method gives smoother, more reliable matches, but is several times too slow at present. We will attempt to speed it up, possibly by applying it to reduced-resolution grids or to a subset of the map planes. When we are satisfied with the mapping of uncalibrated results, we will attempt autonomous runs, with new code that chooses paths as it incrementally constructs maps. When the autonomous runs go satisfactorily, we will add code to orchestrate full demonstration applications like patrol, delivery and cleaning. Figure 5: Some views of the constructed 3-D dense map of the scene. The suitability of the grid for navigation is probably best shown in plan view. The image above right was created by a program that mapped each vertical column of cells in the grid to an image pixel. The color of the pixel is the color of the topmost cell in the largest cluster of occupied cells in the column. The plants in the scene are rendered dark because the low cameras saw mostly the dark shadowed underside of the topmost leaves. Using Data from Diverse Sensors. The evidence grid idea was initially developed to construct 2-D maps from Polaroid sonar range data. A sensor model turned each sonar range into a fuzzy wedge of positive and negative regions that were added to a grid in a “weight of evidence” formulation. A later experiment combined stereoscopic and sonar range information in a single grid. The 1996 version of our 3-D grid code was developed for two-camera stereo, but was used shortly thereafter to map data from a scanning laser rangefinder, whose results were modeled as thin evidence rays. Grids can straightforwardly merge data from different spatial sensors. To get good quality, however, not only must the individual sensor models be properly tuned, but the combination of models must be tuned as well.
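To make the FFT-based registration idea mentioned above concrete, here is a minimal 2-D sketch. It is an illustration only, with several simplifying assumptions that are not from the proposal: the local grid and the global map have the same shape, the misalignment is a pure translation, and the correlation is cyclic. The function name `register_shift` is hypothetical.

```python
import numpy as np

def register_shift(global_map, local_grid):
    """Return the cyclic (row, col) shift that best aligns local_grid to global_map.

    The cross-correlation over all shifts is computed at once in the
    frequency domain, and the shift with the highest score is returned.
    """
    F = np.fft.fft2(global_map)
    G = np.fft.fft2(local_grid)
    corr = np.fft.ifft2(F * np.conj(G)).real   # correlation score for every shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return int(dy), int(dx)
```

Running the same function on downsampled grids, or on a few selected horizontal planes of a 3-D grid, corresponds to the two speed-up options the text proposes.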
Our color-variance learning method is suitable for directing the adjustment if the sensor mix contains at least one camera. The spatial sensors, each with its own sensor model, build the grid; the sensor model parameters are evaluated by coloring the grid from the images; and the process repeats with the parameters adjusted in the direction of decreasing variance. Once good settings are found, they can be retained to construct grids during navigation. Different types of environment may be better captured with different parameter settings. A collection of parameter sets trained in different environments can be used adaptively if a robot carries a camera. Each possible model can be used to construct a grid from the same set of recent sensor data, and subjected to color variance evaluation. The model that gives the lowest variance is the most suitable in the given circumstances. D.1.6 Integration of range maps from ladars and trinocular cameras 3-D maps from ladar and trinocular stereo cameras will be combined to give an integrated map with intensity. Three types of information will be used for integration: the registration of the two maps, the resolution of each source, and the uncertainty of each source. Generating 3D maps from multi-view ladar images requires accurate sensor location and orientation. For indoor applications, we plan to use the 3D map built using 3D grids of spatial occupancy evidence to obtain accurate laser sensor location and orientation, and then update the 3D map using laser returns. Laser images can be generated at a much higher frame rate and with better spatial and range accuracies. In particular, using our multi-look active vision approach, we can steer the laser toward a region with low confidence scores from the trinocular cameras, and update the 3D map and associated confidence scores in the region. The confidence of a range obtained from ladar data can be measured from the relative strength of the laser returns.
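As a simple illustration of how the uncertainty of each source can enter the integration, the sketch below fuses two range estimates of the same cell by inverse-variance weighting. This is an assumption made for the example, not the proposal's stated algorithm: it is the Bayesian estimate only under the additional assumption of independent Gaussian errors, and the function name `fuse_ranges` and the numbers used are hypothetical.

```python
def fuse_ranges(r_stereo, var_stereo, r_ladar, var_ladar):
    """Fuse two independent range estimates, weighting each by 1/variance.

    Returns the fused range and its (smaller) variance.
    """
    w_s = 1.0 / var_stereo
    w_l = 1.0 / var_ladar
    fused = (w_s * r_stereo + w_l * r_ladar) / (w_s + w_l)
    fused_var = 1.0 / (w_s + w_l)
    return fused, fused_var


# Stereo says 2.0 m (variance 0.04); ladar says 2.2 m (variance 0.01).
# The fused estimate lies closer to the more certain ladar reading.
fused_range, fused_variance = fuse_ranges(2.0, 0.04, 2.2, 0.01)
```

The fused variance is always smaller than either input variance, which matches the intuition that pointing the laser at low-confidence stereo regions raises the confidence of the map there.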
  16. 16. The trinocular stereo has a measure of uncertainty at each volumetric cell. Thus, integration using the volumetric information from the trinocular module will use a Bayesian estimate for optimal integration. D.1.7 Perception The middle layer in the architecture shown in Figure 2 is the perception layer, which carries out the development of perception capability, performs perception and generates perception-based behaviors. An advantage of our approach is that perceptions for vision, audition and touch are all unified, guided by a set of developmental principles. We have extensive experience in computer vision, visual learning, robot construction, robot navigation, robot object manipulation, speech learning, including sound source localization from microphone arrays and action chaining. Our decade-long effort in enabling a machine to grow its perceptual and behavioral capabilities has gone through four systems: Cresceptron (1991 – 1995), SHOSLIF (1993 – 2000), SAIL (1996 – present) and Dav (1999 – present). Cresceptron is an interactive software system for visual recognition and segmentation. Its major contribution is a method to automatically generate (grow) a network for recognition from training images. The topology of this network is a function of the content of the training images. Due to its general nature in representation and learning, it turned out to be one of the first systems that have been trained to recognize and segment complex objects of very different natures from natural, complex backgrounds (Weng et al. 1997). Although Cresceptron is a general developmental system, its efficiency is low. SHOSLIF (Self-organizing Hierarchical Optimal Subspace Learning and Inference Framework) was the next project, whose goal was to improve the efficiency of self-organization.
It automatically finds a set of Most Discriminating Features (MDF) using Principal Component Analysis (PCA) followed by Linear Discriminant Analysis (LDA), for better generalization. It is a hierarchical structure organized as a tree to reach a logarithmic time complexity. Using it in an observation-driven Markov Decision Process (ODMDP), SHOSLIF has successfully controlled the ROME robot to navigate in MSU’s large Engineering Building in real time using only video cameras, without using any range sensors (Chen & Weng 1998). All the real-time computing was performed by a slow Sun SPARC Ultra-1 Workstation. Therefore, SHOSLIF is very efficient for real-time operation. However, SHOSLIF is not an incremental learning method. The SAIL (Self-organizing, Autonomous, Incremental Learner) robot is the next generation after SHOSLIF. The objective of the SAIL project is to automate the real-time incremental development of robot perceptual and behavioral capabilities. The internal representation of the SAIL robot is generated autonomously by the robot itself, starting with a design of a coarse architecture. A self-organization engine called Incremental Hierarchical Discriminant Regression (IHDR) is the critical technology that meets the stringent real-time, incremental, small sample size, large memory and better generalization requirements (Hwang & Weng 2000). IHDR automatically and incrementally grows and updates a tree (network) of nodes (which remotely resemble cortical areas). Each node contains an incrementally updated feature subspace, derived from the most discriminating features for better generalization. Discriminating features disregard factors that are not related to perception or actions, such as lighting in object recognition and autonomous navigation.
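The "PCA followed by LDA" step described above for SHOSLIF can be sketched in batch form as follows. This is a generic textbook construction written for illustration, not the SHOSLIF or IHDR code: the function name `mdf_projection`, the dimensions `n_pca` and `n_mdf`, and the use of a plain (non-hierarchical, non-incremental) computation are all assumptions.

```python
import numpy as np

def mdf_projection(X, y, n_pca, n_mdf):
    """Return (mean, W) so that (x - mean) @ W gives discriminating features.

    Step 1 (PCA): project onto the top n_pca principal components.
    Step 2 (LDA): within that subspace, find the n_mdf directions that
    maximize between-class scatter relative to within-class scatter.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pca].T                      # (d, n_pca) PCA basis
    Z = Xc @ P                            # samples in PCA space (zero mean)
    Sw = np.zeros((n_pca, n_pca))         # within-class scatter
    Sb = np.zeros((n_pca, n_pca))         # between-class scatter
    for c in np.unique(y):
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        Sb += len(Zc) * np.outer(mc, mc)  # global mean of Z is zero
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[:n_mdf]
    W = P @ evecs[:, order].real          # (d, n_mdf) combined projection
    return mean, W
```

The PCA step keeps the within-class scatter matrix well conditioned when the input dimension is much larger than the number of samples, which is the usual reason for chaining the two methods.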
  17. 17. Figure 6: Partial internal architecture of a single level in the perception layer. The schematic architecture of a single level of the perception layer is shown in Figure 6. Three types of perceptual learning modes have been implemented on SAIL: learning by imitation (supervised learning), reinforcement learning and communicative learning. First, a human teacher pushed the SAIL robot around the Engineering Building several times, using the pressure sensors mounted on the corners of its body. This is learning by imitation. The system generalizes by disregarding areas that are not important to navigation, using the HDR real-time mapping engine. The system runs at about 10 Hz, i.e., 10 updates of navigation decisions per second. In other words, in each 100-millisecond interval, a different set of feature subspaces is used. At later stages, when the robot can explore more or less on its own, the human teacher uses reinforcement learning, pressing its “good” or “bad” button to encourage or discourage certain actions. These two learning modes are sufficient to conveniently teach the SAIL robot to navigate autonomously in unknown environments. Recently, we have successfully implemented the new communicative learning mode on the SAIL robot. First, in the language acquisition stage, we taught SAIL simple verbal commands, such as “go ahead,” “turn left,” “turn right,” “stop,” “look ahead,” “look left,” “look right,” etc., by speaking to it online while guiding the robot to perform the corresponding action. In the next stage, teaching using language, we taught the SAIL robot what to do in the corresponding context through verbal commands.
For example, when we wanted the robot to turn left (by a fixed heading increment), we told it to “turn left.” When we wanted it to look left (also by a fixed increment), we told it to “look left.” This way, we did not need to physically touch the robot during training and instead used much more sophisticated verbal commands. This makes training more efficient and more precise. Figure 7 shows the SAIL robot navigating in real time along the corridors of the Engineering Building, at a typical human walking speed, controlled by the SAIL-3 perception development program.
  18. 18. Figure 7: Left: SAIL developmental robot custom built at Michigan State University. Middle and right: SAIL robot navigates autonomously using its autonomously developed visual perceptual behaviors. Four movies are available online. Figure 8 shows the graphical user interface for humans to monitor the progress of online grounded speech learning. Internal attention for vision, audition and touch is a very important mechanism for the success of multimodal sensing. A major challenge of perception for high-dimensional data inputs such as vision, audition and touch is that often not all the lines in the input are related to the task at hand. Attention selection lets only a bundle of relevant lines pass through while the others are blocked. Attention selection is an internal effector, since it acts on the internal structure of the “brain” instead of the external environment. First, each sensing modality (vision, audition and touch) needs intra-modal attention to select a subset of internal output lines for further processing while disregarding the other, unrelated lines. Second, inter-modal attention selects a single modality or multiple modalities for attention. Attention is necessary not only because our processors have limited computational power but, more importantly, because focusing on only the related inputs enables powerful generalization. Figure 8: The GUI of AudioDeveloper: (a) During online reinforcement learning, multiple actions are generated; (b) After the online learning, only the correct action is generated.
  19. 19. We have designed and implemented a sensory mapping method, called "Staggered Hierarchical Mapping (SHM)," shown in the figure below, and its developmental algorithm. Its goals include: (1) to generate feature representations for receptive fields at different positions in the sensory space and with different sizes and (2) to allow attention selection for local processing. SHM is a model motivated by human early visual pathways, including processing performed by the retina, the Lateral Geniculate Nucleus (LGN) and the primary visual cortex. A new Incremental Principal Component Analysis (IPCA) method is used to automatically develop orientation-sensitive and other needed filters. From sequentially sensed video frames, the proposed algorithm develops a hierarchy of filters, whose outputs are uncorrelated within each layer, but with increasing scale of receptive fields from low to high layers. To study the completeness of the representation generated by the SHM, we experimentally showed that the response produced at any layer is sufficient to reconstruct the corresponding "retinal" image to a great degree. This result indicates that the internal representation generated for receptive fields at different locations and sizes is nearly complete in the sense that it does not lose important information. Figure: The architecture of sensory mapping, which allows not only a bottom-up response computation, but also a top-down attention selection. The oval indicates the lines selected by the attention selector. (Figure labels: input positions 0, 16, 32, 48; Output of SHM; Cognitive Mapping (HDR).) The attention selection effector is internal and thus cannot be guided from the “outside” by a human teacher. The behaviors for internal effectors can be learned through reinforcement learning and communicative learning.
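To show what an incremental principal component estimate can look like, the sketch below updates an estimate of the first principal direction one sample at a time, without ever forming a covariance matrix. It is offered as an assumption-laden illustration, not as the IPCA method used in SHM: it handles only the first component, assumes zero-mean input with a nonzero first sample, and the function name `incremental_first_pc` is hypothetical.

```python
import numpy as np

def incremental_first_pc(samples):
    """Estimate the first principal direction from a stream of zero-mean samples.

    Each new sample u pulls the running estimate v toward u, weighted by
    how strongly u responds along the current direction of v.
    """
    v = None
    for n, u in enumerate(samples, start=1):
        u = np.asarray(u, dtype=float)
        if v is None:
            v = u.copy()                  # initialize with the first sample
        else:
            response = (u @ v) / np.linalg.norm(v)
            v = (n - 1) / n * v + (1.0 / n) * response * u
    return v / np.linalg.norm(v)          # unit-length direction
```

Because each update costs time proportional to the input dimension only, an estimator of this kind can in principle keep up with sequentially sensed video frames.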
  20. 20. D.1.8 Human and face tracking and recognition Due to the central role that humans play in human-robot interaction, the perception layer contains a dedicated face recognition subsystem to locate, track and recognize human faces. When the main perception layer detects the possible presence of a human face, the human face module is applied automatically. Prior knowledge about humans is used in programming this subsystem. This is an example of how a general perception system can incorporate a special-purpose subsystem. Figure 9: The face recognition subsystem. In the human face module, we plan to efficiently locate and track faces for authentication in a dynamic scene by using skin color and the temporal motions of human faces and bodies. We propose a subsystem (see Figure 9) that detects and tracks faces based on skin color and facial components (e.g., eyes, mouth, and face boundary), estimation of face and body motions, and motion prediction of facial components and the human body. In our approach, video color is normalized by estimating the reference-white color in each frame. Detection of skin color is based on a parametric model in a nonlinearly transformed chrominance space (Hsu et al. 2002). Motion is detected by both frame differencing and background subtraction, where the background of a dynamic scene is smoothly and gradually updated. Facial components are located using the luminance and chrominance information around the extracted skin patches and their geometric constraints. Parametric representations of these facial components can then be generated for motion prediction on the basis of Kalman filtering. Human bodies are detected by matching human silhouette models. The detected bodies can provide information on human gaits and motions. Human faces are detected and tracked based on the coherence of locations, shapes, and motions of detected faces and detected bodies.
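The motion detection step described above (frame differencing plus background subtraction with a gradually updated background) can be sketched as follows. This is a minimal grayscale illustration under assumptions not taken from the proposal: the two cues are combined with a logical OR, the background is a running average with rate `alpha`, and the function name `motion_mask` and the threshold value are hypothetical.

```python
import numpy as np

def motion_mask(frame, prev_frame, background, alpha=0.05, thresh=25.0):
    """Return (mask, new_background) for one grayscale frame.

    mask marks pixels that differ from the previous frame or from the
    background model; the background is then blended slowly toward the
    current frame so that gradual scene changes are absorbed.
    """
    frame = np.asarray(frame, dtype=float)
    prev_frame = np.asarray(prev_frame, dtype=float)
    background = np.asarray(background, dtype=float)
    moved_since_prev = np.abs(frame - prev_frame) > thresh    # frame differencing
    differs_from_bg = np.abs(frame - background) > thresh     # background subtraction
    mask = moved_since_prev | differs_from_bg                 # assumed combination rule
    new_background = (1.0 - alpha) * background + alpha * frame
    return mask, new_background
```

In a full pipeline the resulting mask would be intersected with the skin color detections to obtain moving skin patches, as in the example of Figure 10.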
The detected facial components are aligned with a generic face model through contour deformation, resulting in a face graph represented at a semantic level. Aligned facial
  21. 21. components are transformed to a feature space spanned by Fourier descriptors for face matching. The semantic face graph allows face matching based on selected facial components, and also provides an effective way to update a 3D face model based on 2D images (Hsu & Jain, 2002b). Figure 10 shows an example of detection of motion and skin color (Hsu & Jain MSU 2002a). Figure 11 gives an example of tracking results without prediction (Hsu & Jain MSU 2002a). Figure 12 shows the construction of a 3D face model (Hsu & Jain 2001) and face matching by using 2D projections of the 3D model and the hierarchical discriminant regression algorithm (Hwang & Weng 2000). Figure 10: An example of motion detection in a video frame: (a) a color video frame; (b) extracted regions with significant motion; (c) detected moving skin patches shown in pseudocolor; (d) extracted face candidates described by rectangles. Figure 11: An example of tracking results on a sequence containing 5 frames of two subjects is shown in (a)-(e) every 2 sec. Each detected face is described by an ellipse and an eye-mouth triangle. Note that in (d) two faces are close to each other; therefore, only face candidates are shown. Figure 12: Face modeling and face matching: (a) input color image; (b) input range image; (c) face alignment (a generic model shown in red, and range data shown in blue); (d) a synthetic face; (e)
  22. 22. the top row shows the 15 training images generated from the aligned 3D model; the bottom row shows 10 test images of the subject captured from a CCD camera. All the 10 test images of the subject shown in the bottom row were correctly matched to our face model. Planner We have designed and implemented a hierarchical architecture and its developmental program for planning and reasoning at different levels. Symbolically, the perception-based hierarchical planning can be modeled as $C_c \to C_{s1} \to A_{s1} \to C_{s2} \to A_{s2} \Rightarrow C_c \to A_{s1} \to A_{s2}$, where $C_c$ is a higher-level goal, and $C_{s1}$ and $C_{s2}$ are lower-level goals which lead to behaviors $A_{s1}$ and $A_{s2}$, respectively. $\to$ means “followed by,” and $\Rightarrow$ means “develops.” The robot was first taught how to produce planned action $A_{s1}$ given subgoal $C_{s1}$ and planned action $A_{s2}$ given subgoal $C_{s2}$. The robot is then given a higher-level goal $C_c$, without being given subgoals $C_{s1}$ and $C_{s2}$. It is supposed to know how to produce behaviors $A_{s1}$ and $A_{s2}$ consecutively. Note that we have called the above capability action chaining; the mechanism is the same, except that in planning the command is the goal. For planning, each goal has alternative actions, and the action is only recalled “in the premotor cortex” and is not actually executed. This new approach to planning can take into account rich context information, such as: • Context: when the time is not tight. Goal: go to landmark 1 from start. Plan: take action 1. • Context: when the time is tight. Goal: go to landmark 1 from start. Plan: take action 1a. • Context: when the time is not tight. Goal: go to landmark 2. Plan: take actions 1 and 2 consecutively. • Context: when the time is tight. Goal: go to landmark 2. Plan: take actions 1a and 2a consecutively. If each of the above lines has only one learned sequence, we say that the corresponding planning scheme has been learned.
Otherwise, the program will evaluate the performance of each plan and select the best one. This is the planner learning stage. In other words, the goal of planner learning is to train the robot so that the action sequence for each given context is unique. When the condition associated with a plan changes, the planner will run again to reach the best plan. The feature of this type of planning is that it accommodates updated environmental conditions. Each of the above cases is basically equivalent to the action chaining mechanism that we have designed and implemented. The major difference is that the action is rehearsed but not executed. As we can see, given only a goal, there may be multiple possible plans. When the context (or new changes in the goal or performance measurements) is also given, the possible plan becomes unique. Figure 13 illustrates the information causality during planning. Figure 14 shows how the SAIL robot learns planning through abstract composite goals (equivalently, commands).
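The context-and-goal examples above can be mimicked with a small lookup table. This is only a symbolic toy, not the developmental planner itself, which learns these associations from perception rather than from a hand-written table. The names `PLANS`, `select_plan` and `prefer_fast`, the second candidate plan for the last entry, and the scoring rule are all hypothetical.

```python
# (context, goal) -> list of learned candidate action sequences
PLANS = {
    ("time not tight", "landmark 1"): [["action 1"]],
    ("time tight", "landmark 1"): [["action 1a"]],
    ("time not tight", "landmark 2"): [["action 1", "action 2"]],
    # Two candidates on purpose, to exercise the evaluation step.
    ("time tight", "landmark 2"): [["action 1", "action 2a"],
                                   ["action 1a", "action 2a"]],
}

def select_plan(plans, context, goal, score):
    """Recall the plan for (context, goal) without executing any action.

    A single learned sequence is returned directly; if several exist,
    each is evaluated with score() and the best one is chosen.
    """
    candidates = plans.get((context, goal), [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    return max(candidates, key=score)

def prefer_fast(plan):
    """Hypothetical performance measure: count the fast ('a') action variants."""
    return sum(step.endswith("a") for step in plan)
```

For example, `select_plan(PLANS, "time tight", "landmark 2", prefer_fast)` recalls the sequence of actions 1a and 2a, and changing the context to "time not tight" recalls actions 1 and 2 instead.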
  23. 23. Figure 13: Internal mechanism of the two-level abstraction architecture. Figure 14: Planning through abstraction of action sequences. D.1.9 Servo control The lowest layer in Figure 2 is the servo controller layer. Knowledge can be represented in the machine in the form of a connectionist model, such as a neural network or a differential equation, as well as in the form of a symbolic model, such as a rule-based system, a semantic network or a finite state machine. Furthermore, human commands may also be of two types: logic decisions and continuous control. The key step in developing integrated human/machine systems is to develop a system model or knowledge representation that is capable of combining symbolic and connectionist processing, as well as logic decisions and continuous control. To achieve this goal, the following specific problems must be investigated and solved: 1. Developing a perceptive frame: A machine, specifically an autonomous system, has its own action reference. The tasks and actions of the system are synchronized and coordinated according to the given action reference. Usually, this action reference is time. A task schedule or action plan can be described with respect to time. It is understandable to use time as the action reference, since it is easy to obtain and can be referenced by different entities of a system. Humans, however, rarely act by referencing a time frame. Human actions are usually based on human perceptions. These different action references make it very difficult to develop a human/machine cooperative control scheme. A unified action reference frame that matches human perceptions with the sensory information is the key to combining human reasoning/command with autonomous planning/control. The important elements of human perceptions are "force" and "space" (geometric). They are directly related to human actions and interactions with the environment.
The space describes the static status of actions, and force represents the potential or actual change in that status, which describes the dynamic part of the actions. Space and force are also fundamental elements of machine actions. The essence of interactions between humans and machines can also be embodied by these two physical quantities. Therefore, force and space can be used as essential action references for an integrated human/machine
  24. 24. system. A perceptive frame, which will be developed based on these action references, is directly related to the cooperative action of a human/machine system in that it provides a mechanism to match human perceptions and sensory information. As a result, human/machine cooperative tasks can be easily modeled, planned and executed with respect to this action reference frame. 2. Combining symbolic/connectionist representation and logic/continuous control by a Max-Plus Algebra model: A new system model based on the perceptive frame will be developed for the analysis and design of task planning and control of integrated human/machine systems. The perceptive frame provides a platform to combine the symbolic/connectionist representation of the autonomous plan and control with human logic/continuous commands. The logical dependency of actions, and task coordination and synchronization, can be modeled by a Max-Plus Algebra model with respect to the perceptive frame. This will facilitate an analytical method for the modeling, analysis and design of human/machine cooperative systems. New analysis and design tools are expected to be developed for integrated human/machine systems described in a perceptive reference frame. As a result, the integrated human/machine systems will not only have stable and robust performance, but also have behaviors that are independent of the operators of the systems. This is essential for an integrated human/machine system to achieve reliable and flexible performance. 3. Designing a computing and control architecture for an integrated human and machine system: The proposed planning and control scheme is based on the perceptive frame model. Time is no longer an action reference. Therefore, the system synchronization will depend entirely on the sensory information. This poses a new challenge for designing the computing architecture.
A distributed computing and control architecture will be designed based on a multithreaded architecture.
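The Max-Plus Algebra model named in item 2 can be illustrated with a small sketch. In max-plus algebra, "addition" is max and "multiplication" is ordinary addition, so an action that must wait for several predecessors completes at the maximum over its dependencies. The three-action task, the matrix `A`, and its numeric values below are hypothetical and chosen only for illustration; they are not taken from the proposal. The recursion is indexed by an event cycle, and the values are measured along whatever action reference is chosen rather than clock time.

```python
import numpy as np

NEG_INF = -np.inf  # max-plus "zero": no dependency between two actions


def maxplus_matvec(A, x):
    """Max-plus product: result[i] = max over j of (A[i, j] + x[j])."""
    return np.max(A + x[np.newaxis, :], axis=1)


# Hypothetical task with three actions. A[i, j] is the amount (in units of the
# action reference) that action i needs after action j has completed;
# NEG_INF means action i does not depend on action j.
A = np.array([
    [2.0, NEG_INF, NEG_INF],  # action 0 depends only on its own previous cycle
    [3.0, 1.0, NEG_INF],      # action 1 waits for action 0 and for itself
    [NEG_INF, 4.0, 2.0],      # action 2 waits for action 1 and for itself
])

x = np.zeros(3)               # completion values of cycle 0
history = [x]
for _ in range(3):
    x = maxplus_matvec(A, x)  # completion values of the next cycle
    history.append(x)

for cycle, values in enumerate(history):
    print(cycle, values)
```

Each row of `A` encodes the logical dependency of one action, and the max operation expresses synchronization: an action cannot complete before the slowest of its prerequisites.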
  25. 25. D.2 Comparison with Current Technology
D.2.1 Integration
The existing traditional technologies are not suited for integration for the following major reasons.
1. Individual component technology cannot work in an uncontrolled environment because human hand-designed features (such as color and texture) are not enough for them to distinguish different objects in the world.
2. Applicability of such an integrated system is low. Since each component technology only works in a special setting, the intersection of these settings gives a nearly null set: there is almost no environmental situation under which these component technologies can all work.
3. In order for the integrated method to deal with an uncontrolled world, one must have an applicability checker, which determines which component technology works and which does not. Unfortunately, no such applicability checker for uncontrolled environments exists. This is called the applicability checker problem, discussed in (Weng & Chen 2000).
D.2.2 Evaluation
In the past, robot software was written for a specific task. The evaluation work here is, for the first time, for robot software that is cross-platform and cross-task. Therefore, the proposed evaluation work is new and is a significant advance of the state of the art.
D.2.3 3-D Dense Map construction
3-D map construction and terrain classification using ladar data collected from a helicopter is a relatively new problem. Under the DARPA-funded PerceptOR program, we have developed ladar mapping software which is an order of magnitude faster than our competitors. The proposed effort will benefit from our previous experience. Ladars have been used extensively in manufacturing and outdoor environments. For indoor applications, low-power, eye-safe ladars can give accurate range measurements for quality control and robot positioning.
For outdoor applications, ladars provide a convenient and accurate ranging capability, but are limited by atmospheric attenuation and some obscurants, so that accuracy suffers as range increases. Our approach is geared toward a direct-detection ladar with high range resolution but relatively low cross-range resolution (typical for robot/ground vehicle based ladar sensors). There are impressive programs that use clever statistical methods to construct and maintain two-dimensional maps from robot traverses. Most of these, unlike our 3D methods, run in real time on present hardware. The most effective use high-quality laser range data (primarily from Sick AG scanners). Laser ranges have far fewer uncertainties than sonar or stereoscopic range data, avoiding many of the difficulties that our grid methods were developed to solve. Yet no 2D method has demonstrated the reliability necessary to guide robots through unknown facilities. A major problem is hazards located outside the plane of the 2D laser scan. A secondary problem is the monotonous appearance of indoor spaces when mapped on a constant-height plane: many areas strongly resemble other areas, and the local map configuration characterizes global location very ambiguously. Sebastian Thrun’s group at CMU has equipped a robot with a second Sick scanner oriented vertically. The motion of the robot adds a third dimension to the laser’s 2D scan. Thrun’s group uses a surface-patch model to characterize the traversed architecture, and projects a camera image onto the surface patches.
  26. 26. The system is able to provide a 3D map suitable for human consumption in real time. In its present form the system navigates only by means of a second horizontal 2D scanner. The vertical scan provides 3D information only of positions that the robot has already passed. It should be possible to use the idea for navigation by, say, placing a scanner high, looking down ahead at 45 degrees. Yet we believe the approach has weaknesses. The planar representation becomes increasingly expensive and uncertain as objects become complex (e.g. plants). Since it has limited means to statistically process data and filter noise, the system depends on the clean signal from a laser rangefinder, which requires light emission. By contrast, our system benefits from textured illumination, especially to range clean walls, but can work with passive illumination, especially in natural or dirty surroundings, where surface roughness or dirt provides texture. The following is an argument by analogy as to why we expect that grid methods will displace surface methods with near-future increases in computer power.
D.2.4 Perception
Designing and implementing a developmental program is systematic and clearly understandable using mathematical tools. Designing a perception program and its representation in a task-specific way using a traditional approach, however, is typically very complex, ad hoc and labor intensive. The resulting system tends to be brittle. Design and implementation of a developmental program are of course not easy. However, the new developmental approach is significantly more tractable than the traditional approaches to programming a perception machine. Further, it is applicable to uncontrolled real-world environments, and is the only approach capable of this. Due to its cross-environment capability, SAIL has demonstrated vision-guided autonomous navigation capability in both complex outdoor and indoor environments.
The Hierarchical Discriminant Regression (HDR) engine played a central role in this success (Hwang & Weng 2000). Although ALVINN at CMU (Pomerleau 1989) can in principle be applied indoors, the local minima and loss of memory problems with artificial intelligence make it very difficult to work in complex indoor scenes. SAIL has successfully developed real-time, integrated multimodal (vision, audition, touch, keyboard and wireless network) human-robot interaction capability, allowing a human operator to enter different degrees of intervention seamlessly. A basic reason for achieving this extremely challenging capability is that the SAIL robot is developed to associate tens of thousands of multimodal contexts in real time in a grounded fashion, which is another central idea of AMD. Some behavior-based robots such as Cog and Kismet at MIT interact with humans online, but they are hand programmed offline. They cannot interact with humans while learning. In SAIL, perception-based action chaining develops complex perception-action sequences (or behaviors) from simple perception-action sequences (behaviors) through real-time online human-robot interactions, all in the same continuous operational mode. This capability appears simpler than it really is. The robot must infer context in a high-dimensional perception vector space. It generates new internal representation and uses it for later context prediction, which is central for scaling up in AMD. David Touretzky’s Skinnerbot (Touretzky & Saksida 1999) does action chaining, but it does so through preprogrammed symbols, and thus the robot is not applicable to unknown environments.
D.2.5 Face recognition subsystem
Detecting and tracking human faces plays a crucial role in automating applications such as video surveillance.
According to the tracking features used, various approaches to face tracking (Hsu & Jain 2002a) can be categorized into three types: (i) the methods using low-level features such as facial landmark points, (ii) the 2D template-based methods, and (iii) those using high-level models such as 2D or 3D (deformable) models. Most tracking approaches focus on a single moving subject. Few methods
  27. 27. directly deal with tracking multiple faces in videos. Although it is straightforward to extend the task of face tracking for a single subject to that for multiple subjects (e.g., finding the second-largest blob for the second target), it is still challenging to track multiple interacting human faces across a wide range of head poses, occlusions, backgrounds, and lighting conditions. We propose a new method to detect (Hsu et al. 2002) and track (Hsu & Jain 2001a) faces based on the fusion of information derived from motion of faces and bodies, skin-tone color, and locations of facial components. Tracked faces and their facial components are used for face identification/recognition. The main challenge in face recognition is to be able to deal with the high degree of variability in human face images, especially with variations in head pose, illumination, and expression. We propose a pose-invariant (Hsu & Jain 2001) approach for face recognition which is based on a 3D face model, and a semantic (Hsu & Jain 2002b) approach which is based on semantic graph matching.
D.2.6 Planner
Given a goal (which consists of a fixed destination and a fixed set of performance criteria), the currently popular MDP-based planning methods (see, e.g., Kaelbling et al. 1996) for sequential decision making require much exploration throughout the world model. When the destination is changed or the set of performance criteria (e.g., the weights between safety and distance) is modified, the past planning knowledge is not usable: a new, time-consuming iterative training procedure through the site model must be run. Online re-planning, that is, modifying a plan after receiving new information about the task, has been difficult. Further, hierarchical planning is necessary for planning at different abstraction levels. For example, to reach a target location, a robot needs to plan at a coarse level, such as reaching several landmarks.
Moving from one landmark to the next requires planning at a finer level: to reach landmark 1, how to make a turn, how to go straight, etc. Our perception-based action chaining work is precisely designed for such applications.
D.2.7 Servo controller
Human/machine cooperation has been studied for many years in robotics. In the past the research has focused mainly on human knowledge acquisition and representation (Liu & Asada 1993, Tso & Liu 1993). It includes (i) robot programming, which deals with the issue of how to pass a human command to a robotic system (Mizoguchi et al. 1996, Kang & Ikeuchi 1994). This process usually happens off-line, i.e. before the task execution, so humans have no real role during task execution; and (ii) teleoperation, in which a human operator can pass action commands to a robotic system on-line (Kosugen et al. 1995). In the above two cases, the human operator has a deictic role and the robot is a slave system which executes the human program/command received either off-line or on-line. Recently, there has been research on involving human beings in the autonomous control process, such as human/robot coordinated control (Al-Jarah & Zheng 1996, Yamamoto et al. 1996). The human, however, is introduced to the system in a role similar to that of a robot. In recent years, several new schemes have been developed for integrated human/machine systems (Xi et al. 1996, Xi et al. 1999). In particular, the function-based sharing control scheme (Brady et al. 1998) has been developed and successfully implemented in DOE's Modified Light Duty Utility Arm (MLDUA), which was tested in nuclear waste retrieval operations in Gunite Tanks at Oak Ridge National Laboratory in 1997. In addition, the development of Internet technology has provided a convenient and efficient means of communication for integrated human/machine systems. It further enables humans and machines at different locations to cooperatively control operations (Xi & Tarn 1999).
The theoretical issues related to integrated human/machine systems have also been studied (Xi & Tarn 1999).
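The re-planning cost described in D.2.6 can be made concrete with a minimal sketch. It assumes a hypothetical one-dimensional corridor world with deterministic left/right moves and a unit step cost, solved by plain value iteration; none of these details come from the proposal, and the function name `value_iteration` is illustrative only. The point is that the value table computed for one destination says nothing useful about another, so the iterative procedure has to be run again.

```python
def value_iteration(n_states, goal, step_cost=1.0, tol=1e-9):
    """Return (values, sweeps) for a deterministic corridor of n_states cells.

    The agent may move one cell left or right; each move costs step_cost and
    the goal cell is absorbing with value 0.
    """
    values = [0.0] * n_states
    sweeps = 0
    while True:
        sweeps += 1
        delta = 0.0
        for s in range(n_states):
            if s == goal:
                continue
            neighbors = [t for t in (s - 1, s + 1) if 0 <= t < n_states]
            best = max(-step_cost + values[t] for t in neighbors)
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update
        if delta < tol:
            return values, sweeps


# Plan for a destination at the right end of a five-cell corridor.
values_right, sweeps_right = value_iteration(5, goal=4)

# Changing the destination invalidates the old table: the whole iterative
# procedure is run again from scratch.
values_left, sweeps_left = value_iteration(5, goal=0)

print(values_right, sweeps_right)
print(values_left, sweeps_left)
```

In this toy world each value is simply the negative distance to the goal, so the two tables are mirror images of each other; in a realistic site model the second run repeats the full exploration.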
  28. 28. E Statement of Work
E.1 Integration
The integration includes the following work:
1. Design and implement the API for plug-and-play.
2. Work with Maryland and CMU for integration of 3-D maps from ladars and from trinocular stereo.
3. Integrate the combined 3-D map into the CACI software system.
4. The work proposed here will provide environmental perception in terms of humans using a model-based approach to perceive both static and moving objects. Human objects will be dealt with exclusively by the face detection and recognition system because of the highly stringent requirements for correct recognition of humans. Other more general objects will be recognized by the perception level.
5. Work with the planning group, the perception group, the face recognition subsystem group and the servo controller group to design and implement the state integrator.
6. Work with the planning group, the perception group, the face recognition subsystem group and the servo controller group to design and implement the action integrator.
7. Design and implement the integration of the entire CACI software system.
E.2 Evaluation
The evaluation work includes:
1. Design of the specification of the evaluation criteria.
2. Design of the test specification for the performance of component technology.
3. Coordination of the tests for component technology.
4. Coordination of the test for the overall system.
5. Collection of test data and report for the test.
6. Tools for robot self-monitoring to support the systematic assessment of perception and behavior performance in terms of quantitative metrics.
E.3 3D map generation from ladars
Ground robots/vehicles generally have difficulties in perceiving negative obstacles at a distance. In this work the UMD team proposes to develop 3-D map generation software using overhead as well as ground laser range finders (ladars).
Ladar sensors can capture both geometric and material information about the scene/environment, and they operate day and night and in all weather conditions. The 3-D maps generated can then serve as a site model of the environment through which cross-platform, cross-task, and cross-environment fusion can be accomplished relatively easily. Using the site model, much prior site information can be effectively incorporated in robot missions. 3-D map generation and exploitation is a key enabling technology in the proposed effort. The UMD team proposes to develop the following technologies under this project:
1. Develop multi-look ladar sensor control to generate a higher-resolution image of the field of interest. A graphical user interface with a ladar simulator will be developed and demonstrated for multi-look sensor control.
2. Develop real-time multi-path ladar fusion and multi-layer 3-D map generation algorithms.
3. Develop 3-D map supported dynamic robot positioning and video/ladar image registration algorithms.
  29. 29. 4. Support system integration and technology demonstration.
E.4 3D map generation from trinocular stereo
Using precision images collected in May 2001, we will continue to explore ideas to further improve and accelerate our 3D mapping program, and work towards applications, including those requiring a substantial user interface. We've described a dozen pending developments in our most recent DARPA MARS report. In parallel we will develop a camera head for collecting a greater variety of new test data.
The head will have trinocular cameras, a fourth training camera, a 100-line laser textured light generator, a 360 degree pan mechanism and a high-end controlling laptop. It will be mounted on borrowed robots, and in the second year be used to demonstrate autonomous and supervised navigation. The 3-D maps from the ladars will be integrated with the counterpart from trinocular stereo, to take the best possible result from both sensing methods. The above two parts of the work result in an integrated sensor-based algorithm that will support path-referenced perception and behavior. It will provide a perception-based representation of the path at various levels of abstraction by combining the 3-D map and the output of perception from the perception level.
E.5 Perception
The perception involves vision, audition and touch, as well as the generation of perception-based behaviors for all the effectors of the platform. We have demonstrated a series of perceptual capabilities with the SAIL robot while allowing a rich set of modes of operator intervention. The proposed work includes development of the following:
1. Tools that exploit operator intervention to enable the robot to fully “experience” its operating environment even when the human operator intervenes. They will also allow operator intervention at a number of different levels, including behavior selection and perception selection. The higher planning level intervention will be accomplished by the planner. These new capabilities have been demonstrated by our current MARS project, and we will make the tools more integrated and more user friendly.
2. Tools for machine learning and adaptation. They support behavior selection, behavior parameter tuning, and perceptual classification. With the evaluation work discussed above, we will quantitatively assess and validate specific techniques in specific system roles. We have already provided some performance measurement of learning and adaptation techniques.
In the proposed work, we will make them applicable to multiple platforms.
  30. 30. 3. Software components for interaction between robots and humans. They enable interaction with human operators as well as other robots located in the robot’s physical environment. We will also provide a high-level command interface to view a group of autonomous robots, in cooperation with the planner and the controller levels.
E.6 Face detection and recognition
To fulfill the work of dynamic face identification, we plan to implement the face recognition subsystem in three major modules: (i) face detection and tracking, (ii) alignment of face models, and (iii) face matching. The detection and tracking module first finds the locations of faces and facial components and the locations of human bodies and human gaits in a color image. Then it predicts the motions of faces and bodies to reduce the search regions for the detected faces. The face alignment module includes the estimation of head pose, the 2D projection (a face graph) of a generic 3D face model, and the alignment of a face graph and the input face image. The estimation of head pose is based on the arrangement of facial components inside a face region. The 2D projected face graph is generated by rotating a 3D face model to the estimated view. The alignment of the face graph is based on contour deformation. In the matching module, an aligned face graph is first transformed into a feature space using facial (Fourier) descriptors, and then is compared with the descriptors of template graphs obtained from the face database. The face comparison is based on derived component weights that take into account the distinctiveness and visibility of human faces.
E.7 Planner
We will develop the planner as an integration of the planning capability of our developmental method, which we have demonstrated, with an interface that allows a human overseer to supply updated performance criteria and a new goal to the planner for re-planning. The work includes: 1.
Design the planner to integrate the SAIL planner with other existing technologies such as the MDP-based planner.
2. Implement the planner based on our prior DARPA work.
3. Evaluate the strengths and weaknesses of the proposed planner in allowing more intimate and real-time interactions with the operator.
4. Test the planner with real-world planning programs to study the performance.
E.8 Servo Controller
The work for the servo controller includes:
1. Develop a human/machine cooperative paradigm to optimally map a task to heterogeneous human and machine functions.
2. Design a perceptive action reference frame for modeling an integrated human/machine system.
3. Develop a heterogeneous function-based cooperative scheme to combine autonomous planning/control with human reasoning/command in a compatible and complementary manner.
4. Develop a user-friendly human/robot interface to implement the human/machine cooperative planning and control methods.
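The matching step described in E.6, in which a contour is transformed into Fourier descriptors and compared with template descriptors, can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed subsystem: the function names `fourier_descriptor` and `match` are hypothetical, the contour is assumed to be an ordered, counter-clockwise, closed sequence of (x, y) points, the normalization shown is one common convention, and the component weights mentioned in E.6 are omitted (a plain Euclidean distance is used instead).

```python
import numpy as np


def fourier_descriptor(contour, n_coeffs=8):
    """Descriptor invariant to translation, scale, rotation and start point.

    contour: array of shape (N, 2), ordered counter-clockwise, with N > 2 * n_coeffs.
    """
    z = contour[:, 0] + 1j * contour[:, 1]   # treat points as complex numbers
    mags = np.abs(np.fft.fft(z))             # magnitudes discard rotation and start point
    mags = mags / mags[1]                    # divide by first harmonic: scale invariance
    # Skipping index 0 (the DC term) gives translation invariance. Keep the
    # lowest positive and negative frequencies.
    return np.concatenate([mags[1:n_coeffs + 1], mags[-n_coeffs:]])


def match(descriptor, templates):
    """Return the name of the template closest in Euclidean distance."""
    return min(templates, key=lambda name: np.linalg.norm(descriptor - templates[name]))


# Two toy template contours standing in for entries of a face database.
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
ellipse = np.column_stack([3.0 * np.cos(t), np.sin(t)])
circle = np.column_stack([np.cos(t), np.sin(t)])
templates = {
    "ellipse": fourier_descriptor(ellipse),
    "circle": fourier_descriptor(circle),
}

# Query: the ellipse scaled, rotated and shifted.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
query = 2.5 * ellipse @ R.T + np.array([10.0, -3.0])
best = match(fourier_descriptor(query), templates)
print(best)
```

Because the descriptor ignores position, size and orientation, the transformed query still matches its template; a weighted distance could replace the Euclidean norm to reflect the distinctiveness and visibility of individual components.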
  31. 31. F Schedule and Milestones
F.1 Schedule Graphic (1 page)
Quarters: Q1, Y1; Q2, Y1; Q3, Y1; Q4, Y1; Q1, Y2; Q2, Y2; Q3, Y2; Q4, Y2
Integration: Specification, Improvement, Tests, Demo
Evaluation: Criteria, Refinement, Data, Demo
3Dladar: LadarS, MapGen, Posit, Integ, Demo
3Dstereo: Prior, Vector, Local, Navig, Demo
Perception: Experience, Adaptation, Interaction, Integration, Demo
FaceSubsys: RealT, Desg, Inteff, Recog, Inte, Demo
Planner: Design, Implement, Evaluate, Improve, Demo
Servo: Paradigm, Frame, Hetero, Interface, Demo
  32. 32. F.2 Detailed Individual Effort Descriptions
Integration:
• Specification: Design the specification for the performance measurement of the entire system. Translate the system-wide specification into component specifications.
• Improvement: Perform several design iterations, according to the limits, strengths, and cost of the component technology.
• Tests: Perform preliminary tests with component technology to investigate the overall potential for improvement.
• Demo: Finalize the integration scheme and make the demo.
Evaluation:
• Criteria: Investigate the criteria that the defense requires for real-world applications.
• Refinement: Refine the criteria from testing of several component technologies.
• Data: Test the performance of the components and the overall system and collect rigorous performance evaluation data.
• Demo: Make integrated tests with evaluation and make the final demo.
3Dladars:
• Ladar S: Develop a ladar simulation algorithm for algorithm development and performance evaluation. Develop algorithms to mosaic/fuse ladar images acquired from different locations and/or with different sensor configurations.
• MapGen: Multi-layer 3D map (floor, obstacles, reflectivity, etc.) generation. Object classification based on intensity and geometry, and possibly video images.
• Posit: Autonomous robot-to-site-model positioning algorithm. Autonomous video/ladar image registration algorithm.
• Integ: Support integration of the 3-D maps for cross-platform, cross-task, and cross-environment operations.
• Demo: Participate in the final demonstration and software demonstration.
3Dstereo: Year 1: Mapping program development. Year 2: Applications demonstration development.
• Prior: Prior Probe, Two Thresholds, Imaging by Ray. Interactive Visualization, Camera Head Component Selection
• Vector: Vector Coding, Camera Head Fixture Design and Assembly.
Code Integration, Mapping Demo
• Local: FFT Localization, Camera Interface to Robot 1. Field Image Tuning, Robot 1 Camera Head Data Collection Runs
• Navig: Navigation Demo Code, Robot 2 Interface, Controlled Run Testing
• Demo: Autonomous Runs, Code Cleanup, Final Report, Navigation Demo
Perception:
• Experience: Develop tools that exploit operator intervention to enable the robot to fully “experience” its operating environment even when the human operator intervenes.
• Adaptation: Develop tools for machine learning and adaptation.
  33. 33. • Interaction: Software components for interaction between robots and humans.
• Integration: Integrate tools for machine “experience” intervention, machine learning and human-robot interaction.
• Demo: Modification, improvement and demonstration.
FaceSubsys:
• RealT: Speed up face detection for real-time applications.
• Desg: Design the face tracking module using motion prediction.
• Inteff: Integration of face detection and face tracking. Construct face models.
• Recog: Build the face recognition module.
• Inte: System integration.
• Demo: System integration/demonstration.
Planner:
• Design: Design the planner to integrate the SAIL planner with other existing technologies.
• Implement: Implement the planner based on our prior DARPA work.
• Evaluate: Evaluate the strengths and weaknesses of the proposed planner.
• Improve: Test and improve the planner with real-world planning programs to study the performance.
• Demo: Modification, improvement and demonstration.
Servo:
• Paradigm: Develop a human/machine cooperative paradigm to optimally map a task to heterogeneous human and machine functions.
• Frame: Design a perceptive action reference frame for modeling an integrated human/machine system.
• Hetero: Develop a heterogeneous function-based cooperative scheme to combine autonomous planning/control with human reasoning/command in a compatible and complementary manner.
• Interface: Develop a user-friendly human/robot interface to implement the human/machine cooperative planning and control methods.
• Demo: System integration and demonstration.
  34. 34. G Deliverables Description
Integration:
• Design documentation.
• Test data on our platforms, including Nomad 2000, SAIL and Dav.
• Integrated CACI software. It will be the first cross-platform, cross-task integrated system for autonomous robots.
Evaluation:
• Updated literature survey about performance evaluation.
• Documentation of the evaluation criteria.
• Complete evaluation data: not just what a robot can do, but also what it cannot do now and why.
3Dladars:
• VC++ based ladar simulator and multi-look automatic target recognition software.
• VC++ based 3-D map generation and terrain classification software. VC++ based dynamic robot positioning algorithm.
3Dstereo:
• Year 1: High quality stereoscopic 3D spatial grid mapping code able to process 10,000 range values from trinocular glimpses in under 10 seconds at 2,000 MIPS.
• Year 2: Autonomous and interactive robot navigation code using 3D grid maps able to drive a robot at least one meter per five seconds at 5,000 MIPS.
Perception:
• Perceptual learning software for vision, audition, touch and behaviors.
• Test data for the software for vision, audition and touch.
• Documentation about the software and sample test data.
FaceSubsys:
• C++/C based face detection software
• C++/C based face tracking software
• C++/C based face modeling software
• C++/C based face matching software
Planner:
• Perception-based planner software for vision, audition and touch.
• Test data for the planner software.
• Documentation about the planner software.
Servo:
• Methodologies to optimally map a task to heterogeneous human and machine functions;
• Methods and algorithm to compute a perceptive action reference frame for modeling an integrated human/machine system;
  35. 35. • Heterogeneous function-based cooperative schemes to combine autonomous planning/control with human reasoning/command in a compatible and complementary manner.
• Software for human/robot interfaces and related documentation.
Patent: “Developmental Learning Machine and Method,” US patent No. 6,353,814, filed Oct. 7, 1998 and granted March 5, 2002. Inventor: J. Weng; Assignee: MSU.
  36. 36. H Technology Transition and Technology Transfer Targets and Plans
We plan commercial development of the navigation code into highly autonomous and reliable industrial products for factory transport, floor cleaning and security patrol. The enterprise would welcome the opportunity to apply the techniques to DOD applications, should contracts materialize. Hans Moravec is the point of contact for this commercialization.
We plan to commercialize the Dav humanoid robot platform. The intended users are research institutions, universities, and industrial plants where the environments are not suited for humans to stay in for a long time. John Weng is the contact person for this commercialization.
We also plan to commercialize the CACI software for all types of autonomous robots. The plug-and-play feature is expected to attract many robot users. The cross-task capability of CACI will fundamentally change the way software is written for autonomous robots. John Weng is the contact person for this commercialization.